# Coding and Thematic Analysis

Created by [Matt Artz](https://www.mattartz.me/) — Advancing AI Anthropology through computational approaches to qualitative research.

## What This Tool Does

This notebook analyzes qualitative data by coding it using both deductive and inductive approaches, and then building those codes up into themes. Rather than manually applying codes to hundreds of text segments and then building themes iteratively, you receive coded data as well as suggested themes.

Building on the foundations established by the Qualitative Codebook Builder and Interview Transcript Semantic Chunker, the notebook processes chunked text by applying the pre-defined deductive codes while inductively identifying emergent codes, and then it constructs themes from both approaches.

## Key Features

- **Data Integration**: Imports codebooks and transcripts from other AI Anthropology Toolkit notebooks
- **Three Coding Approaches**: Choose between deductive, inductive, or hybrid coding methodologies
- **Code Application**: Uses Claude AI for coding across text segments
- **Theme Construction**: Builds hierarchical themes from coded data
- **Export Options**: Produces Excel workbooks and formatted Word documents with themes


## Workflow

1. Import deductive codebook from Qualitative Codebook Builder (CSV format)
2. Import interview chunks from Interview Transcript Semantic Chunker (CSV format)
3. Configure analysis parameters including coding approach and AI model
4. Apply deductive codes using predefined codes (if selected)
5. Discover inductive codes emerging from the data (if selected)
6. Integrate codes and build themes
7. Generate visualizations and export results
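
Steps 1 and 2 expect two CSV files with specific headers. The sketch below checks those headers using only the standard library, so it can run before the notebook's dependencies are installed. The column names are the ones the configuration cell validates; the sample rows and the `missing_columns` helper are hypothetical and not part of the notebook.

```python
import csv
import io

# Column names the notebook's configuration cell checks for.
CODEBOOK_COLUMNS = ["code_label", "definition", "inclusion_criteria",
                    "exclusion_criteria", "example_1", "example_2"]
TRANSCRIPT_COLUMNS = ["chunk_id", "text"]


def missing_columns(csv_text, required):
    """Return the required column names absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical sample files, for illustration only.
sample_codebook = (
    "code_label,definition,inclusion_criteria,exclusion_criteria,example_1,example_2\n"
    "TRUST,Expressions of trust,Explicit mentions of trust,Generic praise,"
    "\"I trust my team\",\"We rely on each other\"\n"
)
sample_transcript = "chunk_id,text\nP01_001,I trust my team to flag problems early.\n"
incomplete_transcript = "chunk_id,speaker\nP01_001,Interviewer\n"

print(missing_columns(sample_codebook, CODEBOOK_COLUMNS))
print(missing_columns(sample_transcript, TRANSCRIPT_COLUMNS))
print(missing_columns(incomplete_transcript, TRANSCRIPT_COLUMNS))
```

An empty list means the file has every required column; anything else names the columns to add before uploading.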

## Applications

This tool supports qualitative research from dissertation fieldwork to applied research projects. It's particularly useful for computational analysis using the tools in my AI Anthropology Toolkit, enabling researchers to maintain rigor while working at scale.

## Methodological Positioning

This represents "computational anthropology": using AI to enhance rather than replace traditional qualitative methods. The tool handles the time-intensive mechanics of coding and thematic analysis, leaving interpretation and meaning-making to the researcher.

**Important:** AI can assist with pattern recognition and consistency, but human expertise remains essential for contextual understanding.

## Target Audience

Designed for anthropologists and other qualitative researchers, from graduate students managing thesis interviews to research teams processing large datasets for applied projects.

## Technical Approach

Combines natural language processing with anthropological coding principles, using Claude AI to apply codes consistently while discovering emergent patterns through computational analysis.

## Contributing to AI Anthropology

This notebook contributes to the emerging field of AI Anthropology—which combines studying AI as cultural artifact, using AI to enhance ethnographic research, and applying anthropological insights to AI development (Artz, forthcoming). By open-sourcing these tools, this work advances the collective capacity of anthropologists to work effectively with computational methods.

## AI Anthropology Toolkit

This tool is part of a growing suite of computational resources for anthropological research:


- **[Qualitative Codebook Builder](https://github.com/MattArtzAnthro/Qualitative_Codebook_Builder)** - AI-assisted development of qualitative coding frameworks
- **[Interview Transcript Semantic Chunker](https://github.com/MattArtzAnthro/Interview_Transcript_Semantic_Chunker)** - AI-assisted segmentation of interview transcripts
- **[Coding and Thematic Analysis](https://github.com/MattArtzAnthro/Coding_and_Thematic_Analysis)** (this tool) - AI-assisted coding and thematic analysis of qualitative data

*Additional tools will be added to this toolkit as they are developed.*


<br>

---

<br>

## License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You may remix, adapt, and build upon the material for non-commercial purposes, provided you credit Matt Artz and link to the repository.

**Full license details**: https://creativecommons.org/licenses/by-nc/4.0/

## Attribution   

If you use or adapt this project in your work, please cite:


> Built with the Coding and Thematic Analysis (Matt Artz, 2025) — https://github.com/MattArtzAnthro/Coding_and_Thematic_Analysis


## Citation

If you use this tool in your academic research, please cite:


> Artz, Matt. 2025. Coding and Thematic Analysis. Software.
Zenodo. https://doi.org/10.5281/zenodo.15832611

## References
Artz, Matt. Forthcoming. “AI Anthropology: The Future of Applied Anthropological Practice.” In Routledge Handbook of Applied Anthropology, edited by Christina Wasson, Edward B. Liebow, Karine L. Narahara, Ndukuyakhe Ndlovu, and Alaka Wali. New York: Routledge.

## Setup and Installation

Install required Python packages and import necessary libraries for mixed-method qualitative analysis. This includes AI integration, data processing, visualization, and export capabilities.

In [None]:
# Install required packages
!pip install anthropic pandas numpy matplotlib seaborn wordcloud plotly networkx openpyxl python-docx ipywidgets scikit-learn -q

import pandas as pd
import numpy as np
import re
import json
import time
import os
from collections import Counter, defaultdict
from datetime import datetime
from typing import List, Dict, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import plotly.express as px
import plotly.graph_objects as go
import networkx as nx

# Document generation
from docx import Document
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.style import WD_STYLE_TYPE

# UI components
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output

# Anthropic for Claude API
import anthropic

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("📊 AI-Assisted Qualitative Coding Analyzer")
print("=" * 50)
print(f"Setup completed: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n✅ All dependencies installed successfully!")

## Configuration Interface

Configure all analysis parameters including file uploads, coding approach, AI model selection, and API settings through an intuitive interface.

In [None]:
# Global configuration storage
config = {
    'codebook_df': None,
    'transcript_df': None,
    'coding_approach': 'hybrid',
    'ai_model': 'claude-sonnet-4-20250514',  # Anthropic's ID for Claude Sonnet 4
    'api_key': '',
    'output_folder': 'coding_analysis_output',
    'timestamp': datetime.now().strftime('%Y%m%d_%H%M%S')
}

# Create output folder
os.makedirs(config['output_folder'], exist_ok=True)

# File upload widgets
codebook_upload = widgets.FileUpload(
    accept='.csv',
    multiple=False,
    description='Codebook CSV:',
    button_style='primary',
    style={'button_color': '#6096BA'}
)

transcript_upload = widgets.FileUpload(
    accept='.csv',
    multiple=False,
    description='Transcript CSV:',
    button_style='primary',
    style={'button_color': '#6096BA'}
)

# Coding approach selection
coding_approach = widgets.RadioButtons(
    options=['Deductive Only', 'Inductive Only', 'Hybrid (Both)'],
    value='Hybrid (Both)',
    description='',
    style={'description_width': 'initial'}
)

# AI model selection
ai_model_dropdown = widgets.Dropdown(
    options=[
        # Each entry is (label shown in the UI, model ID sent to the API)
        ('Claude 4 Sonnet (Latest)', 'claude-sonnet-4-20250514'),
        ('Claude 3.5 Sonnet', 'claude-3-5-sonnet-20241022')
    ],
    value='claude-sonnet-4-20250514',
    description='Claude Model:',
    style={'description_width': 'initial'}
)

# API key input
api_key_input = widgets.Password(
    placeholder='Enter your Anthropic API key',
    description='API Key:',
    style={'description_width': 'initial'}
)

# Status output
status_output = widgets.Output()

# Process buttons
test_button = widgets.Button(
    description='🧪 Test Setup',
    button_style='info',
    style={'button_color': '#A3CEF1'}
)

process_button = widgets.Button(
    description='✅ Apply Configuration',
    button_style='success',
    style={'button_color': '#6096BA'}
)

def test_setup(b):
    with status_output:
        clear_output(wait=True)
        print("🧪 Testing configuration...")

        # Test file uploads
        codebook_ok = bool(codebook_upload.value)
        transcript_ok = bool(transcript_upload.value)
        api_ok = bool(api_key_input.value)

        print(f"📊 Codebook file: {'✅' if codebook_ok else '❌'}")
        print(f"📝 Transcript file: {'✅' if transcript_ok else '❌'}")
        print(f"🔑 API Key: {'✅' if api_ok else '❌'}")

        if codebook_ok and transcript_ok and api_ok:
            print("\n✅ All tests passed! Ready to start analysis.")
        else:
            print("\n❌ Please address the issues above before proceeding.")

def load_files(b):
    with status_output:
        clear_output(wait=True)

        try:
            # Load codebook
            if codebook_upload.value:
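                # ipywidgets 7 exposes uploads as a dict keyed by filename; ipywidgets 8 uses a tuple of dicts
                uploads = codebook_upload.value
                uploaded_file = list(uploads.values())[0] if isinstance(uploads, dict) else uploads[0]
                config['codebook_df'] = pd.read_csv(pd.io.common.BytesIO(uploaded_file['content']))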
                print(f"✅ Codebook loaded: {len(config['codebook_df'])} codes")

                # Validate codebook columns
                required_cols = ['code_label', 'definition', 'inclusion_criteria',
                               'exclusion_criteria', 'example_1', 'example_2']
                missing_cols = [col for col in required_cols if col not in config['codebook_df'].columns]
                if missing_cols:
                    print(f"⚠️ Warning: Missing columns in codebook: {missing_cols}")
            else:
                print("❌ Please upload a codebook CSV file")
                return

            # Load transcripts
            if transcript_upload.value:
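                # ipywidgets 7 exposes uploads as a dict keyed by filename; ipywidgets 8 uses a tuple of dicts
                uploads = transcript_upload.value
                uploaded_file = list(uploads.values())[0] if isinstance(uploads, dict) else uploads[0]
                config['transcript_df'] = pd.read_csv(pd.io.common.BytesIO(uploaded_file['content']))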
                print(f"✅ Transcripts loaded: {len(config['transcript_df'])} chunks")

                # Validate transcript columns
                if 'chunk_id' not in config['transcript_df'].columns or 'text' not in config['transcript_df'].columns:
                    print("❌ Error: Transcript file must have 'chunk_id' and 'text' columns")
                    return
            else:
                print("❌ Please upload a transcript CSV file")
                return

            # Update configuration
            config['coding_approach'] = coding_approach.value.lower()
            config['ai_model'] = ai_model_dropdown.value
            config['api_key'] = api_key_input.value

            # Validate API key
            if not config['api_key']:
                print("❌ Please enter your Anthropic API key")
                return

            print("\n✅ Configuration complete! Ready to start analysis.")
            print(f"\n📁 Output folder: {config['output_folder']}")
            print(f"🔧 Coding approach: {config['coding_approach']}")
            print(f"🧠 AI Model: {ai_model_dropdown.label}")

        except Exception as e:
            print(f"❌ Error loading files: {e}")

test_button.on_click(test_setup)
process_button.on_click(load_files)

# Header section with title and how-to-use
header_section = widgets.HTML("""
<div style="background-color: #E7ECEF; padding: 20px; border-radius: 10px; border-left: 5px solid #274C77; margin-bottom: 20px;">
    <h2 style="color: #274C77; margin: 0 0 10px 0;">🎯 Qualitative Coding Analyzer</h2>
    <p><strong>Welcome!</strong> This tool helps social scientists systematically apply codes to interview transcripts and qualitative data using AI-powered analysis.</p>

    <h4 style="color: #274C77; margin: 15px 0 10px 0;">📋 How to Use:</h4>
    <ol style="margin: 0; padding-left: 20px;">
        <li><strong>Configure:</strong> Choose a coding approach and AI model, and enter your API key below</li>
        <li><strong>Upload:</strong> Add your codebook and chunked transcript CSV files</li>
        <li><strong>Process:</strong> Run the AI-powered coding analysis</li>
        <li><strong>Export:</strong> Download your coded data with applied qualitative codes</li>
    </ol>
</div>
""")

# Configuration guide section
guide_section = widgets.HTML("""
<div style="background-color: #A3CEF1; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
    <h4 style="color: #274C77; margin: 0 0 15px 0;">📚 Configuration Guide</h4>

    <div style="display: flex; gap: 20px; flex-wrap: wrap;">
        <div style="flex: 1; min-width: 300px;">
            <strong>• Coding Approaches:</strong>
            <ul style="margin: 5px 0 15px 20px;">
                <li><strong>Deductive:</strong> Apply pre-defined codes from your codebook - best when you have established theoretical frameworks</li>
                <li><strong>Inductive:</strong> Generate new codes by discovering emergent themes - ideal for exploratory research</li>
                <li><strong>Hybrid:</strong> Combine both approaches to apply existing codes and discover new themes (recommended)</li>
            </ul>

            <strong>• AI Models:</strong>
            <ul style="margin: 5px 0 15px 20px;">
                <li><strong>Claude 4 Sonnet:</strong> Latest model with enhanced reasoning capabilities (recommended)</li>
                <li><strong>Claude 3.5 Sonnet:</strong> Proven performance with excellent qualitative analysis</li>
            </ul>
        </div>

        <div style="flex: 1; min-width: 300px;">
            <strong>• Input File Requirements:</strong>
            <ul style="margin: 5px 0 15px 20px;">
                <li><strong>Codebook CSV:</strong> Input file from the Qualitative Codebook Builder notebook (or manual creation) with code_label, definition, inclusion_criteria, exclusion_criteria, example_1, example_2 columns</li>
                <li><strong>Transcript CSV:</strong> Input file from Interview Transcript Semantic Chunker notebook with chunk_id and text columns</li>
                <li><strong>Format:</strong> UTF-8 encoded CSV files with proper headers</li>
            </ul>
        </div>
    </div>
</div>
""")

# File uploads section with actions
file_uploads_section = widgets.VBox([
    widgets.HTML("<h4 style='color: #274C77;'>📁 File Uploads</h4>"),
    codebook_upload,
    transcript_upload,
    widgets.HTML("<br>"),
    widgets.HTML("<h4 style='color: #274C77;'>🚀 Actions</h4>"),
    widgets.HBox([process_button, test_button], layout=widgets.Layout(gap='10px'))
])

# Coding approach section
coding_section = widgets.VBox([
    widgets.HTML("<h4 style='color: #274C77;'>🔍 Coding Approach</h4>"),
    coding_approach
])

# AI Configuration section
ai_section = widgets.VBox([
    widgets.HTML("<h4 style='color: #274C77;'>🧠 AI Configuration</h4>"),
    ai_model_dropdown,
    widgets.HTML("<br>"),
    api_key_input
])

# Main configuration container - three columns in one row with even spacing
main_config = widgets.HBox([
    file_uploads_section,
    coding_section,
    ai_section
], layout=widgets.Layout(justify_content='space-between', width='75%'))

# Display the complete interface
display(widgets.VBox([
    header_section,
    guide_section,
    main_config,
    status_output
]))
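
Workflow steps 4 and 5 run only when the selected approach calls for them. One way to make that decision explicit is to map the stored approach string to a pair of flags. This is a sketch, not the notebook's own logic: the `passes_for` helper and its return shape are hypothetical; only the three option labels and the `'hybrid'` default come from the cell above.

```python
# Which coding passes each approach needs, keyed by the first word of the label.
APPROACH_PASSES = {
    "deductive": {"deductive": True, "inductive": False},
    "inductive": {"deductive": False, "inductive": True},
    "hybrid": {"deductive": True, "inductive": True},
}


def passes_for(approach_label):
    """Return which coding passes to run for an approach label.

    Accepts the radio-button labels ('Deductive Only', 'Inductive Only',
    'Hybrid (Both)'), their lowercase forms, and the bare default 'hybrid'.
    """
    words = approach_label.strip().lower().split()
    key = words[0] if words else ""
    if key not in APPROACH_PASSES:
        raise ValueError(f"Unknown coding approach: {approach_label!r}")
    return dict(APPROACH_PASSES[key])  # copy so callers cannot mutate the table


print(passes_for("Hybrid (Both)"))
print(passes_for("deductive only"))
print(passes_for("hybrid"))
```

Keying on the first word keeps the lookup tolerant of both the lowercase label that `load_files` stores and the shorter default in `config`.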

## Core Analysis Classes

Define the core classes for deductive coding, AI integration, and analysis functions that power the coding process.
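
The deductive coder in this section asks the model for a comma-separated list of code labels and then discards any label that is not in the codebook. A standalone sketch of that parse-and-validate step follows, so it can be tried without an API key. `parse_code_reply` is a hypothetical helper that mirrors the logic of `validate_codes` rather than reproducing the class, and the code labels are invented.

```python
def parse_code_reply(raw_reply, known_codes):
    """Split a comma-separated model reply into valid and invalid codes."""
    reply = raw_reply.strip()
    # The coding prompt asks for the literal string NO_CODES when nothing applies.
    if not reply or reply.upper() == "NO_CODES":
        return {"valid": [], "invalid": [], "all_valid": True}

    known = set(known_codes)
    valid, invalid = [], []
    for code in (part.strip() for part in reply.split(",")):
        if not code:
            continue  # skip empty entries left by trailing commas
        (valid if code in known else invalid).append(code)

    return {"valid": valid, "invalid": invalid, "all_valid": not invalid}


codebook_labels = ["TRUST", "WORKLOAD", "AUTONOMY"]  # invented labels
print(parse_code_reply("TRUST, AUTONOMY", codebook_labels))
print(parse_code_reply("TRUST,BURNOUT", codebook_labels))
print(parse_code_reply("no_codes", codebook_labels))
```

Unlike the class method, this sketch drops empty entries instead of reporting them as invalid; whether that is preferable depends on how strictly malformed replies should be flagged.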

In [None]:
class DeductiveCoder:
    """
    Handles deductive coding using a predefined codebook.
    """

    def __init__(self, codebook_df):
        self.codebook_df = codebook_df
        self.code_dict = self._build_code_dict()
        self.coding_history = []

    def _build_code_dict(self):
        """Build a dictionary structure from the codebook."""
        code_dict = {}

        for _, row in self.codebook_df.iterrows():
            code = row['code_label']
            code_dict[code] = {
                'definition': row['definition'],
                'inclusion': row.get('inclusion_criteria', ''),
                'exclusion': row.get('exclusion_criteria', ''),
                'examples': [row.get('example_1', ''), row.get('example_2', '')]
            }

        return code_dict

    def display_codebook_summary(self):
        """Display an organized summary of the codebook."""
        print("\n📚 CODEBOOK REFERENCE")
        print("=" * 60)

        for code, details in self.code_dict.items():
            print(f"\n🏷️  {code}")
            print(f"   Definition: {details['definition']}")
            if details['inclusion']:
                print(f"   ✓ Include: {details['inclusion']}")
            if details['exclusion']:
                print(f"   ✗ Exclude: {details['exclusion']}")

    def validate_codes(self, codes_list):
        """Validate that provided codes exist in the codebook."""
        valid_codes = []
        invalid_codes = []

        all_codes = list(self.code_dict.keys())

        for code in codes_list:
            if code.strip() in all_codes:
                valid_codes.append(code.strip())
            else:
                invalid_codes.append(code.strip())

        return {
            'valid': valid_codes,
            'invalid': invalid_codes,
            'all_valid': len(invalid_codes) == 0
        }


class ClaudeAutoCoder:
    """
    Automated deductive coding using the Claude API, with a keyword-matching fallback when use_ai is False.
    """

    def __init__(self, api_key: str, codebook_df: pd.DataFrame, coder: DeductiveCoder, use_ai: bool = True):
        self.use_ai = use_ai
        self.codebook_df = codebook_df
        self.coder = coder
        self.coding_history = []

        if use_ai:
            self.client = anthropic.Anthropic(api_key=api_key)
            self.coding_prompt = self._build_coding_prompt()
        else:
            self.client = None
            self.coding_prompt = None

    def _build_coding_prompt(self) -> str:
        """Build the system prompt for Claude with the complete codebook."""
        codebook_text = "DEDUCTIVE CODING CODEBOOK:\n\n"

        for code, details in self.coder.code_dict.items():
            codebook_text += f"CODE: {code}\n"
            codebook_text += f"Definition: {details['definition']}\n"
            if details['inclusion']:
                codebook_text += f"Include when: {details['inclusion']}\n"
            if details['exclusion']:
                codebook_text += f"Exclude when: {details['exclusion']}\n"
            codebook_text += "\n"

        prompt = f"""You are a qualitative research assistant specializing in deductive coding. Your task is to analyze text segments and identify which codes from the codebook apply.

{codebook_text}

CODING INSTRUCTIONS:
1. Read each text segment carefully
2. Apply ALL relevant codes from the codebook
3. Only use codes that are explicitly defined above
4. Return codes as a comma-separated list (e.g., "CODE1,CODE2,CODE3")
5. If no codes apply, return "NO_CODES"
6. Be consistent - similar content should receive similar codes
7. Focus on manifest content (what is explicitly stated)

Return only the comma-separated codes, no explanation needed."""

        return prompt

    def code_single_chunk(self, text: str, chunk_id: str = None) -> Dict:
        """Code a single text chunk."""
        if pd.isna(text) or str(text).strip() == '':
            return {
                'chunk_id': chunk_id,
                'codes': '',
                'valid': True,
                'error': None
            }

        if self.use_ai:
            return self._code_with_ai(text, chunk_id)
        else:
            return self._code_without_ai(text, chunk_id)

    def _code_with_ai(self, text: str, chunk_id: str) -> Dict:
        """Code using Claude API."""
        try:
            response = self.client.messages.create(
                model=config['ai_model'],
                max_tokens=150,
                temperature=0.1,
                system=self.coding_prompt,
                messages=[{
                    "role": "user",
                    "content": f"Code this text: {text}"
                }]
            )

            raw_codes = response.content[0].text.strip()

            if raw_codes.upper() == "NO_CODES":
                return {
                    'chunk_id': chunk_id,
                    'codes': '',
                    'valid': True,
                    'error': None
                }

            # Validate codes
            codes_list = [code.strip() for code in raw_codes.split(',')]
            validation = self.coder.validate_codes(codes_list)

            return {
                'chunk_id': chunk_id,
                'codes': ','.join(validation['valid']),
                'valid': validation['all_valid'],
                'error': f"Invalid codes removed: {validation['invalid']}" if validation['invalid'] else None
            }

        except Exception as e:
            return {
                'chunk_id': chunk_id,
                'codes': '',
                'valid': False,
                'error': f"API Error: {str(e)}"
            }

    def _code_without_ai(self, text: str, chunk_id: str) -> Dict:
        """Simple keyword matching for non-AI coding."""
        text_lower = text.lower()
        matched_codes = []

        for code, details in self.coder.code_dict.items():
            # Simple keyword matching from definition and examples
            keywords = []
            keywords.extend(details['definition'].lower().split())
            keywords.extend([ex.lower() for ex in details['examples'] if ex])

            # Check if any keyword appears in text
            if any(keyword in text_lower for keyword in keywords if len(keyword) > 3):
                matched_codes.append(code)

        return {
            'chunk_id': chunk_id,
            'codes': ','.join(matched_codes),
            'valid': True,
            'error': None
        }

    def code_batch(self, df: pd.DataFrame, text_column: str = 'text',
                   chunk_id_column: str = 'chunk_id', delay_seconds: float = 1.0) -> pd.DataFrame:
        """Code a batch of text chunks."""
        total_chunks = len(df)

        print(f"\n🤖 Starting Deductive Coding")
        print(f"Processing {total_chunks} text chunks")
        print("-" * 50)

        # Add columns for deductive coding
        if 'Deductive_Codes' not in df.columns:
            df['Deductive_Codes'] = ''
        if 'Coding_Status' not in df.columns:
            df['Coding_Status'] = ''

        successful_codes = 0
        failed_codes = 0

        # Track the loop position separately: idx is the DataFrame label and may not run 0..n-1
        for position, (idx, row) in enumerate(df.iterrows(), start=1):
            text = row[text_column]
            chunk_id = row[chunk_id_column]

            print(f"Coding chunk {position}/{total_chunks} ({(position/total_chunks*100):.1f}%)", end=" ")

            result = self.code_single_chunk(text, chunk_id)

            # Update DataFrame
            df.at[idx, 'Deductive_Codes'] = result['codes']
            df.at[idx, 'Coding_Status'] = 'Deductive_Coded' if result['codes'] else 'No_Deductive_Codes'

            if result['valid'] and not result['error']:
                successful_codes += 1
                print("✅")
            else:
                failed_codes += 1
                print(f"⚠️  {result['error']}")

            # Pause between API calls, but not after the last chunk
            if self.use_ai and position < total_chunks:
                time.sleep(delay_seconds)

        print(f"\n✅ Deductive coding completed!")
        print(f"Successfully coded: {successful_codes}")
        print(f"Failed/partial: {failed_codes}")

        return df
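
After a batch run, each row holds its codes as one comma-separated string in `Deductive_Codes`. A quick frequency count over those strings is a useful sanity check before theme building. The sketch works on a plain list of strings so it runs without pandas; with a DataFrame, the list could come from `df['Deductive_Codes'].tolist()`. The `count_codes` helper and the sample values are hypothetical.

```python
from collections import Counter


def count_codes(code_strings):
    """Count how often each code appears across comma-separated strings."""
    counts = Counter()
    for entry in code_strings:
        # Uncoded chunks are stored as empty strings; skip them and any non-strings (e.g. NaN).
        if not isinstance(entry, str) or not entry.strip():
            continue
        counts.update(code.strip() for code in entry.split(",") if code.strip())
    return counts


# Invented values standing in for the Deductive_Codes column.
deductive_codes = ["TRUST,WORKLOAD", "", "TRUST", "AUTONOMY,TRUST"]
for code, n in count_codes(deductive_codes).most_common():
    print(f"{code}: {n}")
```

A code that never appears, or one that appears on nearly every chunk, is often a sign that its definition or inclusion criteria need another look.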

## Inductive Coding Classes

Define classes for discovering emergent themes and patterns not captured in the deductive codebook.
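
The inductive coder asks the model to return each code in a fixed block layout and then parses that text with regular expressions. A compact sketch of the same idea on an invented reply is given here. `parse_code_blocks` is a hypothetical, simplified stand-in for the class's parser, and the sample reply is made up rather than real model output.

```python
import re

# Output key -> field label used in the prompt's block layout.
FIELDS = {
    "definition": "Definition",
    "rationale": "Rationale",
    "example": "Example",
    "application": "When to Apply",
}


def parse_code_blocks(reply):
    """Parse '**INDUCTIVE CODE: NAME**' blocks into a dict of field values."""
    # re.split with one capture group yields [preamble, name1, body1, name2, body2, ...].
    parts = re.split(r"\*\*INDUCTIVE CODE:\s*(.+?)\*\*[ \t]*\n", reply, flags=re.IGNORECASE)
    labels = "|".join(re.escape(label) for label in FIELDS.values())
    codes = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        entry = {}
        for key, label in FIELDS.items():
            # Capture up to the next field label at the start of a line, or the end of the block.
            match = re.search(
                rf"^{re.escape(label)}:\s*(.+?)(?=^(?:{labels}):|\Z)",
                body,
                flags=re.DOTALL | re.MULTILINE,
            )
            entry[key] = match.group(1).strip() if match else ""
        codes[name.strip()] = entry
    return codes


sample_reply = """**INDUCTIVE CODE: INFORMAL_MENTORING**
Definition: Unplanned guidance between peers.
Rationale: Recurs across interviews but is absent from the codebook.
Example: "She just showed me how it was done."
When to Apply: The speaker describes learning from a colleague outside formal training.

**INDUCTIVE CODE: TOOL_WORKAROUNDS**
Definition: Improvised fixes for software limits.
Rationale: Signals a gap between design and practice.
Example: "We keep a spreadsheet on the side."
When to Apply: The speaker describes bypassing an official system.
"""

for name, fields in parse_code_blocks(sample_reply).items():
    print(name, "->", fields["definition"])
```

Parsing free text this way is brittle if the model departs from the layout, so a reply that yields zero codes is worth inspecting by hand before rerunning.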

In [None]:
class ClaudeInductiveCoder:
    """
    Use Claude API for inductive coding to discover emergent themes.
    """

    def __init__(self, claude_coder: ClaudeAutoCoder, coded_df: pd.DataFrame):
        self.claude_coder = claude_coder
        self.coded_df = coded_df
        self.client = claude_coder.client
        self.use_ai = claude_coder.use_ai
        self.discovered_codes = {}

    def generate_inductive_codes(self, sample_size: int = 50) -> Dict:
        """Discover emergent codes from the data."""
        print(f"\n🔍 GENERATING INDUCTIVE CODES")
        print("=" * 40)

        # Sample chunks for analysis
        all_chunks = self.coded_df[self.coded_df['text'].notna()]
        sample_chunks = all_chunks.sample(min(sample_size, len(all_chunks)))

        print(f"📊 Analyzing {len(sample_chunks)} chunks for emergent patterns...")

        if self.use_ai:
            return self._generate_with_ai(sample_chunks)
        else:
            return self._generate_without_ai(sample_chunks)

    def _generate_with_ai(self, sample_chunks: pd.DataFrame) -> Dict:
        """Generate inductive codes using AI."""
        # Prepare chunks with their deductive codes for context
        chunks_text = ""
        for idx, row in sample_chunks.iterrows():
            text = str(row['text'])[:400] + "..." if len(str(row['text'])) > 400 else str(row['text'])
            deductive_codes = row.get('Deductive_Codes', '')
            chunks_text += f"\nChunk {row['chunk_id']}:\n{text}\nDeductive codes: {deductive_codes}\n---\n"

        inductive_prompt = f"""You are conducting INDUCTIVE CODING on interview transcripts.
Your task is to identify EMERGENT THEMES that are NOT captured by the existing deductive codes.

SAMPLE CHUNKS FOR ANALYSIS:
{chunks_text}

TASK: Identify 8-12 EMERGENT INDUCTIVE CODES that capture important patterns NOT covered by deductive codes.

For each code provide:
**INDUCTIVE CODE: [SHORT_NAME]**
Definition: [Clear description]
Rationale: [Why this is important]
Example: "[Direct quote]"
When to Apply: [Clear criteria]

Ensure there is a blank line between codes."""

        try:
            response = self.client.messages.create(
                model=config['ai_model'],
                max_tokens=3000,
                temperature=0.4,
                messages=[{
                    "role": "user",
                    "content": inductive_prompt
                }]
            )

            inductive_analysis = response.content[0].text
            self.discovered_codes = self._parse_inductive_codes(inductive_analysis)

            print(f"✅ Found {len(self.discovered_codes)} inductive codes")

            return {
                'inductive_analysis': inductive_analysis,
                'discovered_codes': self.discovered_codes,
                'sample_size': len(sample_chunks)
            }

        except Exception as e:
            print(f"❌ Error in inductive analysis: {e}")
            return {'error': str(e)}

    def _generate_without_ai(self, sample_chunks: pd.DataFrame) -> Dict:
        """Generate inductive codes using frequency analysis."""
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        # Simple TF-IDF based theme discovery
        texts = sample_chunks['text'].fillna('').tolist()

        try:
            vectorizer = TfidfVectorizer(max_features=50, stop_words='english', ngram_range=(1, 3))
            tfidf_matrix = vectorizer.fit_transform(texts)

            # Get top terms
            feature_names = vectorizer.get_feature_names_out()

            # Simple clustering; keep at least one cluster so samples under five chunks do not fail
            n_clusters = max(1, min(8, len(texts) // 5))
            kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
            clusters = kmeans.fit_predict(tfidf_matrix)

            # Generate codes from clusters
            for i in range(n_clusters):
                cluster_terms = []
                cluster_docs = [texts[j] for j in range(len(texts)) if clusters[j] == i]

                if cluster_docs:
                    # Get representative terms
                    cluster_vectorizer = TfidfVectorizer(max_features=5, stop_words='english')
                    cluster_tfidf = cluster_vectorizer.fit_transform(cluster_docs)
                    terms = cluster_vectorizer.get_feature_names_out()

                    code_name = f"THEME_{i+1}_{terms[0].upper()}"
                    self.discovered_codes[code_name] = {
                        'definition': f"Theme related to {', '.join(terms[:3])}",
                        'rationale': "Emerged from clustering analysis",
                        'example': cluster_docs[0][:100] + "...",
                        'application': f"Apply when text mentions {terms[0]}"
                    }

            print(f"✅ Found {len(self.discovered_codes)} inductive themes through clustering")

            return {
                'discovered_codes': self.discovered_codes,
                'sample_size': len(sample_chunks)
            }

        except Exception as e:
            print(f"❌ Error in clustering: {e}")
            # Fallback to simple frequency analysis
            return self._simple_frequency_codes(sample_chunks)

    def _simple_frequency_codes(self, sample_chunks: pd.DataFrame) -> Dict:
        """Fallback method using simple frequency analysis."""
        from collections import Counter
        import re

        # Extract common phrases
        all_text = ' '.join(sample_chunks['text'].fillna('').tolist()).lower()
        words = re.findall(r'\b\w+\b', all_text)

        # Filter common words
        stop_words = {'the', 'is', 'at', 'which', 'on', 'and', 'a', 'an', 'as', 'are', 'was', 'were', 'i', 'you', 'he', 'she', 'it', 'they', 'we'}
        words = [w for w in words if w not in stop_words and len(w) > 3]

        # Get most common words
        word_freq = Counter(words).most_common(20)

        # Create simple codes
        for i, (word, freq) in enumerate(word_freq[:8]):
            code_name = f"FREQ_{word.upper()}"
            self.discovered_codes[code_name] = {
                'definition': f"References to {word} (appeared {freq} times)",
                'rationale': "High frequency term in the data",
                'example': f"Text containing '{word}'",
                'application': f"Apply when text mentions {word}"
            }

        print(f"✅ Found {len(self.discovered_codes)} frequency-based codes")

        return {
            'discovered_codes': self.discovered_codes,
            'sample_size': len(sample_chunks)
        }

    def _parse_inductive_codes(self, analysis_text):
        """Parse inductive codes from Claude's analysis."""
        codes = {}

        if not analysis_text:
            return codes

        # Use regex to find each code block
        code_blocks = re.split(r'\*\*INDUCTIVE CODE: (.+?)\*\*\s*\n', analysis_text, flags=re.DOTALL | re.IGNORECASE)

        if len(code_blocks) < 2:
            return codes

        for i in range(1, len(code_blocks), 2):
            code_name = code_blocks[i].strip()
            block_content = code_blocks[i+1].strip() if i+1 < len(code_blocks) else ''

            codes[code_name] = {
                'definition': '',
                'rationale': '',
                'example': '',
                'application': ''
            }

            # Extract fields
            definition_match = re.search(r'Definition: (.+?)(?:\nRationale:|\nExample:|\nWhen to Apply:|$)', block_content, re.DOTALL)
            if definition_match:
                codes[code_name]['definition'] = definition_match.group(1).strip()

            rationale_match = re.search(r'Rationale: (.+?)(?:\nDefinition:|\nExample:|\nWhen to Apply:|$)', block_content, re.DOTALL)
            if rationale_match:
                codes[code_name]['rationale'] = rationale_match.group(1).strip()

            example_match = re.search(r'Example: (.+?)(?:\nDefinition:|\nRationale:|\nWhen to Apply:|$)', block_content, re.DOTALL)
            if example_match:
                codes[code_name]['example'] = example_match.group(1).strip()

            application_match = re.search(r'When to Apply: (.+?)(?:\nDefinition:|\nRationale:|\nExample:|$)', block_content, re.DOTALL)
            if application_match:
                codes[code_name]['application'] = application_match.group(1).strip()

        return codes

    def apply_inductive_codes(self, delay_seconds: float = 1.0) -> pd.DataFrame:
        """Apply discovered inductive codes to all chunks."""
        print(f"\n📝 APPLYING INDUCTIVE CODES")
        print("-" * 50)

        if not self.discovered_codes:
            print("❌ No inductive codes to apply.")
            return self.coded_df

        # Add inductive coding columns
        if 'Inductive_Codes' not in self.coded_df.columns:
            self.coded_df['Inductive_Codes'] = ''

        total_chunks = len(self.coded_df)
        successful = 0

        if self.use_ai:
            # Create prompt for applying codes
            codes_text = "\n".join([
                f"{code}: {details['definition']} (Apply when: {details['application']})"
                for code, details in self.discovered_codes.items()
            ])

            apply_prompt = f"""Apply these INDUCTIVE CODES to text chunks.

INDUCTIVE CODES:
{codes_text}

Instructions:
1. Apply ONLY codes that clearly match
2. Return codes as comma-separated list
3. If no codes apply, return "NONE"

Return ONLY the code names."""

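            for position, (idx, row) in enumerate(self.coded_df.iterrows(), start=1):
                if pd.notna(row['text']):
                    text = str(row['text'])

                    # Use the loop position, not the index label, so progress is correct for any index
                    print(f"Applying inductive codes {position}/{total_chunks} ({position/total_chunks*100:.1f}%)", end=" ")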

                    try:
                        response = self.client.messages.create(
                            model=config['ai_model'],
                            max_tokens=100,
                            temperature=0.1,
                            system=apply_prompt,
                            messages=[{
                                "role": "user",
                                "content": f"Text: {text[:800]}"
                            }]
                        )

                        result = response.content[0].text.strip()

                        # Treat an empty reply or any casing of "NONE" as no match
                        if result and result.strip(' ."').upper() != "NONE":
                            self.coded_df.at[idx, 'Inductive_Codes'] = result
                            successful += 1
                            print("✅")
                        else:
                            print("⭕")

                        time.sleep(delay_seconds)

                    except Exception as e:
                        print(f"❌ Error: {e}")
                        continue
        else:
            # Apply codes using keyword matching
            for idx, row in self.coded_df.iterrows():
                if pd.notna(row['text']):
                    text = str(row['text']).lower()
                    matched_codes = []

                    for code, details in self.discovered_codes.items():
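                        # Simple keyword matching: any definition word longer than 3 characters
                        # counts as a hit, so generic words in a definition can over-match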
                        keywords = details['definition'].lower().split()
                        if any(keyword in text for keyword in keywords if len(keyword) > 3):
                            matched_codes.append(code)

                    if matched_codes:
                        self.coded_df.at[idx, 'Inductive_Codes'] = ','.join(matched_codes)
                        successful += 1

        print(f"\n✅ Applied inductive codes to {successful} chunks")

        # Update coding status
        for idx, row in self.coded_df.iterrows():
            has_deductive = pd.notna(row.get('Deductive_Codes', '')) and row.get('Deductive_Codes', '') != ''
            has_inductive = pd.notna(row.get('Inductive_Codes', '')) and row.get('Inductive_Codes', '') != ''

            if has_deductive and has_inductive:
                self.coded_df.at[idx, 'Coding_Status'] = 'Both_Deductive_Inductive'
            elif has_deductive:
                self.coded_df.at[idx, 'Coding_Status'] = 'Deductive_Only'
            elif has_inductive:
                self.coded_df.at[idx, 'Coding_Status'] = 'Inductive_Only'
            else:
                self.coded_df.at[idx, 'Coding_Status'] = 'No_Codes'

        return self.coded_df

## Analysis and Theme Building Functions

Functions to analyze code patterns, build themes, and generate insights from the coded data.
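
As a minimal sketch of the bookkeeping these functions perform, the snippet below merges deductive and inductive code strings for a few chunks, tags inductive codes with the `_IND` suffix, and counts code frequencies and code combinations. The sample rows and the helper name `merge_codes` are hypothetical and exist only for illustration; the notebook's own functions work on a pandas DataFrame instead.

```python
from collections import Counter

def merge_codes(deductive: str, inductive: str) -> list[str]:
    """Combine two comma-separated code strings; inductive codes get an _IND suffix."""
    ded = [c.strip() for c in deductive.split(",") if c.strip()]
    ind = [c.strip() + "_IND" for c in inductive.split(",") if c.strip()]
    return ded + ind

# Hypothetical sample: (deductive codes, inductive codes) per chunk
sample_rows = [
    ("trust, access", "workarounds"),
    ("trust", ""),
    ("", "workarounds, humor"),
    ("", ""),
]

merged = [merge_codes(d, i) for d, i in sample_rows]

# Frequency of each code across all chunks
code_freq = Counter(code for codes in merged for code in codes)

# Frequency of each full code combination, for chunks carrying more than one code
combo_freq = Counter(tuple(sorted(codes)) for codes in merged if len(codes) > 1)

print(code_freq.most_common(3))
print(combo_freq)
```

Sorting each combination before counting means the same set of codes is tallied together regardless of the order in which the codes were applied.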

In [None]:
def integrate_coding_approaches(df_coded):
    """
    Create integrated view of all codes for theme building.
    """
    print("\n🔗 INTEGRATING CODING APPROACHES")
    print("-" * 50)

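    # Add integration columns
    df_coded['All_Codes'] = ''
    df_coded['Total_Code_Count'] = 0

    # Coding_Status is normally set during coding; derive it here if it is missing
    # (for example when inductive coding exits early) so the stats below do not fail
    if 'Coding_Status' not in df_coded.columns:
        empty = pd.Series('', index=df_coded.index)
        has_ded = df_coded.get('Deductive_Codes', empty).fillna('').astype(str).str.strip() != ''
        has_ind = df_coded.get('Inductive_Codes', empty).fillna('').astype(str).str.strip() != ''
        df_coded['Coding_Status'] = 'No_Codes'
        df_coded.loc[has_ded & ~has_ind, 'Coding_Status'] = 'Deductive_Only'
        df_coded.loc[~has_ded & has_ind, 'Coding_Status'] = 'Inductive_Only'
        df_coded.loc[has_ded & has_ind, 'Coding_Status'] = 'Both_Deductive_Inductive'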

    for idx, row in df_coded.iterrows():
        all_codes = []

        # Add deductive codes (skip empty entries left by stray or trailing commas)
        if pd.notna(row.get('Deductive_Codes', '')) and row.get('Deductive_Codes', ''):
            deductive_codes = [c.strip() for c in str(row['Deductive_Codes']).split(',') if c.strip()]
            all_codes.extend(deductive_codes)

        # Add inductive codes (marked with _IND suffix)
        if pd.notna(row.get('Inductive_Codes', '')) and row.get('Inductive_Codes', ''):
            inductive_codes = [c.strip() + '_IND' for c in str(row['Inductive_Codes']).split(',') if c.strip()]
            all_codes.extend(inductive_codes)

        # Update integrated columns
        if all_codes:
            df_coded.at[idx, 'All_Codes'] = ', '.join(all_codes)
            df_coded.at[idx, 'Total_Code_Count'] = len(all_codes)

    # Analysis of integration
    integration_stats = {
        'total_chunks': len(df_coded),
        'coded_chunks': len(df_coded[df_coded['Total_Code_Count'] > 0]),
        'deductive_only': len(df_coded[df_coded['Coding_Status'] == 'Deductive_Only']),
        'inductive_only': len(df_coded[df_coded['Coding_Status'] == 'Inductive_Only']),
        'both_types': len(df_coded[df_coded['Coding_Status'] == 'Both_Deductive_Inductive']),
        'no_codes': len(df_coded[df_coded['Coding_Status'] == 'No_Codes'])
    }

    print("\n✅ Integration Complete:")
    print(f"• Total chunks: {integration_stats['total_chunks']}")
    print(f"• Coded chunks: {integration_stats['coded_chunks']}")
    print(f"• Deductive only: {integration_stats['deductive_only']}")
    print(f"• Inductive only: {integration_stats['inductive_only']}")
    print(f"• Both types: {integration_stats['both_types']}")
    print(f"• No codes: {integration_stats['no_codes']}")

    return df_coded, integration_stats


def analyze_code_patterns(df_coded):
    """
    Analyze patterns in the integrated codes.
    """
    print("\n📊 ANALYZING CODE PATTERNS")
    print("-" * 40)

    # Collect all codes
    all_codes_list = []
    deductive_codes_list = []
    inductive_codes_list = []
    code_combinations = defaultdict(int)

    for idx, row in df_coded.iterrows():
        if pd.notna(row['All_Codes']) and row['All_Codes']:
            codes = [c.strip() for c in str(row['All_Codes']).split(',')]
            all_codes_list.extend(codes)

            # Separate deductive and inductive
            deductive = [c for c in codes if not c.endswith('_IND')]
            inductive = [c for c in codes if c.endswith('_IND')]

            deductive_codes_list.extend(deductive)
            inductive_codes_list.extend(inductive)

            # Track combinations
            if len(codes) > 1:
                code_combinations[tuple(sorted(codes))] += 1

    # Calculate frequencies
    all_freq = Counter(all_codes_list)
    deductive_freq = Counter(deductive_codes_list)
    inductive_freq = Counter(inductive_codes_list)

    patterns = {
        'total_code_applications': len(all_codes_list),
        'unique_codes': len(set(all_codes_list)),
        'all_codes_frequency': dict(all_freq.most_common(20)),
        'deductive_frequency': dict(deductive_freq.most_common(15)),
        'inductive_frequency': dict(inductive_freq.most_common(15)),
        'frequent_combinations': dict(sorted(code_combinations.items(), key=lambda x: x[1], reverse=True)[:15])
    }

    print(f"\n📈 Pattern Analysis Results:")
    print(f"• Total code applications: {patterns['total_code_applications']}")
    print(f"• Unique codes used: {patterns['unique_codes']}")

    print(f"\n🏆 Top 10 Most Frequent Codes:")
    for code, freq in list(patterns['all_codes_frequency'].items())[:10]:
        code_type = "[IND]" if code.endswith('_IND') else "[DED]"
        print(f"  • {code} {code_type}: {freq} occurrences")

    return patterns


class IntegratedThemeBuilder:
    """
    Build themes from integrated deductive and inductive codes.
    """

    def __init__(self, claude_coder, coded_df, code_patterns, inductive_results):
        self.claude_coder = claude_coder
        self.coded_df = coded_df
        self.client = claude_coder.client if claude_coder else None
        self.use_ai = claude_coder.use_ai if claude_coder else False
        self.code_patterns = code_patterns
        self.inductive_results = inductive_results

    def build_themes(self):
        """Build hierarchical themes from the integrated analysis."""
        print("\n🎯 BUILDING THEMES FROM INTEGRATED CODES")
        print("=" * 50)

        if self.use_ai:
            return self._build_themes_with_ai()
        else:
            return self._build_themes_without_ai()

    def _build_themes_with_ai(self):
        """Build themes using AI."""
        # Get sample coded chunks
        pattern_examples = self._get_pattern_examples()

        theme_prompt = f"""You are a qualitative research expert building THEMES from mixed-method coding results.

CODING OVERVIEW:
- Total code applications: {self.code_patterns['total_code_applications']}
- Unique codes: {self.code_patterns['unique_codes']}

TOP DEDUCTIVE CODES:
{self._format_top_codes(self.code_patterns['deductive_frequency'], 10)}

TOP INDUCTIVE CODES:
{self._format_top_codes(self.code_patterns['inductive_frequency'], 10)}

SAMPLE CODED CHUNKS:
{pattern_examples}

TASK: Create 5-7 HIERARCHICAL THEMES that:
1. Integrate insights from both deductive and inductive codes
2. Have clear main themes with 2-3 sub-themes each
3. Are actionable and relevant

Format each theme as:

THEME [Number]: [Clear, Descriptive Name]
Core Concept: [2-3 sentences explaining what this theme captures]
Sub-themes:
  a) [Sub-theme name]: [Brief description]
  b) [Sub-theme name]: [Brief description]
Key Finding: [The main insight this theme reveals]
Evidence Strength: [Strong/Moderate/Emerging - based on frequency]
"""

        try:
            response = self.client.messages.create(
                model=config['ai_model'],
                max_tokens=4000,
                temperature=0.3,
                messages=[{
                    "role": "user",
                    "content": theme_prompt
                }]
            )

            themes_analysis = response.content[0].text
            print("✅ Themes successfully built!")

            return {
                'themes_analysis': themes_analysis,
                'code_patterns': self.code_patterns,
                'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            }

        except Exception as e:
            print(f"❌ Error building themes: {e}")
            return {'error': str(e)}

    def _build_themes_without_ai(self):
        """Build themes using frequency analysis."""
        themes_text = "THEMES GENERATED FROM FREQUENCY ANALYSIS\n\n"

        # Group codes by frequency
        top_codes = list(self.code_patterns['all_codes_frequency'].items())[:15]

        # Create simple themes based on most frequent codes
        theme_num = 1
        for i in range(0, len(top_codes), 3):
            theme_codes = top_codes[i:i+3]
            if theme_codes:
                main_code = theme_codes[0][0]
                themes_text += f"\nTHEME {theme_num}: {main_code.replace('_', ' ').title()}\n"
                themes_text += f"Core Concept: This theme encompasses patterns related to {main_code} "
                themes_text += f"which appeared {theme_codes[0][1]} times in the data.\n"
                themes_text += "Sub-themes:\n"

                for j, (code, freq) in enumerate(theme_codes[1:], 1):
                    themes_text += f"  {chr(96+j)}) {code}: Frequency: {freq}\n"

                themes_text += f"Key Finding: High frequency of {main_code} suggests its importance.\n"
                themes_text += "Evidence Strength: Strong (based on frequency)\n"
                theme_num += 1

        return {
            'themes_analysis': themes_text,
            'code_patterns': self.code_patterns,
            'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        }

    def _get_pattern_examples(self):
        """Get example chunks for major patterns."""
        examples = []

        # Get examples of chunks with both deductive and inductive codes
        mixed_chunks = self.coded_df[self.coded_df['Coding_Status'] == 'Both_Deductive_Inductive'].head(3)

        for idx, row in mixed_chunks.iterrows():
            text_preview = str(row['text'])[:200] + "..."
            ded_codes = row.get('Deductive_Codes', 'None')
            ind_codes = row.get('Inductive_Codes', 'None')
            examples.append(f"\nChunk {row['chunk_id']}:\nText: {text_preview}\nDeductive: {ded_codes}\nInductive: {ind_codes}")

        return '\n'.join(examples) if examples else "No examples available"

    def _format_top_codes(self, freq_dict, limit):
        """Format top codes for the prompt."""
        lines = []
        for code, freq in list(freq_dict.items())[:limit]:
            lines.append(f"• {code}: {freq} occurrences")
        return '\n'.join(lines)

## Visualization Functions

Create visualizations to illuminate patterns in the coded data: a pie chart of coding status, bar charts of the most frequent deductive and inductive codes, a code co-occurrence heatmap, and a co-occurrence network graph.
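
The network graph keeps only code pairs that co-occur often enough. Here is a rough sketch of that filtering step, written with the standard library rather than `networkx` so it runs anywhere. The chunk data and the function name `cooccurrence_edges` are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(chunks, min_cooccurrence=3):
    """Count how often each pair of codes shares a chunk and keep the frequent pairs."""
    pair_counts = Counter()
    for codes in chunks:
        # set() avoids pairing a code with itself when it is listed twice
        for a, b in combinations(sorted(set(codes)), 2):
            pair_counts[(a, b)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_cooccurrence}

# Hypothetical coded chunks; codes ending in _IND are inductive
chunks = [
    ["trust", "access", "workarounds_IND"],
    ["trust", "access"],
    ["trust", "access", "humor_IND"],
    ["access", "workarounds_IND"],
]

edges = cooccurrence_edges(chunks, min_cooccurrence=2)

# Node degree = number of surviving edges touching a code (drives node size in the plot)
degree = Counter(code for pair in edges for code in pair)

print(edges)
print(degree)
```

Codes that end up with no surviving edges do not appear in `degree`, which mirrors how isolated nodes are dropped from the graph after filtering.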

In [None]:
def create_comprehensive_visualizations(df_coded, code_patterns, themes, coder):
    """
    Create all visualizations for the analysis.
    """
    print("\n📊 CREATING VISUALIZATIONS")
    print("-" * 40)

    # Set color scheme
    colors = {
        'primary': '#274C77',
        'secondary': '#6096BA',
        'accent': '#A3CEF1',
        'neutral': '#8B8C89',
        'background': '#E7ECEF'
    }

    # 1. Coding Method Distribution
    plt.figure(figsize=(10, 6))
    coding_status = df_coded['Coding_Status'].value_counts()
    plt.pie(coding_status.values, labels=coding_status.index, autopct='%1.1f%%',
            colors=[colors['primary'], colors['secondary'], colors['accent'], colors['neutral'], colors['background']],
            startangle=90)
    plt.title('Distribution of Coding Methods', fontsize=16, fontweight='bold', color=colors['primary'])
    plt.tight_layout()
    plt.savefig(f"{config['output_folder']}/coding_distribution.png", dpi=300, bbox_inches='tight')
    plt.show()

    # 2. Code Frequency Comparison
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

    # Deductive codes
    ded_codes = list(code_patterns['deductive_frequency'].items())[:15]
    if ded_codes:
        codes, freqs = zip(*ded_codes)
        ax1.barh(codes, freqs, color=colors['primary'])
        ax1.set_xlabel('Frequency', fontsize=12)
        ax1.set_title('Top 15 Deductive Codes', fontsize=14, fontweight='bold', color=colors['primary'])
        ax1.invert_yaxis()

    # Inductive codes
    ind_codes = list(code_patterns['inductive_frequency'].items())[:15]
    if ind_codes:
        codes, freqs = zip(*ind_codes)
        codes = [c.replace('_IND', '') for c in codes]
        ax2.barh(codes, freqs, color=colors['secondary'])
        ax2.set_xlabel('Frequency', fontsize=12)
        ax2.set_title('Top 15 Inductive Codes', fontsize=14, fontweight='bold', color=colors['primary'])
        ax2.invert_yaxis()

    plt.tight_layout()
    plt.savefig(f"{config['output_folder']}/code_frequencies.png", dpi=300, bbox_inches='tight')
    plt.show()

    # 3. Code Co-occurrence Heatmap
    create_code_cooccurrence_heatmap(df_coded, code_patterns, colors)

    # 4. Network Graph (if needed)
    if len(code_patterns['all_codes_frequency']) > 5:
        create_code_network_graph(df_coded, colors)

    print("\n✅ All visualizations created and saved!")


def create_code_cooccurrence_heatmap(df_coded, code_patterns, colors):
    """
    Create heatmap of code co-occurrences.
    """
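    # Get top 20 codes overall
    top_codes = list(code_patterns['all_codes_frequency'].keys())[:20]

    # Nothing to plot if no codes were applied (an empty matrix would make the heatmap fail)
    if not top_codes:
        print("⚠️ No codes available for the co-occurrence heatmap.")
        return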

    # Build co-occurrence matrix
    matrix = pd.DataFrame(0, index=top_codes, columns=top_codes)

    for idx, row in df_coded.iterrows():
        if pd.notna(row['All_Codes']) and row['All_Codes']:
            codes = [c.strip() for c in str(row['All_Codes']).split(',')]
            codes = [c for c in codes if c in top_codes]

            for i, code1 in enumerate(codes):
                for code2 in codes[i:]:
                    matrix.loc[code1, code2] += 1
                    if code1 != code2:
                        matrix.loc[code2, code1] += 1

    # Create heatmap
    plt.figure(figsize=(14, 12))

    # Custom colormap
    cmap = sns.light_palette(colors['primary'], as_cmap=True)

    sns.heatmap(matrix, cmap=cmap, annot=True, fmt='d',
                cbar_kws={'label': 'Co-occurrence Count'},
                linewidths=0.5)

    plt.title('Code Co-occurrence Heatmap', fontsize=16, fontweight='bold',
              color=colors['primary'], pad=20)
    plt.tight_layout()
    plt.savefig(f"{config['output_folder']}/code_cooccurrence.png", dpi=300, bbox_inches='tight')
    plt.show()


def create_code_network_graph(df_coded, colors, min_cooccurrence=3):
    """
    Create network graph of code relationships.
    """
    G = nx.Graph()

    # Build network
    for idx, row in df_coded.iterrows():
        if pd.notna(row['All_Codes']) and row['All_Codes']:
            codes = [c.strip() for c in str(row['All_Codes']).split(',')]

            # Add nodes with type
            for code in codes:
                if code not in G:
                    node_type = 'inductive' if code.endswith('_IND') else 'deductive'
                    G.add_node(code, type=node_type)

            # Add edges
            for i in range(len(codes)):
                for j in range(i+1, len(codes)):
                    if G.has_edge(codes[i], codes[j]):
                        G[codes[i]][codes[j]]['weight'] += 1
                    else:
                        G.add_edge(codes[i], codes[j], weight=1)

    # Filter edges
    edges_to_remove = [(u, v) for u, v, d in G.edges(data=True)
                       if d['weight'] < min_cooccurrence]
    G.remove_edges_from(edges_to_remove)
    G.remove_nodes_from(list(nx.isolates(G)))

    if len(G.nodes()) > 0:
        # Create visualization
        plt.figure(figsize=(16, 12))
        pos = nx.spring_layout(G, k=3, iterations=50, seed=42)

        # Color nodes by type
        node_colors = [colors['primary'] if G.nodes[node]['type'] == 'deductive' else colors['secondary']
                       for node in G.nodes()]

        # Size nodes by degree
        node_sizes = [G.degree(node) * 100 for node in G.nodes()]

        # Draw network
        nx.draw_networkx_nodes(G, pos, node_color=node_colors, node_size=node_sizes, alpha=0.7)

        # Draw edges with width based on weight
        edges = G.edges()
        weights = [G[u][v]['weight'] for u, v in edges]
        nx.draw_networkx_edges(G, pos, width=weights, alpha=0.3)

        # Draw labels
        nx.draw_networkx_labels(G, pos, font_size=8)

        plt.title(f'Code Co-occurrence Network (min {min_cooccurrence} co-occurrences)',
                  fontsize=16, fontweight='bold', color=colors['primary'])
        plt.axis('off')
        plt.tight_layout()
        plt.savefig(f"{config['output_folder']}/code_network.png", dpi=300, bbox_inches='tight')
        plt.show()

## Export Functions

Export analysis results to Excel and create formatted Word documents with themes.
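
The Word export walks the theme text line by line and formats each line according to its prefix. As an illustration of that prefix-dispatch idea, the sketch below turns the same plain-text layout into a list of dictionaries. The function `parse_themes` and the sample text are hypothetical and are not part of the notebook.

```python
def parse_themes(themes_text: str) -> list[dict]:
    """Group theme lines into one dict per 'THEME' heading, keyed by field prefix."""
    fields = {
        "Core Concept:": "core_concept",
        "Key Finding:": "key_finding",
        "Evidence Strength:": "evidence_strength",
    }
    themes = []
    for raw in themes_text.splitlines():
        line = raw.strip()
        if line.startswith("THEME"):
            themes.append({"title": line, "sub_themes": []})
        elif not themes:
            continue  # ignore any preamble before the first theme
        elif line.startswith(("a)", "b)", "c)")):
            themes[-1]["sub_themes"].append(line[2:].strip())
        else:
            for prefix, key in fields.items():
                if line.startswith(prefix):
                    themes[-1][key] = line[len(prefix):].strip()
    return themes

# Hypothetical theme text in the layout requested by the theme-building prompt
sample = """THEME 1: Trust in Institutions
Core Concept: Participants weigh past experience when deciding whom to trust.
Sub-themes:
  a) Earned trust: Built through repeated contact
  b) Borrowed trust: Extended via intermediaries
Key Finding: Trust is relational rather than institutional.
Evidence Strength: Moderate
"""

parsed = parse_themes(sample)
print(parsed[0]["title"], "|", parsed[0]["evidence_strength"])
```

A structured form like this could also feed a spreadsheet sheet of themes, though the notebook itself only writes themes to the Word report.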

In [None]:
def export_complete_analysis(df_coded, themes, inductive_results, code_patterns,
                           integration_stats, coder):
    """
    Export all analysis results to Excel and Word.
    """
    print("\n💾 EXPORTING ANALYSIS RESULTS")
    print("-" * 40)

    timestamp = config['timestamp']
    excel_filename = f"{config['output_folder']}/coding_analysis_{timestamp}.xlsx"
    word_filename = f"{config['output_folder']}/themes_report_{timestamp}.docx"

    # Export to Excel
    try:
        with pd.ExcelWriter(excel_filename, engine='openpyxl') as writer:
            # 1. Coded Data
            df_coded.to_excel(writer, sheet_name='Coded_Data', index=False)

            # 2. Coding Summary
            summary_data = {
                'Metric': ['Total Chunks', 'Deductive Only', 'Inductive Only',
                          'Both Types', 'No Codes', 'Total Coded'],
                'Count': [integration_stats['total_chunks'],
                         integration_stats['deductive_only'],
                         integration_stats['inductive_only'],
                         integration_stats['both_types'],
                         integration_stats['no_codes'],
                         integration_stats['coded_chunks']]
            }
            summary_df = pd.DataFrame(summary_data)
            summary_df.to_excel(writer, sheet_name='Coding_Summary', index=False)

            # 3. Deductive Codebook
            if coder and hasattr(coder, 'codebook_df'):
                coder.codebook_df.to_excel(writer, sheet_name='Deductive_Codebook', index=False)

            # 4. Inductive Codes
            if inductive_results and 'discovered_codes' in inductive_results:
                inductive_codes_data = []
                for code, details in inductive_results['discovered_codes'].items():
                    inductive_codes_data.append({
                        'Code': code,
                        'Definition': details.get('definition', ''),
                        'Rationale': details.get('rationale', ''),
                        'Application': details.get('application', ''),
                        'Example': details.get('example', '')
                    })
                if inductive_codes_data:
                    inductive_codes_df = pd.DataFrame(inductive_codes_data)
                    inductive_codes_df.to_excel(writer, sheet_name='Inductive_Codes', index=False)

            # 5. Code Frequencies
            freq_data = []
            for code, freq in code_patterns['all_codes_frequency'].items():
                code_type = 'Inductive' if code.endswith('_IND') else 'Deductive'
                freq_data.append({
                    'Code': code,
                    'Type': code_type,
                    'Frequency': freq
                })
            freq_df = pd.DataFrame(freq_data)
            freq_df.to_excel(writer, sheet_name='Code_Frequencies', index=False)

            # 6. Code Combinations
            if 'frequent_combinations' in code_patterns:
                combo_data = []
                for combo, freq in list(code_patterns['frequent_combinations'].items())[:20]:
                    combo_data.append({
                        'Combination': ' + '.join(combo),
                        'Frequency': freq
                    })
                if combo_data:
                    combo_df = pd.DataFrame(combo_data)
                    combo_df.to_excel(writer, sheet_name='Code_Combinations', index=False)

        print(f"✅ Excel file saved: {excel_filename}")

    except Exception as e:
        print(f"❌ Error exporting Excel: {e}")

    # Export themes to Word
    try:
        create_themes_document(themes, word_filename)
        print(f"✅ Word document saved: {word_filename}")

    except Exception as e:
        print(f"❌ Error creating Word document: {e}")

    print("\n📁 Files created in folder:")
    print(f"   {config['output_folder']}/")
    print("   ├── coding_analysis_[timestamp].xlsx")
    print("   ├── themes_report_[timestamp].docx")
    print("   ├── coding_distribution.png")
    print("   ├── code_frequencies.png")
    # Close the tree with the correct branch marker depending on whether the network graph exists
    if os.path.exists(f"{config['output_folder']}/code_network.png"):
        print("   ├── code_cooccurrence.png")
        print("   └── code_network.png")
    else:
        print("   └── code_cooccurrence.png")


def create_themes_document(themes, filename):
    """
    Create a formatted Word document with themes.
    """
    doc = Document()

    # Add title
    title = doc.add_heading('Qualitative Analysis Themes Report', 0)
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER

    # Add metadata
    doc.add_paragraph(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    doc.add_paragraph(f"Analysis Type: {config['coding_approach'].title()}")
    doc.add_paragraph(f"AI Model Used: {'Yes - ' + config['ai_model'] if config['use_ai'] else 'No'}")

    doc.add_page_break()

    # Add themes
    doc.add_heading('Discovered Themes', 1)

    if themes and 'themes_analysis' in themes:
        themes_text = themes['themes_analysis']

        # Parse and format themes
        lines = themes_text.split('\n')

        for line in lines:
            line = line.strip()

            if line.startswith('THEME'):
                # Main theme heading
                doc.add_heading(line, 2)

            elif line.startswith('Core Concept:'):
                # Core concept paragraph
                p = doc.add_paragraph()
                p.add_run('Core Concept: ').bold = True
                p.add_run(line.replace('Core Concept:', '').strip())

            elif line.startswith('Sub-themes:'):
                # Sub-themes heading
                p = doc.add_paragraph()
                p.add_run('Sub-themes:').bold = True

            elif line.startswith(('a)', 'b)', 'c)')):
                # Sub-theme items
                doc.add_paragraph(line, style='List Bullet')

            elif line.startswith('Key Finding:'):
                # Key finding
                p = doc.add_paragraph()
                p.add_run('Key Finding: ').bold = True
                p.add_run(line.replace('Key Finding:', '').strip())

            elif line.startswith('Evidence Strength:'):
                # Evidence strength
                p = doc.add_paragraph()
                p.add_run('Evidence Strength: ').bold = True
                p.add_run(line.replace('Evidence Strength:', '').strip())
                doc.add_paragraph()  # Add spacing

            elif line:
                # Other content
                doc.add_paragraph(line)

    else:
        doc.add_paragraph("No themes were generated in this analysis.")

    # Save document
    doc.save(filename)

## Main Analysis Execution

Run the complete analysis workflow based on the configured settings. This cell orchestrates all the analysis steps from coding through theme building and export.
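
Which phases run depends on substring checks against the configured coding approach. The sketch below isolates that gating logic. The function name `plan_phases` and the three approach labels are assumptions for illustration, since the exact option strings are set in the configuration interface elsewhere in the notebook.

```python
def plan_phases(coding_approach: str) -> list[str]:
    """Return the ordered phases that would run for a given coding approach label."""
    approach = coding_approach.lower()
    phases = []
    if "inductive only" not in approach:
        phases.append("deductive coding")
    if "deductive only" not in approach:
        phases.append("inductive coding")
    # These steps run regardless of the approach
    phases += ["integration", "pattern analysis", "theme building",
               "visualizations", "export"]
    return phases

# Assumed labels; the real options come from the notebook's configuration interface
for label in ("Deductive Only", "Inductive Only", "Hybrid"):
    print(f"{label}: {plan_phases(label)[:2]}")
```

Because the checks are negative ("not inductive only", "not deductive only"), any label that matches neither string runs both coding phases.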

In [None]:
def run_complete_analysis():
    """
    Run the complete analysis workflow based on configuration.
    """
    print("\n🚀 STARTING QUALITATIVE CODING ANALYSIS")
    print("=" * 60)

    try:
        # Check if files are loaded
        if config['codebook_df'] is None or config['transcript_df'] is None:
            print("❌ Please load files using the configuration interface first!")
            return None

        # Initialize the deductive coder
        coder = DeductiveCoder(config['codebook_df'])
        coder.display_codebook_summary()

        # Initialize results variables
        df_coded = config['transcript_df'].copy()
        inductive_results = None

        # Determine coding approach
        approach = config['coding_approach'].lower()

        # Step 1: Deductive Coding (if not inductive only)
        if 'inductive only' not in approach:
            print("\n📝 PHASE 1: DEDUCTIVE CODING")
            claude_coder = ClaudeAutoCoder(
                api_key=config['api_key'],
                codebook_df=config['codebook_df'],
                coder=coder,
                use_ai=config['use_ai']
            )
            df_coded = claude_coder.code_batch(df_coded)
        else:
            claude_coder = None

        # Step 2: Inductive Coding (if not deductive only)
        if 'deductive only' not in approach:
            print("\n📝 PHASE 2: INDUCTIVE CODING")
            # Need a claude_coder instance even for inductive only
            if claude_coder is None:
                claude_coder = ClaudeAutoCoder(
                    api_key=config['api_key'],
                    codebook_df=config['codebook_df'],
                    coder=coder,
                    use_ai=config['use_ai']
                )

            inductive_coder = ClaudeInductiveCoder(claude_coder, df_coded)
            inductive_results = inductive_coder.generate_inductive_codes(sample_size=60)

            # Display discovered codes
            # Guard against a None result if code discovery failed
            if inductive_results and 'discovered_codes' in inductive_results:
                print("\n📋 Discovered Inductive Codes:")
                for code, details in inductive_results['discovered_codes'].items():
                    print(f"\n• {code}")
                    print(f"  Definition: {details['definition']}")
                    print(f"  When to apply: {details['application']}")

            # Apply inductive codes
            df_coded = inductive_coder.apply_inductive_codes()

        # Step 3: Integrate coding approaches
        print("\n📝 PHASE 3: INTEGRATION & ANALYSIS")
        df_coded, integration_stats = integrate_coding_approaches(df_coded)

        # Step 4: Analyze patterns
        code_patterns = analyze_code_patterns(df_coded)

        # Step 5: Build themes
        print("\n📝 PHASE 4: THEME BUILDING")
        theme_builder = IntegratedThemeBuilder(
            claude_coder=claude_coder,
            coded_df=df_coded,
            code_patterns=code_patterns,
            inductive_results=inductive_results
        )
        themes = theme_builder.build_themes()

        # Display themes
        if 'themes_analysis' in themes:
            print("\n📋 GENERATED THEMES:")
            print("=" * 60)
            print(themes['themes_analysis'])

        # Step 6: Create visualizations
        print("\n📝 PHASE 5: VISUALIZATIONS")
        create_comprehensive_visualizations(df_coded, code_patterns, themes, coder)

        # Step 7: Export results
        print("\n📝 PHASE 6: EXPORT")
        export_complete_analysis(
            df_coded=df_coded,
            themes=themes,
            inductive_results=inductive_results,
            code_patterns=code_patterns,
            integration_stats=integration_stats,
            coder=coder
        )

        # Final summary
        print("\n" + "=" * 60)
        print("📊 QUALITATIVE ANALYSIS COMPLETE")
        print("=" * 60)
        print(f"\n✅ Analysis Type: {config['coding_approach']}")
        print(f"✅ Total chunks analyzed: {integration_stats['total_chunks']}")
        print(f"✅ Successfully coded: {integration_stats['coded_chunks']}")
        print(f"✅ Unique codes used: {code_patterns['unique_codes']}")
        print(f"\n📁 All results saved to: {config['output_folder']}/")

        return {
            'df_coded': df_coded,
            'themes': themes,
            'code_patterns': code_patterns,
            'integration_stats': integration_stats,
            'inductive_results': inductive_results
        }

    except Exception as e:
        print(f"\n❌ Error in analysis workflow: {e}")
        import traceback
        traceback.print_exc()
        return None

# Run the analysis when ready
print("\n✅ Analysis function ready!")
print("📌 To run the analysis:")
print("1. Upload your files using the configuration interface above")
print("2. Configure your analysis settings")
print("3. Click 'Start Analysis' button")
print("4. Then run: results = run_complete_analysis()")

## Execute


In [None]:
# Run the complete analysis
results = run_complete_analysis()