# ML-Based Open Coding Analysis

This notebook provides an **ML-powered framework** for analyzing open-ended qualitative data. It produces the 15 outputs listed below and supports several topic-discovery methods.

## 15 Essential Outputs:
1. **Code Assignments** - Which codes apply to each response with confidence scores
2. **Code Frame/Codebook** - Complete list of codes with definitions and examples
3. **Code Frequency Table** - Statistical distribution of code usage
4. **Confidence/Quality Metrics** - Model reliability and performance metrics
5. **Binary/Multi-Label Matrix** - Code presence/absence for statistical analysis
6. **Representative Quotes** - Top examples for each code
7. **Co-Occurrence Analysis** - Code relationship patterns
8. **Descriptive Statistics** - Comprehensive summary statistics
9. **Segmentation/Subgroup Analysis** - Code patterns across demographics
10. **Quality Assurance Report** - Validation and error analysis
11. **Visualizations** - Interactive charts and network diagrams
12. **Exportable Datasets** - Multiple format exports (CSV, Excel, JSON)
13. **Model/Method Documentation** - Transparent methodology
14. **Uncoded/Ambiguous Responses** - Edge cases and low-confidence items
15. **Executive Summary** - High-level insights for stakeholders

## Features:
- 🤖 Multiple ML algorithms (TF-IDF with K-Means clustering, LDA, NMF)
- 📊 Advanced visualizations (networks, heatmaps, interactive plots)
- 📈 Statistical analysis and quality metrics
- 💾 Multiple export formats
- 🔍 Automatic theme discovery
- ✅ Quality assurance and validation

## 1. Setup and Imports

In [None]:
# Standard library
import os
import sys
import warnings
import json
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Union
from collections import Counter, defaultdict
from datetime import datetime
import logging

# Data manipulation
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer

# NLP
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Download required NLTK data (nltk.download typically returns False on failure instead of raising)
for resource in ('stopwords', 'punkt', 'wordnet', 'averaged_perceptron_tagger'):
    try:
        if not nltk.download(resource, quiet=True):
            print(f"Note: NLTK resource '{resource}' could not be downloaded. Continuing...")
    except Exception as exc:
        print(f"Note: NLTK download of '{resource}' failed ({exc}). Continuing...")

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Word clouds
try:
    from wordcloud import WordCloud
    WORDCLOUD_AVAILABLE = True
except ImportError:
    WORDCLOUD_AVAILABLE = False
    print("Note: wordcloud not available. Install with: pip install wordcloud")

# Network analysis
try:
    import networkx as nx
    NETWORKX_AVAILABLE = True
except ImportError:
    NETWORKX_AVAILABLE = False
    print("Note: networkx not available. Install with: pip install networkx")

# Configure settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("✓ All imports successful")
print(f"✓ WordCloud available: {WORDCLOUD_AVAILABLE}")
print(f"✓ NetworkX available: {NETWORKX_AVAILABLE}")

## 2. ML-Based Coding Engine

Core engine for automatic code discovery and assignment using machine learning.
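
As a rough illustration of the assignment step, the sketch below turns one response's topic weights into code IDs by normalizing the weights and applying a minimum-confidence threshold. The helper name `assign_codes` and the sample weights are hypothetical; the `CODE_01`-style IDs and the 0.3 default mirror the class defined in this section.

```python
from typing import List, Tuple


def assign_codes(weights: List[float], min_confidence: float = 0.3) -> List[Tuple[str, float]]:
    """Return (code_id, confidence) pairs whose normalized weight meets the threshold."""
    total = sum(weights)
    if total <= 0:
        return []  # no signal for any code: the response stays uncoded
    shares = [w / total for w in weights]
    return [
        (f"CODE_{idx + 1:02d}", share)
        for idx, share in enumerate(shares)
        if share >= min_confidence
    ]


# Hypothetical topic weights for a single response across four codes
assigned = assign_codes([3.0, 2.0, 0.5, 0.5])
print(assigned)
```

Because several codes can clear the threshold at once, a single response may receive more than one code, and a response whose weights are all below the threshold receives none.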

In [None]:
class MLOpenCoder:
    """
    ML-powered open coding system with automatic theme discovery.
    
    Features:
    - Automatic code discovery using topic modeling
    - Confidence scores for all assignments
    - Multiple algorithm support
    - Quality metrics and validation
    """
    
    def __init__(self, n_codes=10, method='tfidf_kmeans', min_confidence=0.3):
        """
        Initialize ML Open Coder.
        
        Args:
            n_codes: Number of codes/themes to discover
            method: Algorithm to use ('tfidf_kmeans', 'lda', 'nmf')
            min_confidence: Minimum confidence threshold for code assignment
        """
        self.n_codes = n_codes
        self.method = method
        self.min_confidence = min_confidence
        
        self.vectorizer = None
        self.model = None
        self.codebook = {}
        self.code_assignments = None
        self.confidence_scores = None
        self.feature_matrix = None
        
        # Initialize lemmatizer
        self.lemmatizer = WordNetLemmatizer()
        
    def preprocess_text(self, text):
        """Clean and preprocess text."""
        if pd.isna(text):
            return ""
        
        # Convert to lowercase
        text = str(text).lower()
        
        # Remove special characters but keep spaces
        text = re.sub(r'[^a-z\s]', ' ', text)
        
        # Remove extra whitespace
        text = ' '.join(text.split())
        
        return text
    
    def fit(self, responses, stop_words='english'):
        """
        Discover codes from responses using ML.
        
        Args:
            responses: List or Series of response texts
            stop_words: Stop words to remove
        """
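        # Materialize responses so later positional lookups also work for a
        # pandas Series whose index does not start at 0
        responses = list(responses)

        # Preprocess
        processed = [self.preprocess_text(r) for r in responses]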
        
        # Vectorize
        if self.method == 'lda':
            self.vectorizer = CountVectorizer(
                max_features=1000,
                stop_words=stop_words,
                min_df=2,
                max_df=0.8
            )
            self.feature_matrix = self.vectorizer.fit_transform(processed)
            self.model = LatentDirichletAllocation(
                n_components=self.n_codes,
                random_state=42,
                max_iter=20
            )
        
        elif self.method == 'nmf':
            self.vectorizer = TfidfVectorizer(
                max_features=1000,
                stop_words=stop_words,
                min_df=2,
                max_df=0.8
            )
            self.feature_matrix = self.vectorizer.fit_transform(processed)
            self.model = NMF(
                n_components=self.n_codes,
                random_state=42,
                max_iter=200
            )
        
        else:  # tfidf_kmeans (default)
            self.vectorizer = TfidfVectorizer(
                max_features=1000,
                stop_words=stop_words,
                min_df=2,
                max_df=0.8,
                ngram_range=(1, 2)
            )
            self.feature_matrix = self.vectorizer.fit_transform(processed)
            self.model = KMeans(
                n_clusters=self.n_codes,
                random_state=42,
                n_init=10
            )
        
        # Fit model
        logger.info(f"Fitting {self.method} model with {self.n_codes} codes...")
        
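        if self.method in ['lda', 'nmf']:
            doc_topic_matrix = self.model.fit_transform(self.feature_matrix)
            if self.method == 'nmf':
                # NMF weights are unnormalized; scale each row to sum to 1 so that
                # min_confidence is comparable to LDA's topic proportions
                row_sums = doc_topic_matrix.sum(axis=1, keepdims=True)
                row_sums[row_sums == 0] = 1.0
                doc_topic_matrix = doc_topic_matrix / row_sums
        else:
            labels = self.model.fit_predict(self.feature_matrix)
            # K-Means gives hard labels, so each response gets exactly one code
            # with confidence 1.0
            doc_topic_matrix = np.zeros((len(responses), self.n_codes))
            doc_topic_matrix[np.arange(len(labels)), labels] = 1.0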
        
        # Generate codebook
        self._generate_codebook()
        
        # Assign codes with confidence
        self._assign_codes(doc_topic_matrix, responses)
        
        logger.info("✓ Model fitted successfully")
        
        return self
    
    def _generate_codebook(self, top_words=10):
        """Generate codebook from model."""
        feature_names = self.vectorizer.get_feature_names_out()
        
        for code_idx in range(self.n_codes):
            code_id = f"CODE_{code_idx + 1:02d}"
            
            # Get top words for this code
            if self.method in ['lda', 'nmf']:
                topic_weights = self.model.components_[code_idx]
                top_indices = topic_weights.argsort()[-top_words:][::-1]
            else:  # kmeans
                cluster_center = self.model.cluster_centers_[code_idx]
                top_indices = cluster_center.argsort()[-top_words:][::-1]
            
            top_words_list = [feature_names[i] for i in top_indices]
            
            # Generate label from top words
            label = ' '.join(top_words_list[:3]).title()
            
            self.codebook[code_id] = {
                'label': label,
                'keywords': top_words_list,
                'count': 0,
                'examples': [],
                'avg_confidence': 0.0
            }
    
    def _assign_codes(self, doc_topic_matrix, responses):
        """Assign codes to documents with confidence scores."""
        assignments = []
        confidences = []
        
        for doc_idx, topic_dist in enumerate(doc_topic_matrix):
            # Get codes above confidence threshold
            doc_codes = []
            doc_confidences = []
            
            for code_idx, confidence in enumerate(topic_dist):
                if confidence >= self.min_confidence:
                    code_id = f"CODE_{code_idx + 1:02d}"
                    doc_codes.append(code_id)
                    doc_confidences.append(float(confidence))
                    
                    # Update codebook stats
                    self.codebook[code_id]['count'] += 1
                    
                    # Store example if confidence is high
                    if confidence > 0.6 and len(self.codebook[code_id]['examples']) < 10:
                        self.codebook[code_id]['examples'].append({
                            'text': str(responses[doc_idx]),
                            'confidence': float(confidence)
                        })
            
            assignments.append(doc_codes)
            confidences.append(doc_confidences)
        
        # Calculate average confidence per code as (sum of confidences) / count.
        # A running-average update is not valid here because 'count' already
        # holds the final total for each code.
        conf_sums = defaultdict(float)
        for doc_codes, doc_confs in zip(assignments, confidences):
            for code, conf in zip(doc_codes, doc_confs):
                conf_sums[code] += conf
        for code, total in conf_sums.items():
            self.codebook[code]['avg_confidence'] = total / self.codebook[code]['count']
        
        self.code_assignments = assignments
        self.confidence_scores = confidences
    
    def get_codebook_df(self):
        """Return codebook as DataFrame."""
        data = []
        for code_id, info in self.codebook.items():
            data.append({
                'Code ID': code_id,
                'Label': info['label'],
                'Keywords': ', '.join(info['keywords'][:5]),
                'Count': info['count'],
                'Percentage': (info['count'] / len(self.code_assignments) * 100) if self.code_assignments else 0,
                'Avg Confidence': info['avg_confidence']
            })
        
        return pd.DataFrame(data).sort_values('Count', ascending=False)
    
    def get_quality_metrics(self):
        """Calculate quality and reliability metrics."""
        metrics = {}
        
        # Basic statistics
        total_assignments = sum(len(codes) for codes in self.code_assignments)
        metrics['total_assignments'] = total_assignments
        metrics['avg_codes_per_response'] = total_assignments / len(self.code_assignments)
        
        # Coverage
        coded_responses = sum(1 for codes in self.code_assignments if len(codes) > 0)
        metrics['coverage_pct'] = (coded_responses / len(self.code_assignments)) * 100
        
        # Confidence statistics
        all_confidences = [conf for confs in self.confidence_scores for conf in confs]
        if all_confidences:
            metrics['avg_confidence'] = np.mean(all_confidences)
            metrics['min_confidence'] = np.min(all_confidences)
            metrics['max_confidence'] = np.max(all_confidences)
            metrics['std_confidence'] = np.std(all_confidences)
        
        # Clustering quality (if available)
        if self.feature_matrix is not None and hasattr(self.model, 'cluster_centers_'):
            labels = self.model.labels_
            if len(set(labels)) > 1:
                metrics['silhouette_score'] = silhouette_score(
                    self.feature_matrix, labels
                )
                metrics['davies_bouldin_score'] = davies_bouldin_score(
                    self.feature_matrix.toarray(), labels
                )
                metrics['calinski_harabasz_score'] = calinski_harabasz_score(
                    self.feature_matrix.toarray(), labels
                )
        
        return metrics

print("✓ MLOpenCoder class defined")

## 3. Analysis Results Package

Complete results package with all 15 essential outputs.
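
Two of these outputs, the binary matrix and the co-occurrence counts, follow directly from the per-response code lists. The sketch below shows the idea with the standard library only; the sample `assignments` and the variable names are hypothetical, and the class in this section builds the same structures as pandas objects.

```python
from collections import Counter
from itertools import combinations

# Hypothetical assignments: one list of code IDs per response
assignments = [
    ["CODE_01", "CODE_02"],
    ["CODE_02"],
    ["CODE_01", "CODE_02", "CODE_03"],
    [],
]
codes = ["CODE_01", "CODE_02", "CODE_03"]

# Binary presence/absence rows (one row per response, one column per code)
binary_rows = [[int(code in assigned) for code in codes] for assigned in assignments]

# Unordered pair counts; sorting makes (A, B) and (B, A) the same key
pair_counts = Counter(
    pair
    for assigned in assignments
    for pair in combinations(sorted(set(assigned)), 2)
)

print(binary_rows)
print(pair_counts.most_common())
```

Responses with fewer than two codes contribute nothing to the pair counts, so with a hard-clustering method such as K-Means (one code per response) the co-occurrence outputs will be empty.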

In [None]:
class OpenCodingResults:
    """
    Comprehensive results package for ML-based open coding.
    
    Provides all 15 essential outputs:
    1. Code Assignments
    2. Codebook
    3. Frequency Tables
    4. Quality Metrics
    5. Binary Matrix
    6. Representative Quotes
    7. Co-occurrence Analysis
    8. Descriptive Statistics
    9. Segmentation Analysis
    10. QA Report
    11. Visualizations
    12. Exports
    13. Documentation
    14. Uncoded Responses
    15. Executive Summary
    """
    
    def __init__(self, df, coder: MLOpenCoder, response_col='response', id_col='response_id'):
        self.df = df.copy()
        self.coder = coder
        self.response_col = response_col
        self.id_col = id_col
        
        # Ensure ID column exists
        if id_col not in self.df.columns:
            self.df[id_col] = range(1, len(self.df) + 1)
        
        # Add coding results to dataframe
        self.df['assigned_codes'] = coder.code_assignments
        self.df['confidence_scores'] = coder.confidence_scores
        self.df['num_codes'] = self.df['assigned_codes'].apply(len)
        
    # OUTPUT 1: Code Assignments
    def get_code_assignments(self):
        """Get complete code assignments with confidence scores."""
        return self.df[[self.id_col, self.response_col, 'assigned_codes', 'confidence_scores']]
    
    # OUTPUT 2: Codebook
    def get_codebook(self):
        """Get complete codebook with definitions and examples."""
        return self.coder.get_codebook_df()
    
    def get_codebook_detailed(self):
        """Get detailed codebook with examples."""
        data = []
        for code_id, info in self.coder.codebook.items():
            # Get top 3 examples
            examples = sorted(info['examples'], key=lambda x: x['confidence'], reverse=True)[:3]
            example_text = ' | '.join([f"{ex['text'][:50]}..." for ex in examples])
            
            data.append({
                'Code ID': code_id,
                'Label': info['label'],
                'Definition': f"Responses about {info['label'].lower()}",
                'Keywords': ', '.join(info['keywords']),
                'Examples': example_text if example_text else 'N/A',
                'Count': info['count'],
                'Percentage': (info['count'] / len(self.df) * 100),
                'Avg Confidence': info['avg_confidence']
            })
        
        return pd.DataFrame(data).sort_values('Count', ascending=False)
    
    # OUTPUT 3: Code Frequency Table
    def get_frequency_table(self):
        """Get code frequency statistics."""
        freq_data = []
        
        for code_id, info in self.coder.codebook.items():
            count = info['count']
            pct = (count / len(self.df)) * 100
            
            freq_data.append({
                'Code': code_id,
                'Label': info['label'],
                'Count': count,
                'Percentage': pct,
                'Rank': 0  # Will be filled
            })
        
        freq_df = pd.DataFrame(freq_data).sort_values('Count', ascending=False)
        freq_df['Rank'] = range(1, len(freq_df) + 1)
        
        return freq_df
    
    # OUTPUT 4: Quality Metrics
    def get_quality_metrics(self):
        """Get comprehensive quality and confidence metrics."""
        return self.coder.get_quality_metrics()
    
    # OUTPUT 5: Binary Matrix
    def get_binary_matrix(self):
        """Get binary code matrix for statistical analysis."""
        # Create binary columns for each code
        binary_df = self.df[[self.id_col]].copy()
        
        for code_id in self.coder.codebook.keys():
            binary_df[f'code_{code_id}'] = self.df['assigned_codes'].apply(
                lambda codes: 1 if code_id in codes else 0
            )
        
        return binary_df
    
    # OUTPUT 6: Representative Quotes
    def get_representative_quotes(self, top_n=5):
        """Get top representative quotes for each code."""
        quotes = {}
        
        for code_id, info in self.coder.codebook.items():
            # Sort examples by confidence
            sorted_examples = sorted(
                info['examples'],
                key=lambda x: x['confidence'],
                reverse=True
            )[:top_n]
            
            quotes[code_id] = {
                'label': info['label'],
                'quotes': [
                    {
                        'text': ex['text'],
                        'confidence': ex['confidence']
                    }
                    for ex in sorted_examples
                ]
            }
        
        return quotes
    
    # OUTPUT 7: Co-occurrence Analysis
    def get_cooccurrence_matrix(self):
        """Calculate code co-occurrence matrix efficiently."""
        from itertools import combinations
        
        codes = list(self.coder.codebook.keys())
        n = len(codes)
        code_to_idx = {code: i for i, code in enumerate(codes)}
        cooccur = np.zeros((n, n))
        
        # Only iterate over assigned code pairs instead of all possible pairs
        for assigned_codes in self.df['assigned_codes']:
            # Process pairs of codes that were actually assigned
            for code1, code2 in combinations(assigned_codes, 2):
                i, j = code_to_idx[code1], code_to_idx[code2]
                cooccur[i, j] += 1
                cooccur[j, i] += 1  # Symmetric matrix
            
            # Diagonal: each code co-occurs with itself
            for code in assigned_codes:
                i = code_to_idx[code]
                cooccur[i, i] += 1
        
        # Create DataFrame
        labels = [self.coder.codebook[c]['label'] for c in codes]
        cooccur_df = pd.DataFrame(cooccur, index=labels, columns=labels)
        
        return cooccur_df
    
    def get_cooccurrence_pairs(self, min_count=2):
        """Get code pairs that frequently co-occur."""
        pairs = Counter()
        
        for assigned_codes in self.df['assigned_codes']:
            for i, code1 in enumerate(assigned_codes):
                for code2 in assigned_codes[i+1:]:
                    pair = tuple(sorted([code1, code2]))
                    pairs[pair] += 1
        
        # Convert to DataFrame
        pair_data = []
        for (code1, code2), count in pairs.most_common():
            if count >= min_count:
                label1 = self.coder.codebook[code1]['label']
                label2 = self.coder.codebook[code2]['label']
                
                pair_data.append({
                    'Code 1': code1,
                    'Label 1': label1,
                    'Code 2': code2,
                    'Label 2': label2,
                    'Co-occurrence Count': count,
                    'Percentage': (count / len(self.df)) * 100
                })
        
        return pd.DataFrame(pair_data)
    
    # OUTPUT 8: Descriptive Statistics
    def get_descriptive_stats(self):
        """Get comprehensive descriptive statistics."""
        stats = {
            'Total Responses': len(self.df),
            'Total Codes Defined': len(self.coder.codebook),
            'Total Code Assignments': self.df['num_codes'].sum(),
            'Mean Codes per Response': self.df['num_codes'].mean(),
            'Median Codes per Response': self.df['num_codes'].median(),
            'Std Dev Codes per Response': self.df['num_codes'].std(),
            'Min Codes per Response': self.df['num_codes'].min(),
            'Max Codes per Response': self.df['num_codes'].max(),
            'Responses with 0 Codes': (self.df['num_codes'] == 0).sum(),
            'Responses with 1+ Codes': (self.df['num_codes'] > 0).sum(),
            'Coverage %': ((self.df['num_codes'] > 0).sum() / len(self.df)) * 100
        }
        
        return pd.Series(stats)
    
    # OUTPUT 9: Segmentation Analysis
    def get_segmentation_analysis(self, segment_col):
        """Analyze code patterns across demographic segments."""
        if segment_col not in self.df.columns:
            return None
        
        # Get unique segments
        segments = self.df[segment_col].unique()
        
        seg_data = []
        for segment in segments:
            seg_df = self.df[self.df[segment_col] == segment]
            
            # Count codes in this segment
            for code_id, info in self.coder.codebook.items():
                count = sum(1 for codes in seg_df['assigned_codes'] if code_id in codes)
                pct = (count / len(seg_df)) * 100 if len(seg_df) > 0 else 0
                
                seg_data.append({
                    'Segment': segment,
                    'Code': code_id,
                    'Label': info['label'],
                    'Count': count,
                    'Percentage': pct
                })
        
        return pd.DataFrame(seg_data)
    
    # OUTPUT 10: Quality Assurance Report
    def get_qa_report(self, sample_size=10):
        """Generate quality assurance report."""
        report = {
            'timestamp': datetime.now().isoformat(),
            'method': self.coder.method,
            'total_responses': len(self.df),
            'quality_metrics': self.get_quality_metrics(),
            'low_confidence_count': sum(
                1 for confs in self.df['confidence_scores']
                if any(c < 0.5 for c in confs)
            ),
            # Cast NumPy integers to plain ints so the report is JSON-serializable
            'uncoded_count': int((self.df['num_codes'] == 0).sum()),
            'multi_coded_count': int((self.df['num_codes'] > 1).sum()),
        }
        
        return report
    
    # OUTPUT 14: Uncoded/Ambiguous Responses
    def get_uncoded_responses(self):
        """Get responses with no codes assigned."""
        uncoded = self.df[self.df['num_codes'] == 0]
        return uncoded[[self.id_col, self.response_col]]
    
    def get_low_confidence_responses(self, threshold=0.5):
        """Get responses with low confidence scores."""
        low_conf = self.df[
            self.df['confidence_scores'].apply(
                lambda confs: any(c < threshold for c in confs) if confs else False
            )
        ]
        return low_conf[[self.id_col, self.response_col, 'assigned_codes', 'confidence_scores']]
    
    def get_ambiguous_responses(self, min_codes=3):
        """Get responses with many codes (potentially ambiguous)."""
        ambiguous = self.df[self.df['num_codes'] >= min_codes]
        return ambiguous[[self.id_col, self.response_col, 'assigned_codes', 'confidence_scores']]

print("✓ OpenCodingResults class defined")

## 4. Visualization Engine

Comprehensive visualization suite for all outputs.
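
The network diagram in this section sizes nodes by code count and draws edges with a width equal to the co-occurrence count, which can grow very large on big datasets. One possible way to keep marker sizes in a readable range is a simple linear rescale, sketched below. The helper `scale_linear`, the sample counts, and the 10 to 60 pixel range are assumptions for illustration, not part of the visualizer class.

```python
def scale_linear(value, lo, hi, out_min, out_max):
    """Map value from [lo, hi] to [out_min, out_max]; constant inputs map to the midpoint."""
    if hi == lo:
        return (out_min + out_max) / 2
    return out_min + (value - lo) / (hi - lo) * (out_max - out_min)


counts = {"CODE_01": 120, "CODE_02": 45, "CODE_03": 5}  # hypothetical code counts
lo, hi = min(counts.values()), max(counts.values())

# Marker sizes in pixels, bounded to the 10..60 range regardless of dataset size
node_sizes = {code: scale_linear(n, lo, hi, 10, 60) for code, n in counts.items()}
print(node_sizes)
```

The same helper could be applied to edge weights before passing them as line widths.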

In [None]:
class CodingVisualizer:
    """Comprehensive visualization engine for open coding results."""
    
    def __init__(self, results: OpenCodingResults):
        self.results = results
    
    def plot_frequency_chart(self, top_n=15):
        """Bar chart of code frequencies."""
        freq_df = self.results.get_frequency_table().head(top_n)
        
        fig = px.bar(
            freq_df,
            x='Label',
            y='Count',
            color='Percentage',
            title=f'Top {top_n} Code Frequencies',
            labels={'Count': 'Number of Responses', 'Label': 'Code'},
            color_continuous_scale='Viridis',
            text='Count'
        )
        fig.update_traces(textposition='outside')
        fig.update_layout(xaxis_tickangle=-45, height=500)
        return fig
    
    def plot_cooccurrence_heatmap(self):
        """Heatmap of code co-occurrences."""
        cooccur_df = self.results.get_cooccurrence_matrix()
        
        fig = px.imshow(
            cooccur_df,
            labels=dict(color="Co-occurrences"),
            title="Code Co-occurrence Matrix",
            color_continuous_scale='YlOrRd',
            aspect='auto'
        )
        fig.update_layout(height=600)
        return fig
    
    def plot_network_diagram(self, min_cooccurrence=2):
        """Network diagram of code relationships."""
        if not NETWORKX_AVAILABLE:
            print("NetworkX not available")
            return None
        
        # Build network
        G = nx.Graph()
        
        # Add nodes
        for code_id, info in self.results.coder.codebook.items():
            G.add_node(code_id, label=info['label'], count=info['count'])
        
        # Add edges from co-occurrences
        pairs_df = self.results.get_cooccurrence_pairs(min_count=min_cooccurrence)
        for _, row in pairs_df.iterrows():
            G.add_edge(
                row['Code 1'],
                row['Code 2'],
                weight=row['Co-occurrence Count']
            )
        
        if len(G.nodes) == 0:
            print("No nodes to visualize")
            return None
        
        # Layout
        # Layout (fixed seed so the diagram is reproducible between runs)
        pos = nx.spring_layout(G, k=1, iterations=50, seed=42)
        
        # Create edge traces
        edge_traces = []
        for edge in G.edges():
            x0, y0 = pos[edge[0]]
            x1, y1 = pos[edge[1]]
            weight = G.edges[edge]['weight']
            
            edge_traces.append(
                go.Scatter(
                    x=[x0, x1, None],
                    y=[y0, y1, None],
                    mode='lines',
                    line=dict(width=weight, color='#888'),
                    hoverinfo='none',
                    showlegend=False
                )
            )
        
        # Create node trace
        node_x = []
        node_y = []
        node_text = []
        node_size = []
        
        for node in G.nodes():
            x, y = pos[node]
            node_x.append(x)
            node_y.append(y)
            label = G.nodes[node]['label']
            count = G.nodes[node]['count']
            node_text.append(f"{label}<br>Count: {count}")
            node_size.append(10 + count * 2)
        
        node_trace = go.Scatter(
            x=node_x,
            y=node_y,
            mode='markers+text',
            hovertext=node_text,
            hoverinfo='text',
            marker=dict(
                size=node_size,
                color='lightblue',
                line=dict(width=2, color='darkblue')
            ),
            text=[G.nodes[node]['label'][:15] for node in G.nodes()],
            textposition='top center',
            showlegend=False
        )
        
        # Create figure
        fig = go.Figure(data=edge_traces + [node_trace])
        fig.update_layout(
            title='Code Co-occurrence Network',
            showlegend=False,
            hovermode='closest',
            height=700,
            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
        )
        
        return fig
    
    def plot_distribution_histogram(self):
        """Distribution of codes per response."""
        fig = px.histogram(
            self.results.df,
            x='num_codes',
            title='Distribution of Codes per Response',
            labels={'num_codes': 'Number of Codes', 'count': 'Frequency'},
            nbins=int(max(self.results.df['num_codes'].max(), 5))
        )
        fig.update_layout(height=400)
        return fig
    
    def plot_wordcloud(self, code_id=None):
        """Generate word cloud for responses (optionally filtered by code)."""
        if not WORDCLOUD_AVAILABLE:
            print("WordCloud not available")
            return None
        
        if code_id:
            # Filter by code
            filtered = self.results.df[
                self.results.df['assigned_codes'].apply(lambda x: code_id in x)
            ]
            title = f"Word Cloud - {self.results.coder.codebook[code_id]['label']}"
        else:
            filtered = self.results.df
            title = "Word Cloud - All Responses"
        
        text = ' '.join(filtered[self.results.response_col].astype(str))
        
        wordcloud = WordCloud(
            width=1000,
            height=500,
            background_color='white',
            colormap='viridis'
        ).generate(text)
        
        fig, ax = plt.subplots(figsize=(15, 7))
        ax.imshow(wordcloud, interpolation='bilinear')
        ax.axis('off')
        ax.set_title(title, fontsize=16, fontweight='bold')
        plt.tight_layout()
        return fig
    
    def plot_confidence_distribution(self):
        """Distribution of confidence scores."""
        all_confidences = [
            conf
            for confs in self.results.df['confidence_scores']
            for conf in confs
        ]
        
        fig = px.histogram(
            x=all_confidences,
            nbins=30,
            title='Distribution of Confidence Scores',
            labels={'x': 'Confidence Score', 'y': 'Frequency'}
        )
        fig.update_layout(height=400)
        return fig
    
    def plot_segmentation(self, segment_col, top_codes=5):
        """Compare code distribution across segments."""
        seg_df = self.results.get_segmentation_analysis(segment_col)
        if seg_df is None:
            print(f"Column '{segment_col}' not found")
            return None
        
        # Get top codes overall
        top_code_ids = self.results.get_frequency_table().head(top_codes)['Code'].tolist()
        filtered = seg_df[seg_df['Code'].isin(top_code_ids)]
        
        fig = px.bar(
            filtered,
            x='Segment',
            y='Percentage',
            color='Label',
            barmode='group',
            title=f'Top {top_codes} Codes by {segment_col}',
            labels={'Percentage': 'Percentage of Responses'}
        )
        fig.update_layout(height=500)
        return fig

print("✓ CodingVisualizer class defined")

## 5. Export Manager

Export results in multiple formats.
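
The JSON exports follow a common pattern: create a timestamped run folder, then dump a dictionary with a fallback for values that `json` cannot encode. A minimal sketch of that pattern is shown below. The function name `write_run_json` and the sample payload are hypothetical, and a temporary directory stands in for the real output folder.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def write_run_json(payload: dict, output_dir: Path, name: str) -> Path:
    """Write payload to <output_dir>/coding_run_<timestamp>/<name> and return the path."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = output_dir / f"coding_run_{timestamp}"
    run_dir.mkdir(parents=True, exist_ok=True)  # also creates missing parent folders
    target = run_dir / name
    with open(target, "w", encoding="utf-8") as f:
        # default=str stringifies anything json cannot encode natively
        json.dump(payload, f, indent=2, ensure_ascii=False, default=str)
    return target


with tempfile.TemporaryDirectory() as tmp:
    path = write_run_json(
        {"coverage_pct": 87.5, "generated": datetime(2024, 1, 1)},  # hypothetical values
        Path(tmp) / "output",
        "quality_metrics.json",
    )
    loaded = json.loads(path.read_text(encoding="utf-8"))
    print(loaded)
```

Note that `default=str` turns unsupported values into strings, so numeric types such as NumPy integers are better cast to `int` before export if they need to stay numeric.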

In [None]:
class ResultsExporter:
    """Export coding results in multiple formats."""
    
    def __init__(self, results: OpenCodingResults, output_dir='output'):
        self.results = results
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        # Create timestamped subfolder
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        self.run_dir = self.output_dir / f'coding_run_{timestamp}'
        self.run_dir.mkdir(exist_ok=True)
    
    def export_all(self):
        """Export all outputs."""
        print(f"Exporting results to: {self.run_dir}")
        
        # 1. Code Assignments (CSV)
        assignments = self.results.get_code_assignments()
        assignments.to_csv(self.run_dir / 'code_assignments.csv', index=False)
        print("✓ Exported code_assignments.csv")
        
        # 2. Codebook (CSV)
        codebook = self.results.get_codebook_detailed()
        codebook.to_csv(self.run_dir / 'codebook.csv', index=False)
        print("✓ Exported codebook.csv")
        
        # 3. Frequency Table (CSV)
        freq = self.results.get_frequency_table()
        freq.to_csv(self.run_dir / 'frequency_table.csv', index=False)
        print("✓ Exported frequency_table.csv")
        
        # 4. Quality Metrics (JSON)
        metrics = self.results.get_quality_metrics()
        with open(self.run_dir / 'quality_metrics.json', 'w') as f:
            json.dump(metrics, f, indent=2, default=str)
        print("✓ Exported quality_metrics.json")
        
        # 5. Binary Matrix (CSV)
        binary = self.results.get_binary_matrix()
        binary.to_csv(self.run_dir / 'binary_matrix.csv', index=False)
        print("✓ Exported binary_matrix.csv")
        
        # 6. Representative Quotes (JSON)
        quotes = self.results.get_representative_quotes()
        with open(self.run_dir / 'representative_quotes.json', 'w') as f:
            json.dump(quotes, f, indent=2, default=str)  # default=str guards against non-JSON types such as NumPy scalars
        print("✓ Exported representative_quotes.json")
        
        # 7. Co-occurrence Matrix (CSV)
        cooccur = self.results.get_cooccurrence_matrix()
        cooccur.to_csv(self.run_dir / 'cooccurrence_matrix.csv')
        print("✓ Exported cooccurrence_matrix.csv")
        
        # Co-occurrence Pairs (CSV)
        pairs = self.results.get_cooccurrence_pairs()
        pairs.to_csv(self.run_dir / 'cooccurrence_pairs.csv', index=False)
        print("✓ Exported cooccurrence_pairs.csv")
        
        # 8. Descriptive Statistics (CSV)
        stats = self.results.get_descriptive_stats()
        stats.to_csv(self.run_dir / 'descriptive_statistics.csv', header=['Value'])
        print("✓ Exported descriptive_statistics.csv")
        
        # 10. QA Report (JSON)
        qa = self.results.get_qa_report()
        with open(self.run_dir / 'qa_report.json', 'w') as f:
            json.dump(qa, f, indent=2, default=str)
        print("✓ Exported qa_report.json")
        
        # 14. Uncoded Responses (CSV)
        uncoded = self.results.get_uncoded_responses()
        uncoded.to_csv(self.run_dir / 'uncoded_responses.csv', index=False)
        print("✓ Exported uncoded_responses.csv")
        
        # Low Confidence (CSV)
        low_conf = self.results.get_low_confidence_responses()
        low_conf.to_csv(self.run_dir / 'low_confidence_responses.csv', index=False)
        print("✓ Exported low_confidence_responses.csv")
        
        print(f"\n✓ All exports complete! Output directory: {self.run_dir}")
        return self.run_dir
    
    def export_excel(self, filename='coding_results.xlsx'):
        """Export all results to a single Excel file with multiple sheets."""
        excel_path = self.run_dir / filename
        
        with pd.ExcelWriter(excel_path, engine='openpyxl') as writer:
            # Code Assignments
            self.results.get_code_assignments().to_excel(
                writer, sheet_name='Code Assignments', index=False
            )
            
            # Codebook
            self.results.get_codebook_detailed().to_excel(
                writer, sheet_name='Codebook', index=False
            )
            
            # Frequency Table
            self.results.get_frequency_table().to_excel(
                writer, sheet_name='Frequency Table', index=False
            )
            
            # Descriptive Stats
            self.results.get_descriptive_stats().to_excel(
                writer, sheet_name='Statistics'
            )
            
            # Co-occurrence Pairs
            self.results.get_cooccurrence_pairs().to_excel(
                writer, sheet_name='Co-occurrences', index=False
            )
            
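            # Binary Matrix (skipped if it exceeds Excel's 16,384-column limit)
            binary = self.results.get_binary_matrix()
            if len(binary.columns) <= 16384:
                binary.to_excel(writer, sheet_name='Binary Matrix', index=False)
            else:
                print("⚠ Binary matrix skipped: exceeds Excel's 16,384-column limit")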
            
            # Uncoded
            self.results.get_uncoded_responses().to_excel(
                writer, sheet_name='Uncoded Responses', index=False
            )
        
        print(f"✓ Exported comprehensive Excel file: {excel_path}")
        return excel_path

print("✓ ResultsExporter class defined")

## 6. Executive Summary Generator

OUTPUT 15: Generate executive summary for stakeholders.

In [None]:
class ExecutiveSummaryGenerator:
    """Generate executive summary for stakeholders."""
    
    def __init__(self, results: OpenCodingResults):
        self.results = results
    
    def generate(self, top_n_codes=5, top_n_quotes=3):
        """Generate comprehensive executive summary."""
        freq_table = self.results.get_frequency_table()
        stats = self.results.get_descriptive_stats()
        quotes = self.results.get_representative_quotes(top_n=top_n_quotes)
        
        summary = []
        summary.append("# EXECUTIVE SUMMARY")
        summary.append("="*60)
        summary.append("")
        
        # Overview
        summary.append("## Overview")
        summary.append(f"- **Total Responses Analyzed:** {stats['Total Responses']:,.0f}")
        summary.append(f"- **Themes Identified:** {stats['Total Codes Defined']:.0f}")
        summary.append(f"- **Coverage:** {stats['Coverage %']:.1f}% of responses coded")
        summary.append(f"- **Average Codes per Response:** {stats['Mean Codes per Response']:.2f}")
        summary.append("")
        
        # Top Themes
        summary.append(f"## Top {top_n_codes} Themes")
        summary.append("")
        
        for i, row in freq_table.head(top_n_codes).iterrows():
            summary.append(f"### {row['Rank']}. {row['Label']}")
            summary.append(f"   - **Frequency:** {row['Count']:,.0f} responses ({row['Percentage']:.1f}%)")
            
            # Add sample quote (ellipsis only when the text was actually truncated)
            code_id = row['Code']
            if code_id in quotes and quotes[code_id]['quotes']:
                text = quotes[code_id]['quotes'][0]['text']
                excerpt = text if len(text) <= 150 else text[:150] + "..."
                summary.append(f"   - **Example:** \"{excerpt}\"")
            summary.append("")
        
        # Key Insights
        summary.append("## Key Insights")
        summary.append("")
        
        # Collect insights first so the numbering stays sequential
        # even when an optional insight is skipped.
        insights = []

        # Most prevalent theme (guard against an empty frequency table)
        if len(freq_table) > 0:
            top_code = freq_table.iloc[0]
            insights.append(
                f"**Dominant Theme:** '{top_code['Label']}' appears in "
                f"{top_code['Percentage']:.1f}% of responses, making it the most "
                f"prevalent theme in the data."
            )

        # Coverage insight
        if stats['Coverage %'] < 80:
            insights.append(
                f"**Coverage Note:** {stats['Responses with 0 Codes']:.0f} responses "
                f"({100 - stats['Coverage %']:.1f}%) were not assigned any codes, "
                f"suggesting they may require manual review or represent unique perspectives."
            )

        # Multi-coding insight
        multi_coded = self.results.df[self.results.df['num_codes'] > 1]
        if len(multi_coded) > 0:
            pct_multi = (len(multi_coded) / len(self.results.df)) * 100
            insights.append(
                f"**Complex Responses:** {len(multi_coded):,.0f} responses ({pct_multi:.1f}%) "
                f"were assigned multiple codes, indicating nuanced or multifaceted perspectives."
            )

        for n, insight in enumerate(insights, 1):
            summary.append(f"{n}. {insight}")
        
        summary.append("")
        
        # Co-occurrence patterns
        pairs = self.results.get_cooccurrence_pairs()
        if len(pairs) > 0:
            summary.append("## Common Theme Combinations")
            summary.append("")
            for i, row in pairs.head(3).iterrows():
                summary.append(
                    f"- **{row['Label 1']}** + **{row['Label 2']}**: "
                    f"{row['Co-occurrence Count']:.0f} responses ({row['Percentage']:.1f}%)"
                )
            summary.append("")
        
        # Quality metrics
        metrics = self.results.get_quality_metrics()
        summary.append("## Quality Metrics")
        summary.append("")
        summary.append(f"- **Average Confidence:** {metrics.get('avg_confidence', 0):.2f}")
        summary.append(f"- **Method Used:** {self.results.coder.method.upper()}")
        summary.append("")
        
        # Recommendations
        summary.append("## Recommendations")
        summary.append("")
        summary.append("1. **Focus Areas:** Prioritize initiatives related to the top 3 themes.")
        summary.append("2. **Further Investigation:** Review uncoded and low-confidence responses manually.")
        summary.append("3. **Segmentation:** Conduct demographic analysis to identify group-specific patterns.")
        summary.append("")
        
        return "\n".join(summary)
    
    def save(self, filename='executive_summary.md', output_dir='output'):
        """Save executive summary to file."""
        summary_text = self.generate()
        
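        output_path = Path(output_dir) / filename
        output_path.parent.mkdir(parents=True, exist_ok=True)
        
        # Quotes may contain non-ASCII characters, so write UTF-8 explicitly
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(summary_text)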
        
        print(f"✓ Executive summary saved to: {output_path}")
        return output_path

print("✓ ExecutiveSummaryGenerator class defined")

## 7. Example Analysis: Remote Work Responses

Complete end-to-end analysis demonstrating all 15 outputs.

### 7.1 Load Data

In [None]:
# Load sample data
df = pd.read_csv('data/sample_responses.csv')

print(f"Loaded {len(df)} responses")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few responses:")
df.head()

### 7.2 Train ML Coder

In [None]:
# Initialize and fit ML coder
coder = MLOpenCoder(
    n_codes=10,              # Number of themes to discover
    method='tfidf_kmeans',   # Algorithm: 'tfidf_kmeans', 'lda', or 'nmf'
    min_confidence=0.3       # Minimum confidence threshold
)

# Fit on responses
coder.fit(df['response'])

print("\n" + "="*60)
print("ML CODING COMPLETE")
print("="*60)

### 7.3 Generate Results Package

In [None]:
# Create results package
results = OpenCodingResults(df, coder, response_col='response')

print("✓ Results package created")
print(f"  - {len(results.df)} responses coded")
print(f"  - {len(results.coder.codebook)} codes identified")

## 8. OUTPUT 1: Code Assignments

Complete code assignments with confidence scores.

In [None]:
# Get code assignments
assignments = results.get_code_assignments()

print("CODE ASSIGNMENTS")
print("="*60)
print(f"\nShowing first 10 of {len(assignments)} responses:\n")
assignments.head(10)

## 9. OUTPUT 2: Codebook

Complete codebook with definitions and examples.

In [None]:
# Get detailed codebook
codebook = results.get_codebook_detailed()

print("CODEBOOK")
print("="*60)
print(f"\n{len(codebook)} codes identified:\n")
codebook

## 10. OUTPUT 3: Code Frequency Table

In [None]:
# Get frequency table
freq_table = results.get_frequency_table()

print("CODE FREQUENCY TABLE")
print("="*60)
freq_table

## 11. OUTPUT 4: Confidence & Quality Metrics

In [None]:
# Get quality metrics
quality_metrics = results.get_quality_metrics()

print("QUALITY & CONFIDENCE METRICS")
print("="*60)
for metric, value in quality_metrics.items():
    print(f"{metric:.<40} {value}")

## 12. OUTPUT 5: Binary/Multi-Label Matrix

In [None]:
# Get binary matrix
binary_matrix = results.get_binary_matrix()

print("BINARY CODE MATRIX")
print("="*60)
print(f"Shape: {binary_matrix.shape}")
print(f"\nFirst 10 rows:\n")
binary_matrix.head(10)

## 13. OUTPUT 6: Representative Quotes

In [None]:
# Get representative quotes
quotes = results.get_representative_quotes(top_n=5)

print("REPRESENTATIVE QUOTES")
print("="*60)

for code_id, data in list(quotes.items())[:3]:  # Show first 3 codes
    print(f"\n{code_id}: {data['label']}")
    print("-" * 60)
    for i, quote in enumerate(data['quotes'], 1):
        print(f"{i}. [{quote['confidence']:.2f}] {quote['text'][:100]}...")
    print()

## 14. OUTPUT 7: Co-Occurrence Analysis

In [None]:
# Get co-occurrence pairs
cooccurrence = results.get_cooccurrence_pairs(min_count=2)

print("CO-OCCURRENCE ANALYSIS")
print("="*60)
print(f"\nTop code pairs that appear together:\n")
cooccurrence.head(10)

In [None]:
# Get full co-occurrence matrix
cooccur_matrix = results.get_cooccurrence_matrix()

print("\nCo-occurrence Matrix:")
cooccur_matrix

## 15. OUTPUT 8: Descriptive Statistics

In [None]:
# Get descriptive statistics
desc_stats = results.get_descriptive_stats()

print("DESCRIPTIVE STATISTICS")
print("="*60)
desc_stats

## 16. OUTPUT 9: Segmentation Analysis

(Optional - requires demographic column in data)
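As a rough illustration of what a segmentation table contains, the sketch below computes the share of responses carrying each code within each segment. It works on a made-up toy frame; the column names `age_group`, `CODE_01`, and `CODE_02` are assumptions for illustration and may not match the columns your data or `get_segmentation_analysis` actually use.

```python
import pandas as pd

# Hypothetical toy data: one row per response, one 0/1 column per code
toy = pd.DataFrame({
    "age_group": ["18-34", "18-34", "35-54", "35-54"],
    "CODE_01":   [1, 0, 1, 1],
    "CODE_02":   [0, 0, 1, 0],
})

# The mean of a 0/1 column within a group is the share of that group's
# responses that received the code; multiply by 100 for a percentage.
pct_by_segment = toy.groupby("age_group")[["CODE_01", "CODE_02"]].mean() * 100
print(pct_by_segment.round(1))
```

With real data, small segments produce unstable percentages, so it helps to report the group size alongside each figure.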

In [None]:
# Example: If you have a demographic column
# Uncomment and modify based on your data

# if 'age_group' in df.columns:
#     seg_analysis = results.get_segmentation_analysis('age_group')
#     print("SEGMENTATION ANALYSIS BY AGE GROUP")
#     print("="*60)
#     display(seg_analysis.head(20))  # display() is needed inside an if block
# else:
#     print("No demographic columns available for segmentation")

print("Segmentation analysis requires demographic columns in your data.")
print("Example columns: 'age_group', 'gender', 'department', etc.")

## 17. OUTPUT 10: Quality Assurance Report

In [None]:
# Get QA report
qa_report = results.get_qa_report()

print("QUALITY ASSURANCE REPORT")
print("="*60)
print(f"\nTimestamp: {qa_report['timestamp']}")
print(f"Method: {qa_report['method']}")
print(f"Total Responses: {qa_report['total_responses']:,}")
print(f"\nQuality Issues:")
print(f"  - Low Confidence Assignments: {qa_report['low_confidence_count']}")
print(f"  - Uncoded Responses: {qa_report['uncoded_count']}")
print(f"  - Multi-coded Responses: {qa_report['multi_coded_count']}")
print(f"\nQuality Metrics:")
for k, v in qa_report['quality_metrics'].items():
    print(f"  {k}: {v}")

## 18. OUTPUT 11: Visualizations

In [None]:
# Create visualizer
viz = CodingVisualizer(results)

In [None]:
# 1. Frequency Chart
fig = viz.plot_frequency_chart(top_n=10)
fig.show()

In [None]:
# 2. Co-occurrence Heatmap
fig = viz.plot_cooccurrence_heatmap()
fig.show()

In [None]:
# 3. Network Diagram
fig = viz.plot_network_diagram(min_cooccurrence=2)
if fig is not None:
    fig.show()

In [None]:
# 4. Distribution Histogram
fig = viz.plot_distribution_histogram()
fig.show()

In [None]:
# 5. Confidence Distribution
fig = viz.plot_confidence_distribution()
fig.show()

In [None]:
# 6. Word Cloud
fig = viz.plot_wordcloud()
if fig is not None:
    plt.show()

## 19. OUTPUT 14: Uncoded & Ambiguous Responses

In [None]:
# Uncoded responses
uncoded = results.get_uncoded_responses()

print("UNCODED RESPONSES")
print("="*60)
print(f"Total: {len(uncoded)}")
if len(uncoded) > 0:
    print("\nSample:")
    display(uncoded.head())  # a bare expression inside `if` is not rendered by Jupyter

In [None]:
# Low confidence responses
low_conf = results.get_low_confidence_responses(threshold=0.5)

print("LOW CONFIDENCE RESPONSES")
print("="*60)
print(f"Total: {len(low_conf)}")
if len(low_conf) > 0:
    print("\nSample:")
    display(low_conf.head())  # a bare expression inside `if` is not rendered by Jupyter

In [None]:
# Ambiguous (multi-coded) responses
ambiguous = results.get_ambiguous_responses(min_codes=3)

print("AMBIGUOUS RESPONSES (3+ codes)")
print("="*60)
print(f"Total: {len(ambiguous)}")
if len(ambiguous) > 0:
    print("\nSample:")
    display(ambiguous.head())  # a bare expression inside `if` is not rendered by Jupyter

## 20. OUTPUT 12: Export All Results

In [None]:
# Create exporter
exporter = ResultsExporter(results, output_dir='output')

# Export all formats
output_dir = exporter.export_all()

In [None]:
# Export comprehensive Excel file
excel_file = exporter.export_excel('ml_coding_results.xlsx')

## 21. OUTPUT 15: Executive Summary

In [None]:
# Generate executive summary
summary_gen = ExecutiveSummaryGenerator(results)
summary = summary_gen.generate(top_n_codes=5, top_n_quotes=3)

print(summary)

In [None]:
# Save executive summary
summary_gen.save('executive_summary.md', output_dir=output_dir)

## 22. OUTPUT 13: Method Documentation

In [None]:
# Generate method documentation
method_doc = f"""
# ML-Based Open Coding - Method Documentation

## Methodology

### Algorithm Used
- **Method:** {coder.method.upper()}
- **Number of Codes:** {coder.n_codes}
- **Minimum Confidence:** {coder.min_confidence}

### Process

1. **Text Preprocessing**
   - Lowercasing
   - Special character removal
   - Whitespace normalization

2. **Vectorization**
   - Vectorizer: {type(coder.vectorizer).__name__}
   - Max Features: 1000
   - Stop Words: Removed

3. **Model Training**
   - Algorithm: {type(coder.model).__name__}
   - Components/Clusters: {coder.n_codes}

4. **Code Assignment**
   - Threshold: {coder.min_confidence}
   - Multi-label: Yes
   - Confidence scoring: Enabled

### Quality Assurance

- **Coverage:** {quality_metrics.get('coverage_pct', 0):.1f}% of responses coded
- **Average Confidence:** {quality_metrics.get('avg_confidence', 0):.3f}
- **Validation:** Statistical clustering metrics computed

### Limitations

1. Automated coding may miss nuanced interpretations
2. Code labels are auto-generated from keywords
3. Confidence scores are probabilistic estimates
4. Edge cases and outliers may require manual review

### Recommendations

1. Review representative quotes for each code
2. Manually validate low-confidence assignments
3. Consider human coding for critical decisions
4. Use as exploratory tool to complement qualitative analysis

### Reproducibility

- Random Seed: 42
- Python Version: {sys.version}
- Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
"""

print(method_doc)

# Save method documentation
with open(output_dir / 'method_documentation.md', 'w') as f:
    f.write(method_doc)

print(f"\n✓ Method documentation saved to: {output_dir / 'method_documentation.md'}")

## 23. Complete Results Summary

In [None]:
print("\n" + "="*60)
print("ML-BASED OPEN CODING - COMPLETE RESULTS")
print("="*60)
print(f"\n✓ All 15 Essential Outputs Generated:\n")
print("  1. ✓ Code Assignments with confidence scores")
print("  2. ✓ Complete Codebook with examples")
print("  3. ✓ Code Frequency Tables")
print("  4. ✓ Quality & Confidence Metrics")
print("  5. ✓ Binary/Multi-Label Matrix")
print("  6. ✓ Representative Quotes")
print("  7. ✓ Co-Occurrence Analysis")
print("  8. ✓ Descriptive Statistics")
print("  9. ✓ Segmentation Analysis (if demographic data available)")
print(" 10. ✓ Quality Assurance Report")
print(" 11. ✓ Comprehensive Visualizations")
print(" 12. ✓ Multiple Export Formats (CSV, Excel, JSON)")
print(" 13. ✓ Method Documentation")
print(" 14. ✓ Uncoded & Ambiguous Responses")
print(" 15. ✓ Executive Summary")
print(f"\nOutput Directory: {output_dir}")
print(f"\nAnalysis Complete! 🎉")

---

## Next Steps

### Customization
- Adjust `n_codes` parameter to discover more/fewer themes
- Try different algorithms: `tfidf_kmeans`, `lda`, or `nmf`
- Modify `min_confidence` threshold for stricter/looser coding
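
One way to compare candidate values of `n_codes` for the `tfidf_kmeans` method is the silhouette score, where higher values suggest better-separated clusters. The sketch below uses scikit-learn directly rather than `MLOpenCoder`, whose internal settings may differ, and the toy responses are made up for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up responses for illustration only
texts = [
    "I love the flexibility of working from home",
    "Flexible hours help my work life balance",
    "Commute time saved is a huge benefit",
    "No commute means more time with family",
    "I feel isolated from my team",
    "Loneliness and isolation are real problems",
    "Video meetings are exhausting",
    "Too many video calls every day",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Fit K-means for each candidate number of clusters and record the score
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Highest silhouette at k={best_k}")
```

The score is only a heuristic. A codebook with a slightly lower score may still be easier to interpret, so reviewing the representative quotes for each candidate remains worthwhile.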

### Advanced Analysis
- Add demographic segmentation if you have group variables
- Conduct temporal analysis if you have date fields
- Compare multiple datasets or time periods
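
A temporal analysis can be sketched by grouping the binary code matrix by calendar month. The example below uses made-up data, and the column names `submitted_at`, `CODE_01`, and `CODE_02` are assumptions rather than names produced by this notebook.

```python
import pandas as pd

# Hypothetical toy data: a date column plus one 0/1 column per code
toy = pd.DataFrame({
    "submitted_at": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"]
    ),
    "CODE_01": [1, 1, 0, 1],
    "CODE_02": [0, 1, 1, 1],
})

# Group by calendar month; the mean of a 0/1 column is the share of
# responses in that month that received the code.
month = toy["submitted_at"].dt.to_period("M")
monthly_pct = toy.groupby(month)[["CODE_01", "CODE_02"]].mean() * 100
print(monthly_pct)
```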

### Validation
- Review representative quotes for each code
- Manually validate low-confidence assignments
- Compare with manual coding (if available)
- Use binary matrix for statistical testing
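
The last two validation steps can be sketched as follows: Cohen's kappa measures agreement between the ML assignment and a manual coder for a single code, and a chi-square test checks whether a code's presence differs across groups. All values below are made up for illustration, and the variable names are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Made-up 0/1 assignments for ONE code across eight responses
ml_code     = [1, 1, 0, 0, 1, 0, 1, 0]
manual_code = [1, 1, 0, 0, 1, 0, 0, 0]

# Agreement beyond chance between the manual coder and the ML coder
kappa = cohen_kappa_score(manual_code, ml_code)
print(f"Cohen's kappa: {kappa:.2f}")

# Association between a (hypothetical) segment variable and the code
segment = ["A", "A", "A", "A", "B", "B", "B", "B"]
table = pd.crosstab(
    pd.Series(segment, name="segment"),
    pd.Series(ml_code, name="code_present"),
)
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

With counts as small as these the chi-square approximation is unreliable. On real data, check that expected cell counts are reasonably large, or consider an exact test instead.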

### Export & Share
- All results exported to timestamped folder
- Excel file contains all major outputs
- Executive summary ready for stakeholders
- Method documentation ensures reproducibility
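
Because run folders are named `coding_run_<YYYYMMDD_HHMMSS>`, sorting the names alphabetically also sorts them chronologically. The sketch below uses that to locate the most recent run; `latest_run_dir` is a hypothetical helper, and the folders are created in a temporary directory purely for demonstration.

```python
import tempfile
from pathlib import Path


def latest_run_dir(output_dir):
    """Return the newest coding_run_* folder in output_dir, or None if there is none."""
    runs = sorted(p for p in Path(output_dir).glob("coding_run_*") if p.is_dir())
    return runs[-1] if runs else None


with tempfile.TemporaryDirectory() as tmp:
    empty_result = latest_run_dir(tmp)  # no runs yet

    # Create a few fake run folders, deliberately out of order
    for stamp in ("20240101_090000", "20240315_173000", "20240102_120000"):
        (Path(tmp) / f"coding_run_{stamp}").mkdir()

    newest_name = latest_run_dir(tmp).name

print(newest_name)
```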