# IR Evaluation: Impact of Stopword Removal

This notebook implements the Information Retrieval (IR) evaluation as part of the Stop Word Project.
We compare the retrieval performance of a Search Engine using two datasets:
1. **Full Text**: Documents with all words (Stopwords included).
2. **Cleaned Text**: Documents with Khmer stopwords removed.

## Objectives:
- Prepare the two datasets (segmentation and filtering).
- Implement a TF-IDF based Vector Space Model.
- Evaluate retrieval quality using a **Known-Item Retrieval** task (simulating search ranking).
- Metrics: Mean Rank, Recall@K.
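
As a concrete reading of the two metrics, the sketch below computes Mean Rank and Recall@K from a list of 1-based ranks. The helper name `summarize_ranks` and the rank values are illustrative assumptions, not results from this project.

```python
from statistics import mean

def summarize_ranks(ranks, k=10):
    """Return (mean_rank, recall_at_k) for 1-based ranks of the known item."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    mean_rank = mean(ranks)
    # With exactly one relevant document per query, Recall@K is the
    # fraction of queries whose known item lands in the top K.
    recall_at_k = sum(1 for r in ranks if r <= k) / len(ranks)
    return mean_rank, recall_at_k

# Hypothetical ranks for five queries (made up for illustration)
example_ranks = [1, 1, 3, 12, 1]
mean_rank, recall_at_10 = summarize_ranks(example_ranks, k=10)
print(f"Mean Rank: {mean_rank:.2f}, Recall@10: {recall_at_10:.2f}")
```

A single badly ranked query (rank 12 here) pulls the mean up noticeably while lowering Recall@10 by only one query, which is why the notebook reports both.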


In [None]:
# Install necessary libraries if not present
!pip install khmer-nltk scikit-learn pandas matplotlib seaborn scipy

In [None]:
import csv
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

try:
    from khmernltk import word_tokenize
    print("Khmer NLTK loaded successfully.")
except ImportError:
    print("Khmer NLTK not found. Please run the installation cell above.")


## 1. Load Resources and Data
We load the comprehensive Stopword list and the raw text corpus.

In [None]:
def load_custom_stopwords(csv_path):
    stopwords = set()
    if not os.path.exists(csv_path):
        print(f"Warning: Stopword file not found at {csv_path}")
        return stopwords
        
    with open(csv_path, encoding="utf-8-sig") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Treat every term as a stopword unless its linguistic group is
            # 'Content Words' (same rule as the previous notebook).
            group = (row.get("linguistic_group") or "").lower()
            term = (row.get("term") or "").strip()
            if term and "content word" not in group:
                stopwords.add(term)
    return stopwords

STOPWORDS_PATH = "../stopwords/FIle_Stopwords.csv"
KHMER_STOPWORDS = load_custom_stopwords(STOPWORDS_PATH)
print(f"Loaded {len(KHMER_STOPWORDS)} Khmer stopwords.")
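
To make the filtering rule concrete, here is a minimal sketch that applies the same test to an in-memory CSV. Only the column names `term` and `linguistic_group` come from the loader above; the rows are hypothetical, and Latin placeholders stand in for Khmer terms.

```python
import csv
import io

# Hypothetical rows; the real file is expected to hold Khmer terms.
sample_csv = io.StringIO(
    "term,linguistic_group\n"
    "term_a,Conjunctions\n"
    "term_b,Content Words\n"
    "term_c,Pronouns\n"
)

stopwords = set()
for row in csv.DictReader(sample_csv):
    group = (row.get("linguistic_group") or "").lower()
    # Anything that is not labelled as a content word counts as a stopword.
    if "content word" not in group:
        stopwords.add(row["term"].strip())

print(sorted(stopwords))
```

Under this rule a term with a missing or misspelled group label is also treated as a stopword, so the labels in the CSV are worth checking before running the evaluation.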

In [None]:
def load_and_process_corpus(filepath, limit=5000):
    """
    Reads the raw file, tokenizes it, and creates two versions:
    1. segmented_text (with stopwords)
    2. filtered_text (without stopwords)
    """
    raw_docs = []
    corpus_sw = []
    corpus_no_sw = []
    
    if not os.path.exists(filepath):
        print(f"Error: Data file not found at {filepath}")
        return [], [], []

    with open(filepath, 'r', encoding='utf-8') as f:
        count = 0
        for line in f:
            line = line.strip()
            if not line: 
                continue
                
            try:
                tokens = word_tokenize(line)
                if not tokens: continue
                
                # Join for TF-IDF (space separated)
                text_sw = " ".join(tokens)
                
                # Remove stopwords
                tokens_filtered = [t for t in tokens if t not in KHMER_STOPWORDS]
                text_no_sw = " ".join(tokens_filtered)
                
                raw_docs.append(line)
                corpus_sw.append(text_sw)
                corpus_no_sw.append(text_no_sw)
                
                count += 1
                if count >= limit:
                    break
            except Exception:
                # Skip lines that the tokenizer cannot process
                continue
                
    print(f"Processed {len(corpus_sw)} documents.")
    return raw_docs, corpus_sw, corpus_no_sw

# Load a sample of 3000 documents for evaluation speed
DATA_PATH = "../data/raw/news_text_file_150k.txt"
raw_docs, docs_with_sw, docs_without_sw = load_and_process_corpus(DATA_PATH, limit=3000)

## 2. IR System Implementation
We use TF-IDF weighting and Cosine Similarity.
We define a function `evaluate_retrieval` that takes a corpus and a list of query document indices.
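
The scoring step can be sketched on a toy corpus as follows. The documents are hypothetical English placeholders for pre-segmented text, and splitting on whitespace (`tokenizer=str.split`) is an assumption of this sketch so that segmenter output is kept intact.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical pre-segmented documents (tokens separated by spaces).
toy_corpus = [
    "flood warning river province",
    "election result province council",
    "football match final result",
]

# Split on whitespace only; token_pattern=None disables the default regex.
vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None, lowercase=False)
X = vectorizer.fit_transform(toy_corpus)

# Known-item query: the first document is used to query the whole corpus.
query_vec = vectorizer.transform([toy_corpus[0]])
scores = cosine_similarity(query_vec, X)[0]

best = int(scores.argmax())
print(best, scores.round(3))
```

The query scores 1.0 against itself because TF-IDF rows are L2-normalised, scores above zero against the document that shares a token with it, and scores zero against the document with no shared tokens.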

In [None]:
def evaluate_retrieval(corpus, query_indices, top_k=10):
    """
    Evaluates retrieval performance using Known-Item Retrieval.
    For each document in query_indices, we try to retrieve it from the corpus.
    Ideally, it should be Rank 1.
    """
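    # 1. Build Index
    # Documents are already segmented and space-joined, so split on whitespace
    # only. The default token_pattern relies on \w, which does not match Khmer
    # dependent vowel signs and would break segmented words apart.
    vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None, lowercase=False)
    X_corpus = vectorizer.fit_transform(corpus)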
    
    # 2. Process Queries
    # The queries are the documents themselves
    queries = [corpus[i] for i in query_indices]
    X_queries = vectorizer.transform(queries)
    
    # 3. Compute Similarity
    # Shape: (n_queries, n_corpus)
    sim_matrix = cosine_similarity(X_queries, X_corpus)
    
    ranks = []
    hits_at_k = 0
    
    for i, true_doc_idx in enumerate(query_indices):
        scores = sim_matrix[i]
        
        # Sort indices by score descending
        sorted_indices = np.argsort(scores)[::-1]
        
        # Find where the true document is in the ranked list
        # np.where returns a tuple, [0][0] gets the index
        rank_positions = np.where(sorted_indices == true_doc_idx)[0]
        
        if len(rank_positions) > 0:
            rank = rank_positions[0] + 1 # 1-based rank
        else:
            rank = len(corpus) # Should not happen if query is in corpus
            
        ranks.append(rank)
        if rank <= top_k:
            hits_at_k += 1
            
    mean_rank = np.mean(ranks)
    recall_at_k = hits_at_k / len(query_indices)
    
    return mean_rank, recall_at_k, ranks
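
The loop above sorts each score vector and then searches for the query's index. A lighter variant, shown here as a sketch rather than a drop-in replacement, counts how many documents score strictly higher than the known item. The helper name `rank_of_known_item` is hypothetical. The two approaches can differ when scores tie (for example, duplicate documents): this variant resolves ties in favour of the known item.

```python
import numpy as np

def rank_of_known_item(scores, true_idx):
    """1-based rank of scores[true_idx]; ties are resolved in the item's favour."""
    scores = np.asarray(scores, dtype=float)
    return int(np.sum(scores > scores[true_idx])) + 1

# Hypothetical similarity scores for one query against four documents
toy_scores = np.array([0.10, 0.85, 0.40, 0.85])
rank_mid = rank_of_known_item(toy_scores, 2)   # two scores exceed 0.40
rank_tied = rank_of_known_item(toy_scores, 3)  # tied for the top score
print(rank_mid, rank_tied)
```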


## 3. Run Experiments
We select 50 random documents as "queries" (`NUM_QUERIES = 50`) and test retrieval on both datasets.

In [None]:
random.seed(42)
NUM_QUERIES = 50
if len(docs_with_sw) > NUM_QUERIES:
    query_indices = random.sample(range(len(docs_with_sw)), NUM_QUERIES)
else:
    query_indices = list(range(len(docs_with_sw)))

print(f"Selected {len(query_indices)} random documents as queries.")

# Experiment 1: With Stopwords
print("\n--- Evaluating WITH Stopwords ---")
mr_sw, r_k_sw, ranks_sw = evaluate_retrieval(docs_with_sw, query_indices)
print(f"Mean Rank: {mr_sw:.2f}")
print(f"Recall@10: {r_k_sw:.2f}")

# Experiment 2: Without Stopwords
print("\n--- Evaluating WITHOUT Stopwords ---")
mr_now, r_k_now, ranks_now = evaluate_retrieval(docs_without_sw, query_indices)
print(f"Mean Rank: {mr_now:.2f}")
print(f"Recall@10: {r_k_now:.2f}")


## 4. Analysis and Visualization
We compare the rank distribution to see if removing stopwords helps the correct document appear higher (closer to rank 1).

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(ranks_sw, alpha=0.5, label='With Stopwords', bins=20)
plt.hist(ranks_now, alpha=0.5, label='Without Stopwords', bins=20)
plt.xlabel('Rank of Relevant Document')
plt.ylabel('Frequency')
plt.title('Distribution of Ranks (Lower is Better)')
plt.legend()
plt.grid(True)
plt.show()

print("Diff in Mean Rank:", mr_sw - mr_now)
if mr_now < mr_sw:
    print("Conclusion: Removing stopwords IMPROVED retrieval performance.")
else:
    print("Conclusion: Removing stopwords DID NOT improve retrieval performance (or slight degradation).")

### Interpretation
- **Mean Rank**: The average position of the correct document. Lower is better (1.0 is perfect).
- **Recall@10**: Fraction of queries for which the correct document appeared in the top 10 results (1.0 is perfect).

In highly specific retrieval (like this known-item task), stopwords can sometimes help by providing phrase specificity, but in general topic retrieval, they add noise. If the Mean Rank decreases after removal, our customized stopword list is effective.
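
Since a mean can be dominated by a few badly ranked queries, it may help to look at the paired per-query change as well. The sketch below assumes two rank lists aligned by query, in the shape of `ranks_sw` and `ranks_now`; the helper name `compare_ranks` and the numbers are hypothetical.

```python
import numpy as np

def compare_ranks(ranks_before, ranks_after):
    """Summarise per-query rank changes between two paired runs."""
    before = np.asarray(ranks_before)
    after = np.asarray(ranks_after)
    if before.shape != after.shape:
        raise ValueError("rank lists must be paired per query")
    diff = before - after  # positive = known item moved closer to rank 1
    return {
        "improved": int(np.sum(diff > 0)),
        "worsened": int(np.sum(diff < 0)),
        "unchanged": int(np.sum(diff == 0)),
        "mean_change": float(diff.mean()),
    }

# Hypothetical ranks for four queries, before and after stopword removal
summary = compare_ranks([1, 4, 2, 9], [1, 2, 3, 5])
print(summary)
```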

# Apply to IR

In [1]:
# ============================================================================
# ENHANCED IR EVALUATION WITH TF-IDF QUALITY FILTERING
# ============================================================================

import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import ndcg_score
from scipy import stats

class KhmerIREvaluator:
    """Advanced IR evaluation with Khmer-specific preprocessing and TF-IDF cleaning"""
    
    def __init__(self, vectorizer_params=None):
        self.khmer_pattern = re.compile(r'[\u1780-\u17FF]')
        
        if vectorizer_params is None:
            self.vectorizer_params = {
                'max_features': 5000,
                'min_df': 2,
                'max_df': 0.95,
                'use_idf': True,
                'smooth_idf': True,
                'lowercase': False
            }
        else:
            self.vectorizer_params = vectorizer_params
        
        self.results = {}
        
    def clean_tokens(self, text, min_length=2):
        """Clean text to keep only meaningful Khmer tokens"""
        tokens = text.split()
        cleaned_tokens = []
        
        for token in tokens:
            # Must contain Khmer
            if not self.khmer_pattern.search(token):
                continue
            
            # Must be long enough (filter out single characters)
            if len(token) < min_length:
                continue
            
            # Count characters in U+1780 to U+17DD (Khmer letters, vowel signs,
            # diacritics and punctuation); Khmer digits (U+17E0 to U+17E9) are excluded
            khmer_chars = len([c for c in token if '\u1780' <= c <= '\u17DD'])
            
            # Must have at least min_length such characters
            if khmer_chars < min_length:
                continue
            
            cleaned_tokens.append(token)
        
        return ' '.join(cleaned_tokens)
    
    def load_and_preprocess_corpus(self, filepath, label):
        """Load and preprocess corpus with quality filtering"""
        print(f"  Loading: {label}")
        
        if not os.path.exists(filepath):
            print(f"  ❌ File not found: {filepath}")
            return None
        
        # Read documents
        with open(filepath, 'r', encoding='utf-8') as f:
            documents = [line.strip() for line in f if line.strip()]
        
        if not documents:
            print(f"  ⚠️  Empty file!")
            return None
        
        # Clean documents (keep only Khmer, min 2 characters)
        cleaned_docs = [self.clean_tokens(doc, min_length=2) for doc in documents]
        cleaned_docs = [doc for doc in cleaned_docs if doc]  # Remove empty
        
        # Additional filtering: remove very short documents
        filtered_docs = [doc for doc in cleaned_docs if len(doc.split()) >= 3]
        
        if len(filtered_docs) < len(cleaned_docs):
            print(f"  Filtered out {len(cleaned_docs) - len(filtered_docs)} very short documents")
        
        if not filtered_docs:
            print(f"  ❌ No valid documents after filtering!")
            return None
        
        # Diagnose token quality
        self._diagnose_token_quality(filtered_docs, label)
        
        return filtered_docs
    
    def _diagnose_token_quality(self, documents, label):
        """Diagnose token quality in loaded corpus"""
        sample_docs = documents[:50] if len(documents) > 50 else documents
        all_tokens = []
        for doc in sample_docs:
            all_tokens.extend(doc.split())
        
        # Categorize tokens
        single_char = [t for t in all_tokens if len(t) == 1]
        two_char = [t for t in all_tokens if len(t) == 2]
        three_plus = [t for t in all_tokens if len(t) >= 3]
        
        khmer_tokens = [t for t in all_tokens if self.khmer_pattern.search(t)]
        khmer_pct = (len(khmer_tokens) / len(all_tokens) * 100) if all_tokens else 0
        
        print(f"    Documents: {len(documents):,}")
        print(f"    Token quality: {len(three_plus)}/{len(all_tokens)} ({len(three_plus)/len(all_tokens)*100:.1f}%) are 3+ chars")
        print(f"    Khmer content: {khmer_pct:.1f}%")
        
        if single_char:
            print(f"    ⚠️  Single-char tokens: {len(single_char)}")
        
        return {
            'total_docs': len(documents),
            'total_tokens': len(all_tokens),
            'single_char_tokens': len(single_char),
            'khmer_pct': khmer_pct
        }
    
    def build_index(self, corpus, use_quality_filtering=True):
        """Build TF-IDF index with optional quality filtering"""
        
        if use_quality_filtering:
            # Drop tokens shorter than 2 characters, but keep one entry per
            # document (possibly empty) so that row i of the index still
            # corresponds to corpus[i] in evaluate_retrieval.
            filtered_docs = [
                ' '.join(t for t in doc.split() if len(t) >= 2)
                for doc in corpus
            ]
            
            if not any(filtered_docs):
                print("  ❌ No valid tokens after filtering!")
                return None, None
            
            corpus = filtered_docs
        
        # Create vectorizer
        def khmer_tokenizer(text):
            """Custom tokenizer that only splits on whitespace"""
            return text.split()
        
        vectorizer = TfidfVectorizer(
            **self.vectorizer_params,
            tokenizer=khmer_tokenizer,
            token_pattern=None  # Disable default pattern
        )
        
        try:
            X = vectorizer.fit_transform(corpus)
            return vectorizer, X
        except Exception as e:
            print(f"  ❌ Error building index: {e}")
            return None, None
    
    def evaluate_retrieval(self, corpus, query_indices, top_k_values=(1, 5, 10, 20, 50)):
        """
        Evaluate retrieval performance with multiple metrics
        
        Returns:
            dict: Dictionary containing all evaluation metrics
        """
        # Build index with quality filtering
        vectorizer, X_corpus = self.build_index(corpus, use_quality_filtering=True)
        
        if X_corpus is None:
            print("  ❌ Could not build index")
            return None
        
        # Prepare queries (documents themselves for known-item retrieval)
        queries = [corpus[i] for i in query_indices]
        X_queries = vectorizer.transform(queries)
        
        # Compute similarity matrix
        sim_matrix = cosine_similarity(X_queries, X_corpus)
        
        # Initialize result storage
        results = {
            'ranks': [],
            'precision_at_k': {k: [] for k in top_k_values},
            'recall_at_k': {k: [] for k in top_k_values},
            'ndcg_at_k': {k: [] for k in top_k_values},
            'query_scores': [],
            'vocab_size': X_corpus.shape[1],
            'doc_count': len(corpus)
        }
        
        # Process each query
        for i, true_doc_idx in enumerate(query_indices):
            scores = sim_matrix[i]
            
            # Get ranked list (descending)
            sorted_indices = np.argsort(scores)[::-1]
            
            # Find rank of true document (1-based)
            rank_positions = np.where(sorted_indices == true_doc_idx)[0]
            rank = rank_positions[0] + 1 if len(rank_positions) > 0 else len(corpus)
            results['ranks'].append(rank)
            
            # Calculate binary relevance vector for NDCG
            relevance = np.zeros(len(corpus))
            relevance[true_doc_idx] = 1
            
            # Calculate metrics at different K values
            for k in top_k_values:
                if k <= len(corpus):
                    # Precision@K
                    relevant_in_top_k = np.sum(relevance[sorted_indices[:k]])
                    precision = relevant_in_top_k / k
                    results['precision_at_k'][k].append(precision)
                    
                    # Recall@K
                    total_relevant = 1  # Only one relevant document
                    recall = relevant_in_top_k / total_relevant
                    results['recall_at_k'][k].append(recall)
                    
                    # NDCG@K
                    ndcg = ndcg_score([relevance], [scores], k=k)
                    results['ndcg_at_k'][k].append(ndcg)
                else:
                    results['precision_at_k'][k].append(np.nan)
                    results['recall_at_k'][k].append(np.nan)
                    results['ndcg_at_k'][k].append(np.nan)
            
            # Store query scores for analysis
            results['query_scores'].append(scores[true_doc_idx])
        
        # Calculate aggregate metrics
        results['mean_rank'] = np.mean(results['ranks'])
        results['median_rank'] = np.median(results['ranks'])
        results['std_rank'] = np.std(results['ranks'])
        results['min_rank'] = np.min(results['ranks'])
        results['max_rank'] = np.max(results['ranks'])
        
        for k in top_k_values:
            if k <= len(corpus):
                results[f'mean_precision@{k}'] = np.nanmean(results['precision_at_k'][k])
                results[f'mean_recall@{k}'] = np.nanmean(results['recall_at_k'][k])
                results[f'mean_ndcg@{k}'] = np.nanmean(results['ndcg_at_k'][k])
            else:
                results[f'mean_precision@{k}'] = np.nan
                results[f'mean_recall@{k}'] = np.nan
                results[f'mean_ndcg@{k}'] = np.nan
        
        # Calculate success rate (document found in top K)
        results['success_rate'] = {}
        for k in top_k_values:
            if k <= len(corpus):
                results['success_rate'][k] = np.mean([1 if r <= k else 0 for r in results['ranks']])
            else:
                results['success_rate'][k] = np.nan
        
        return results
    
    def evaluate_all_corpora(self, corpora_dict, query_indices, top_k_values=(1, 5, 10, 20, 50)):
        """Evaluate retrieval performance across multiple corpora"""
        
        comparison_results = {}
        
        print("\n" + "="*80)
        print(" COMPREHENSIVE IR EVALUATION WITH QUALITY FILTERING")
        print("="*80)
        
        for label, corpus in corpora_dict.items():
            print(f"\n📊 Evaluating: {label}")
            print(f"   Corpus size: {len(corpus):,} documents")
            
            results = self.evaluate_retrieval(corpus, query_indices, top_k_values)
            
            if results is None:
                print(f"   ❌ Evaluation failed")
                continue
            
            comparison_results[label] = results
            
            print(f"   ✓ Mean Rank: {results['mean_rank']:.2f}")
            print(f"   ✓ Median Rank: {results['median_rank']:.1f}")
            print(f"   ✓ Vocabulary size: {results['vocab_size']:,}")
            
            for k in [5, 10, 20]:
                # Only report K values that were actually evaluated
                if k <= len(corpus) and f'mean_recall@{k}' in results:
                    print(f"   ✓ Recall@{k}: {results[f'mean_recall@{k}']:.3f}")
        
        print("\n" + "="*80)
        return comparison_results
    
    def create_comparison_dataframe(self, comparison_results):
        """Convert comparison results to pandas DataFrame"""
        
        metrics_data = []
        
        for label, results in comparison_results.items():
            row = {
                'Corpus': label,
                'Documents': results['doc_count'],
                'Vocab_Size': results['vocab_size'],
                'Mean_Rank': results['mean_rank'],
                'Median_Rank': results['median_rank'],
                'Std_Rank': results['std_rank'],
                'Min_Rank': results['min_rank'],
                'Max_Rank': results['max_rank']
            }
            
            # Add precision, recall, NDCG at various K
            for k in [1, 3, 5, 10, 20]:
                if f'mean_precision@{k}' in results:
                    row[f'Precision@{k}'] = results[f'mean_precision@{k}']
                    row[f'Recall@{k}'] = results[f'mean_recall@{k}']
                    row[f'NDCG@{k}'] = results[f'mean_ndcg@{k}']
                    row[f'Success_Rate@{k}'] = results['success_rate'].get(k, np.nan)
                else:
                    row[f'Precision@{k}'] = np.nan
                    row[f'Recall@{k}'] = np.nan
                    row[f'NDCG@{k}'] = np.nan
                    row[f'Success_Rate@{k}'] = np.nan
            
            metrics_data.append(row)
        
        df = pd.DataFrame(metrics_data)
        
        # Sort by Mean Rank (ascending - better performance first)
        df = df.sort_values('Mean_Rank')
        
        return df

# ============================================================================
# MAIN EXECUTION
# ============================================================================

if __name__ == "__main__":
    # Configuration
    processed_dir = r'D:\Year 5\S1\Information Retrieval\StopwordProject\khmer_stopword_project\data\processed'
    output_dir = r'D:\Year 5\S1\Information Retrieval\StopwordProject\khmer_stopword_project\data\ir_evaluation_enhanced'
    
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    
    # Files to analyze
    files_to_analyze = [
        ('Original', 'original_segmented_sentences.txt'),
        ('No_All_Stopwords', 'no_all_stopwords.txt'),
        ('No_Auxiliary_Verbs', 'no_Auxiliary_Verbs___Aspect_Markers.txt'),
        ('No_Conjunctions', 'no_Conjunctions.txt'),
        ('No_Determiners', 'no_Determiners_and_Quantifiers.txt'),
        ('No_Function_Nouns', 'no_Function_Nouns.txt'),
        ('No_Numbers', 'no_Numbers_and_Time_Expressions.txt'),
        ('No_Particles', 'no_Particles_and_Discourse_Markers.txt'),
        ('No_Politeness', 'no_Politeness_and_Honorifics.txt'),
        ('No_Prepositions', 'no_Prepositions___Relational_Words.txt'),
        ('No_Pronouns', 'no_Pronouns.txt'),
        ('No_Questions', 'no_Question_and_Negation_Words.txt')
    ]
    
    # Initialize evaluator
    print("="*80)
    print(" ENHANCED IR EVALUATION SYSTEM")
    print(" With Khmer-specific preprocessing and quality filtering")
    print("="*80)
    
    evaluator = KhmerIREvaluator()
    
    # Load and preprocess all corpora
    print("\n📁 Loading and preprocessing corpora:")
    print("-"*60)
    
    corpora = {}
    for label, filename in files_to_analyze:
        filepath = os.path.join(processed_dir, filename)
        corpus = evaluator.load_and_preprocess_corpus(filepath, label)
        
        if corpus is not None:
            corpora[label] = corpus
        else:
            print(f"  ⚠️  Skipping {label} - could not load/preprocess")
    
    print(f"\n✓ Successfully loaded {len(corpora)} corpora")
    
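    # Determine query set size based on smallest corpus
    if not corpora:
        raise SystemExit("No corpora could be loaded; check processed_dir.")
    min_corpus_size = min(len(corpus) for corpus in corpora.values())
    # Keep at least one query even when the smallest corpus is tiny
    NUM_QUERIES = max(1, min(100, min_corpus_size - 10))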
    
    print(f"\n📊 Corpus sizes for evaluation:")
    for label, corpus in corpora.items():
        print(f"  {label:<25}: {len(corpus):>6} documents")
    
    print(f"\n📝 Using {NUM_QUERIES} queries (based on smallest corpus: {min_corpus_size} documents)")
    
    # Set random seed for reproducibility
    RANDOM_SEED = 42
    np.random.seed(RANDOM_SEED)
    random.seed(RANDOM_SEED)
    
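    # Create query indices (same positions for all corpora). Note: each corpus
    # is filtered separately, so a given position may not map to the same
    # source sentence in every corpus if different lines were dropped.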
    query_indices = random.sample(range(min_corpus_size), NUM_QUERIES)
    
    # Run comprehensive evaluation
    comparison_results = evaluator.evaluate_all_corpora(
        corpora, 
        query_indices,
        top_k_values=[1, 3, 5, 10, 20, 50]
    )
    
    # Create comparison DataFrame
    comparison_df = evaluator.create_comparison_dataframe(comparison_results)
    
    print("\n📋 COMPARISON RESULTS (Sorted by Mean Rank - Best First):")
    print("="*100)
    pd.set_option('display.float_format', lambda x: f'{x:.3f}')
    print(comparison_df[['Corpus', 'Documents', 'Vocab_Size', 'Mean_Rank', 
                         'Recall@5', 'Recall@10', 'Recall@20']].to_string(index=False))
    
    # Save results to CSV
    comparison_csv = os.path.join(output_dir, "ir_comparison_results_enhanced.csv")
    comparison_df.to_csv(comparison_csv, index=False, encoding='utf-8-sig')
    print(f"\n✓ Full comparison results saved to: {comparison_csv}")
    
    # ============================================================================
    # VISUALIZATIONS
    # ============================================================================
    
    print("\n📊 Creating visualizations...")
    
    # Set style
    plt.style.use('seaborn-v0_8-darkgrid')
    
    # 1. Mean Rank Comparison
    fig1, ax1 = plt.subplots(figsize=(14, 8))
    
    # Sort by Mean Rank for better visualization
    plot_df = comparison_df.sort_values('Mean_Rank', ascending=True)
    
    colors = plt.cm.coolwarm(np.linspace(0, 1, len(plot_df)))
    bars = ax1.barh(range(len(plot_df)), plot_df['Mean_Rank'], color=colors)
    
    ax1.set_yticks(range(len(plot_df)))
    ax1.set_yticklabels(plot_df['Corpus'])
    ax1.set_xlabel('Mean Rank (Lower is Better)', fontsize=12, fontweight='bold')
    ax1.set_title('IR Performance: Mean Rank Comparison', fontsize=14, fontweight='bold')
    ax1.invert_yaxis()  # Best (lowest) mean rank at top
    
    # Add value labels
    for i, (bar, rank) in enumerate(zip(bars, plot_df['Mean_Rank'])):
        ax1.text(rank + 0.5, bar.get_y() + bar.get_height()/2,
                f'{rank:.1f}', ha='left', va='center', fontsize=10)
    
    plt.tight_layout()
    plt.savefig(os.path.join(output_dir, 'mean_rank_comparison_enhanced.png'), 
                dpi=300, bbox_inches='tight')
    plt.close()
    
    # 2. Performance Metrics Comparison
    fig2, axes2 = plt.subplots(2, 2, figsize=(16, 12))
    fig2.suptitle('IR Performance Metrics Comparison', fontsize=16, fontweight='bold')
    
    # Plot 2.1: Recall@K comparison
    k_values = [5, 10, 20]
    for k in k_values:
        axes2[0, 0].plot(plot_df['Corpus'], plot_df[f'Recall@{k}'], 
                        marker='o', label=f'Recall@{k}', linewidth=2)
    axes2[0, 0].set_xlabel('Corpus')
    axes2[0, 0].set_ylabel('Recall Score')
    axes2[0, 0].set_title('Recall at Different K Values', fontsize=12, fontweight='bold')
    axes2[0, 0].legend()
    axes2[0, 0].grid(True, alpha=0.3)
    plt.setp(axes2[0, 0].get_xticklabels(), rotation=45, ha='right')
    
    # Plot 2.2: NDCG@K comparison
    for k in k_values:
        axes2[0, 1].plot(plot_df['Corpus'], plot_df[f'NDCG@{k}'], 
                        marker='s', label=f'NDCG@{k}', linewidth=2)
    axes2[0, 1].set_xlabel('Corpus')
    axes2[0, 1].set_ylabel('NDCG Score')
    axes2[0, 1].set_title('NDCG at Different K Values', fontsize=12, fontweight='bold')
    axes2[0, 1].legend()
    axes2[0, 1].grid(True, alpha=0.3)
    plt.setp(axes2[0, 1].get_xticklabels(), rotation=45, ha='right')
    
    # Plot 2.3: Vocabulary size vs Performance
    ax2 = axes2[1, 0]
    scatter = ax2.scatter(plot_df['Vocab_Size'], plot_df['Mean_Rank'], 
                         s=plot_df['Documents']/10, alpha=0.7,
                         c=plot_df['Mean_Rank'], cmap='coolwarm')
    ax2.set_xlabel('Vocabulary Size', fontsize=12)
    ax2.set_ylabel('Mean Rank', fontsize=12)
    ax2.set_title('Vocabulary Size vs Mean Rank', fontsize=12, fontweight='bold')
    ax2.grid(True, alpha=0.3)
    
    # Add labels for top performers
    for i, row in plot_df.iterrows():
        if row['Mean_Rank'] <= plot_df['Mean_Rank'].quantile(0.25):
            ax2.annotate(row['Corpus'], 
                        (row['Vocab_Size'], row['Mean_Rank']),
                        xytext=(5, 5), textcoords='offset points',
                        fontsize=9, alpha=0.8)
    
    # Plot 2.4: Document count vs Performance
    ax3 = axes2[1, 1]
    scatter = ax3.scatter(plot_df['Documents'], plot_df['Mean_Rank'], 
                         s=plot_df['Vocab_Size']/10, alpha=0.7,
                         c=plot_df['Mean_Rank'], cmap='coolwarm')
    ax3.set_xlabel('Document Count', fontsize=12)
    ax3.set_ylabel('Mean Rank', fontsize=12)
    ax3.set_title('Document Count vs Mean Rank', fontsize=12, fontweight='bold')
    ax3.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(os.path.join(output_dir, 'performance_metrics_comparison.png'), 
                dpi=300, bbox_inches='tight')
    plt.close()
    
    # 3. Heatmap of Performance Metrics
    fig3, ax3 = plt.subplots(figsize=(12, 8))
    
    # Prepare data for heatmap
    heatmap_data = plot_df.set_index('Corpus')[['Recall@5', 'Recall@10', 'Recall@20', 
                                               'NDCG@5', 'NDCG@10', 'NDCG@20']]
    
    sns.heatmap(heatmap_data, annot=True, fmt='.3f', cmap='YlOrRd', 
                square=True, ax=ax3, cbar_kws={'label': 'Score'})
    ax3.set_title('Retrieval Performance Metrics Heatmap', fontsize=14, fontweight='bold')
    plt.setp(ax3.get_xticklabels(), rotation=45, ha='right')
    plt.setp(ax3.get_yticklabels(), rotation=0)
    
    plt.tight_layout()
    plt.savefig(os.path.join(output_dir, 'performance_heatmap_enhanced.png'), 
                dpi=300, bbox_inches='tight')
    plt.close()
    
    # ============================================================================
    # STATISTICAL ANALYSIS
    # ============================================================================
    
    print("\n📈 Performing statistical analysis...")
    
    def perform_statistical_analysis(comparison_results, baseline='Original'):
        """Perform statistical analysis to determine significant differences"""
        
        print("\n" + "="*80)
        print("STATISTICAL ANALYSIS: Comparison to Baseline")
        print("="*80)
        
        if baseline not in comparison_results:
            print(f"  ⚠️  Baseline '{baseline}' not found in results")
            return None
        
        baseline_ranks = np.array(comparison_results[baseline]['ranks'])
        
        analysis_results = []
        
        for label, results in comparison_results.items():
            if label == baseline:
                continue
                
            current_ranks = np.array(results['ranks'])
            
            # Calculate rank improvement
            rank_improvement = baseline_ranks.mean() - current_ranks.mean()
            percent_improvement = (rank_improvement / baseline_ranks.mean()) * 100
            
            # Paired t-test
            try:
                t_stat, p_value = stats.ttest_rel(baseline_ranks, current_ranks)
            except Exception:
                t_stat, p_value = np.nan, np.nan
            
            # Determine significance
            significant = p_value < 0.05 if not np.isnan(p_value) else False
            
            analysis_results.append({
                'Corpus': label,
                'Mean_Rank': current_ranks.mean(),
                'Baseline_Rank': baseline_ranks.mean(),
                'Rank_Improvement': rank_improvement,
                'Percent_Improvement': percent_improvement,
                'T_Statistic': t_stat,
                'P_Value': p_value,
                'Significant': significant
            })
        
        # Create analysis DataFrame
        analysis_df = pd.DataFrame(analysis_results)
        analysis_df = analysis_df.sort_values('Percent_Improvement', ascending=False)
        
        print(f"\nRank Improvement Analysis (vs {baseline}):")
        print("-"*100)
        pd.set_option('display.float_format', lambda x: f'{x:.3f}')
        print(analysis_df[['Corpus', 'Mean_Rank', 'Baseline_Rank', 'Rank_Improvement', 
                          'Percent_Improvement', 'P_Value', 'Significant']].to_string(index=False))
        
        return analysis_df
    
    # Perform statistical analysis
    analysis_df = perform_statistical_analysis(comparison_results)
    
    if analysis_df is not None:
        # Save analysis
        analysis_csv = os.path.join(output_dir, "statistical_analysis_enhanced.csv")
        analysis_df.to_csv(analysis_csv, index=False, encoding='utf-8-sig')
        print(f"\n✓ Statistical analysis saved to: {analysis_csv}")
        
        # Visualize significant improvements
        if not analysis_df.empty:
            sig_df = analysis_df[analysis_df['Significant'] == True]
            if not sig_df.empty:
                fig4, ax4 = plt.subplots(figsize=(12, 6))
                
                colors = ['green' if imp > 0 else 'red' for imp in sig_df['Percent_Improvement']]
                bars = ax4.barh(sig_df['Corpus'], sig_df['Percent_Improvement'], color=colors)
                
                ax4.set_xlabel('Percent Improvement (%)', fontsize=12, fontweight='bold')
                ax4.set_title('Statistically Significant Improvements vs Baseline', 
                            fontsize=14, fontweight='bold')
                ax4.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
                
                # Add value labels
                for bar, imp in zip(bars, sig_df['Percent_Improvement']):
                    ax4.text(imp + (0.5 if imp >= 0 else -3), bar.get_y() + bar.get_height()/2,
                            f'{imp:.1f}%', ha='left' if imp >= 0 else 'right', 
                            va='center', fontsize=10)
                
                plt.tight_layout()
                plt.savefig(os.path.join(output_dir, 'significant_improvements.png'), 
                           dpi=300, bbox_inches='tight')
                plt.close()
    
    # ============================================================================
    # GENERATE FINAL REPORT
    # ============================================================================
    
    print("\n📄 Generating comprehensive report...")
    
    def generate_comprehensive_report(comparison_df, analysis_df, output_dir):
        """Generate a comprehensive markdown report"""
        
        report_file = os.path.join(output_dir, "ir_evaluation_report.md")
        
        with open(report_file, 'w', encoding='utf-8') as f:
            f.write("# Information Retrieval Evaluation Report\n\n")
            f.write("## Executive Summary\n\n")
            
            # Find best and worst performing corpus
            best_corpus = comparison_df.iloc[0]
            worst_corpus = comparison_df.iloc[-1]
            
            f.write(f"### Key Findings:\n")
            f.write(f"- **Best Performing Corpus**: `{best_corpus['Corpus']}` " 
                   f"(Mean Rank: {best_corpus['Mean_Rank']:.2f})\n")
            f.write(f"- **Worst Performing Corpus**: `{worst_corpus['Corpus']}` " 
                   f"(Mean Rank: {worst_corpus['Mean_Rank']:.2f})\n")
            
            # Calculate overall improvement
            if 'Original' in comparison_df['Corpus'].values:
                baseline_row = comparison_df[comparison_df['Corpus'] == 'Original'].iloc[0]
                baseline_rank = baseline_row['Mean_Rank']
                best_rank = best_corpus['Mean_Rank']
                improvement = ((baseline_rank - best_rank) / baseline_rank) * 100
                
                f.write(f"- **Improvement over Baseline**: {improvement:.1f}% reduction in mean rank\n")
            
            f.write(f"- **Best Recall@10**: {comparison_df['Recall@10'].max():.3f}\n")
            f.write(f"- **Worst Recall@10**: {comparison_df['Recall@10'].min():.3f}\n\n")
            
            f.write("## Methodology\n\n")
            f.write("- **Evaluation Method**: Known-Item Retrieval\n")
            f.write(f"- **Number of Queries**: {NUM_QUERIES}\n")
            f.write(f"- **Corpora Compared**: {len(comparison_df)}\n")
            f.write("- **Preprocessing**: Khmer-only tokens, min 2 characters\n")
            f.write("- **Vectorization**: TF-IDF with custom Khmer tokenizer\n")
            f.write("- **Similarity Measure**: Cosine similarity\n\n")
            
            f.write("## Detailed Results\n\n")
            
            f.write("### Performance Ranking (by Mean Rank)\n")
            f.write(comparison_df[['Corpus', 'Documents', 'Vocab_Size', 'Mean_Rank', 
                                  'Recall@5', 'Recall@10', 'Recall@20']].to_markdown(index=False) + "\n\n")
            
            if analysis_df is not None:
                f.write("### Statistical Significance Analysis\n")
                f.write("Comparison of each filtered corpus against the Original baseline:\n")
                f.write(analysis_df[['Corpus', 'Percent_Improvement', 'P_Value', 'Significant']].to_markdown(index=False) + "\n\n")
                
                # Find significantly improved groups
                sig_groups = analysis_df[analysis_df['Significant'] == True]
                if not sig_groups.empty:
                    f.write("### Significantly Improved Groups\n")
                    for _, row in sig_groups.iterrows():
                        group_name = row['Corpus'].replace('No_', '').replace('_', ' ')
                        f.write(f"- **{group_name}**: {row['Percent_Improvement']:.1f}% improvement " 
                               f"(p={row['P_Value']:.4f})\n")
                    f.write("\n")
            
            f.write("## Recommendations\n\n")
            
            f.write("### For Information Retrieval Systems:\n")
            f.write(f"1. **Recommended Corpus**: Use `{best_corpus['Corpus']}` for best retrieval performance\n")
            
            if analysis_df is not None and not sig_groups.empty:
                best_group = sig_groups.iloc[0]
                group_name = best_group['Corpus'].replace('No_', '').replace('_', ' ')
                f.write(f"2. **Stopword Strategy**: Focus on removing **{group_name}** for maximum impact\n")
            
            f.write("3. **Quality Filtering**: Always filter single-character tokens and non-Khmer content\n")
            f.write("4. **Evaluation Metric**: Recall@10 provides good discrimination between methods\n\n")
            
            f.write("### Files Generated\n\n")
            f.write("1. **ir_comparison_results_enhanced.csv** - Complete performance metrics\n")
            f.write("2. **statistical_analysis_enhanced.csv** - Statistical significance results\n")
            f.write("3. **Visualizations** - PNG files showing performance comparisons\n")
            f.write("4. **This report** - Comprehensive summary of findings\n\n")
            
            f.write("## Conclusion\n\n")
            f.write("This evaluation demonstrates the impact of different stopword removal strategies ")
            f.write("on Khmer information retrieval performance. The results show that ")
            
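            # `improvement` exists only when an 'Original' baseline row was found above.
            if 'Original' in comparison_df['Corpus'].values and improvement > 0: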
                f.write(f"proper stopword removal can improve retrieval performance by up to {improvement:.1f}%. ")
            else:
                f.write("careful selection of stopwords is crucial for optimal performance. ")
            
            f.write("The findings can guide the development of more effective Khmer IR systems.\n")
        
        print(f"✓ Comprehensive report generated: {report_file}")
        
        # Print summary to console
        print("\n" + "="*80)
        print("EVALUATION SUMMARY")
        print("="*80)
        print(f"📊 Corpora evaluated: {len(comparison_df)}")
        print(f"🏆 Best performer: {best_corpus['Corpus']} (Mean Rank: {best_corpus['Mean_Rank']:.2f})")
        
        if 'improvement' in locals():
            print(f"📈 Improvement over baseline: {improvement:.1f}%")
        
        if analysis_df is not None:
            sig_count = len(analysis_df[analysis_df['Significant'] == True])
            print(f"✅ Statistically significant improvements: {sig_count} groups")
        
        print("="*80)
    
    # Generate final report
    generate_comprehensive_report(comparison_df, analysis_df, output_dir)
    
    print("\n✅ IR EVALUATION COMPLETE")
    print(f"📁 Results saved to: {output_dir}")

 ENHANCED IR EVALUATION SYSTEM
 With Khmer-specific preprocessing and quality filtering

📁 Loading and preprocessing corpora:
------------------------------------------------------------
  Loading: Original
  Filtered out 16 very short documents
    Documents: 58,382
    Token quality: 1212/1363 (88.9%) are 3+ chars
    Khmer content: 100.0%
  Loading: No_All_Stopwords
  Filtered out 152 very short documents
    Documents: 58,242
    Token quality: 890/982 (90.6%) are 3+ chars
    Khmer content: 100.0%
  Loading: No_Auxiliary_Verbs
  Filtered out 82 very short documents
    Documents: 58,316
    Token quality: 1152/1303 (88.4%) are 3+ chars
    Khmer content: 100.0%
  Loading: No_Conjunctions
  Filtered out 36 very short documents
    Documents: 58,362
    Token quality: 1141/1283 (88.9%) are 3+ chars
    Khmer content: 100.0%
  Loading: No_Determiners
  Filtered out 17 very short documents
    Documents: 58,381
    Token quality: 1181/1332 (88.7%) are 3+ chars
    Khmer content: 10
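
Known-item retrieval, the task this notebook evaluates, treats each query as having exactly one relevant document, so Recall@K reduces to the fraction of queries whose source document appears in the top K. The sketch below illustrates that computation on a small hypothetical similarity matrix using NumPy only. The function name `known_item_metrics` and the toy scores are assumptions for illustration, not part of the notebook's pipeline. Ties are resolved in the target's favour here, which can differ from an `argsort`-based ranking.

```python
import numpy as np

def known_item_metrics(sim_matrix, true_doc_indices, ks=(1, 5, 10)):
    """Rank of each query's source document, plus mean rank and Recall@K.

    sim_matrix: (n_queries, n_docs) similarity scores, higher is better.
    true_doc_indices: for each query, the column index of its source document.
    """
    sim = np.asarray(sim_matrix, dtype=float)
    true_idx = np.asarray(true_doc_indices)
    # Score that each query assigns to its own source document
    true_scores = sim[np.arange(sim.shape[0]), true_idx]
    # Rank = 1 + number of documents scoring strictly higher than the target
    # (assumption: ties count in the target's favour)
    ranks = 1 + (sim > true_scores[:, None]).sum(axis=1)
    # With a single relevant document, Recall@K is the share of queries with rank <= K
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    return ranks, float(ranks.mean()), recall

# Hypothetical 3-query x 4-document similarity matrix, not notebook output
toy_sim = np.array([
    [0.9, 0.1, 0.3, 0.2],   # query 0, target doc 0: no document scores higher
    [0.4, 0.5, 0.7, 0.1],   # query 1, target doc 1: one document scores higher
    [0.2, 0.8, 0.3, 0.6],   # query 2, target doc 2: two documents score higher
])
toy_ranks, toy_mean_rank, toy_recall = known_item_metrics(toy_sim, [0, 1, 2], ks=(1, 2))
print(toy_ranks, toy_mean_rank, toy_recall)
```

Because mean rank is sensitive to a few very poorly ranked queries, it is worth reading it alongside Recall@K rather than on its own.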

In [3]:
# ============================================================================
# ENHANCED IR EVALUATION WITH DEBUGGING AND DEEP ANALYSIS
# ============================================================================

import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import json
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import ndcg_score
from scipy import stats

class DebugKhmerIREvaluator:
    """IR evaluator with deep debugging capabilities"""
    
    def __init__(self, vectorizer_params=None):
        self.khmer_pattern = re.compile(r'[\u1780-\u17FF]')
        
        if vectorizer_params is None:
            self.vectorizer_params = {
                'max_features': 10000,  # Increased from 5000
                'min_df': 1,  # Changed from 2 to handle sparse data
                'max_df': 0.98,  # Increased from 0.95
                'use_idf': True,
                'smooth_idf': True,
                'lowercase': False,
                'sublinear_tf': True  # Added for better weighting
            }
        else:
            self.vectorizer_params = vectorizer_params
        
        self.results = {}
        self.debug_data = {}
        
    def diagnose_corpus_issue(self, corpus, label):
        """Deep diagnosis of corpus issues"""
        print(f"\n🔍 DIAGNOSING CORPUS: {label}")
        print("="*60)
        
        # 1. Document statistics
        doc_lengths = [len(doc.split()) for doc in corpus]
        avg_doc_length = np.mean(doc_lengths)
        median_doc_length = np.median(doc_lengths)
        
        print(f"  Document count: {len(corpus):,}")
        print(f"  Avg document length: {avg_doc_length:.1f} tokens")
        print(f"  Median document length: {median_doc_length:.1f} tokens")
        print(f"  Min/Max length: {min(doc_lengths)} / {max(doc_lengths)} tokens")
        
        # 2. Token analysis
        all_tokens = []
        for doc in corpus:
            all_tokens.extend(doc.split())
        
        token_counter = Counter(all_tokens)
        
        print(f"\n  Token statistics:")
        print(f"  Total tokens: {len(all_tokens):,}")
        print(f"  Unique tokens: {len(token_counter):,}")
        
        # 3. Token length distribution
        token_lengths = [len(token) for token in all_tokens]
        length_counter = Counter(token_lengths)
        
        print(f"\n  Token length distribution:")
        for length in sorted(length_counter.keys())[:10]:
            count = length_counter[length]
            pct = count / len(all_tokens) * 100
            print(f"    {length} chars: {count:,} ({pct:.1f}%)")
        
        # 4. Most common tokens
        print(f"\n  Top 20 most common tokens:")
        for token, count in token_counter.most_common(20):
            khmer_flag = "✓" if self.khmer_pattern.search(token) else "✗"
            print(f"    {khmer_flag} '{token}' (len={len(token)}): {count:,}")
        
        # 5. Vocabulary density (how many documents contain each token)
        if len(corpus) > 0:
            doc_freq = Counter()
            for doc in corpus:
                tokens_in_doc = set(doc.split())
                for token in tokens_in_doc:
                    doc_freq[token] += 1
            
            avg_doc_freq = np.mean(list(doc_freq.values()))
            median_doc_freq = np.median(list(doc_freq.values()))
            
            print(f"\n  Document frequency analysis:")
            print(f"  Avg doc frequency: {avg_doc_freq:.2f}")
            print(f"  Median doc frequency: {median_doc_freq:.2f}")
            
            # Show tokens with extreme frequencies
            rare_tokens = [(t, f) for t, f in doc_freq.items() if f == 1]
            if rare_tokens:
                print(f"  Tokens in only 1 document: {len(rare_tokens):,}")
            
            common_tokens = [(t, f) for t, f in doc_freq.items() if f > len(corpus) * 0.5]
            if common_tokens:
                print(f"  Tokens in >50% documents: {len(common_tokens):,}")
        
        # 6. Check for data sparsity issues
        total_token_positions = sum(doc_lengths)
        unique_token_count = len(token_counter)
        sparsity_ratio = unique_token_count / total_token_positions if total_token_positions > 0 else 0
        
        print(f"\n  Sparsity analysis:")
        print(f"  Type-Token Ratio: {sparsity_ratio:.3f}")
        if sparsity_ratio > 0.8:
            print(f"  ⚠️  HIGH SPARSITY - many unique tokens, few repetitions")
        elif sparsity_ratio < 0.2:
            print(f"  ✓ LOW SPARSITY - good token repetition")
        
        # 7. Sample document analysis
        print(f"\n  Sample document analysis (first 3):")
        for i, doc in enumerate(corpus[:3]):
            tokens = doc.split()
            print(f"  Doc {i+1}: {len(tokens)} tokens")
            print(f"    Sample: {' '.join(tokens[:10])}..." if len(tokens) > 10 else f"    Content: {doc}")
        
        return {
            'doc_count': len(corpus),
            'avg_doc_length': avg_doc_length,
            'total_tokens': len(all_tokens),
            'unique_tokens': len(token_counter),
            'sparsity_ratio': sparsity_ratio,
            'token_counter': token_counter,
            'doc_lengths': doc_lengths
        }
    
    def load_and_preprocess_corpus(self, filepath, label, debug=True):
        """Load and preprocess corpus with optional debugging"""
        print(f"\n📁 Loading: {label}")
        
        if not os.path.exists(filepath):
            print(f"  ❌ File not found: {filepath}")
            return None
        
        # Read documents
        with open(filepath, 'r', encoding='utf-8') as f:
            documents = [line.strip() for line in f if line.strip()]
        
        if not documents:
            print(f"  ⚠️  Empty file!")
            return None
        
        print(f"  Raw documents: {len(documents):,}")
        
        # Clean documents (keep only Khmer, min 2 characters)
        cleaned_docs = []
        for doc in documents:
            # Keep only tokens that contain Khmer script and are at least 2 characters long
            tokens = doc.split()
            cleaned_tokens = []
            
            for token in tokens:
                # Must contain Khmer
                if not self.khmer_pattern.search(token):
                    continue
                
                # Must be long enough
                if len(token) < 2:
                    continue
                
                cleaned_tokens.append(token)
            
            if cleaned_tokens:
                cleaned_docs.append(' '.join(cleaned_tokens))
        
        print(f"  After cleaning: {len(cleaned_docs):,}")
        
        # Filter out very short documents
        filtered_docs = [doc for doc in cleaned_docs if len(doc.split()) >= 3]
        
        if len(filtered_docs) < len(cleaned_docs):
            removed = len(cleaned_docs) - len(filtered_docs)
            print(f"  Removed {removed} very short documents (<3 tokens)")
        
        if not filtered_docs:
            print(f"  ❌ No valid documents after filtering!")
            return None
        
        # Run deep diagnosis if debug is True
        if debug:
            diagnosis = self.diagnose_corpus_issue(filtered_docs, label)
            self.debug_data[label] = diagnosis
        
        return filtered_docs
    
    def evaluate_with_debugging(self, corpus, query_indices, label, top_k_values=(1, 5, 10, 20, 50)):
        """
        Evaluate retrieval with detailed debugging information
        """
        print(f"\n🔬 Evaluating with debugging: {label}")
        print(f"  Query count: {len(query_indices)}")
        
        # Build index with debugging
        vectorizer, X_corpus = self._build_index_with_debug(corpus, label)
        
        if X_corpus is None:
            print(f"  ❌ Could not build index for {label}")
            return None
        
        # Prepare queries
        queries = [corpus[i] for i in query_indices]
        X_queries = vectorizer.transform(queries)
        
        # Debug query-document similarity
        self._debug_similarity_distribution(X_queries, X_corpus, label, query_indices)
        
        # Compute similarity matrix
        sim_matrix = cosine_similarity(X_queries, X_corpus)
        
        # Analyze similarity scores
        self._analyze_similarity_scores(sim_matrix, label, query_indices)
        
        # Process results
        results = self._process_retrieval_results(sim_matrix, query_indices, len(corpus), top_k_values)
        
        # Add corpus metadata
        results['label'] = label
        results['doc_count'] = len(corpus)
        results['vocab_size'] = X_corpus.shape[1]
        results['corpus_density'] = X_corpus.nnz / (X_corpus.shape[0] * X_corpus.shape[1])
        
        return results
    
    def _build_index_with_debug(self, corpus, label):
        """Build index with debugging information"""
        
        def khmer_tokenizer(text):
            return text.split()
        
        try:
            vectorizer = TfidfVectorizer(
                **self.vectorizer_params,
                tokenizer=khmer_tokenizer,
                token_pattern=None
            )
            
            X = vectorizer.fit_transform(corpus)
            
            print(f"  ✓ TF-IDF matrix shape: {X.shape}")
            print(f"  ✓ Non-zero elements: {X.nnz:,}")
            print(f"  ✓ Density: {X.nnz/(X.shape[0]*X.shape[1]):.6f}")
            
            # Check for extreme cases
            if X.shape[1] < 10:
                print(f"  ⚠️  VERY SMALL VOCABULARY: {X.shape[1]} features")
            
            if X.nnz == 0:
                print(f"  ❌ EMPTY MATRIX - no features found!")
                return None, None
            
            return vectorizer, X
            
        except Exception as e:
            print(f"  ❌ Error building index: {e}")
            return None, None
    
    def _debug_similarity_distribution(self, X_queries, X_corpus, label, query_indices):
        """Debug similarity score distribution"""
        
        # For a sample query, show top similarities
        sample_idx = 0
        if sample_idx < len(query_indices):
            query_vector = X_queries[sample_idx]
            
            # Compute similarities for this query
            similarities = cosine_similarity(query_vector, X_corpus)[0]
            
            # Sort by similarity
            sorted_indices = np.argsort(similarities)[::-1]
            
            print(f"\n  Sample query {sample_idx} similarity analysis:")
            print(f"    Query index: {query_indices[sample_idx]}")
            print(f"    Top 5 similarities: {similarities[sorted_indices[:5]]}")
            print(f"    Bottom 5 similarities: {similarities[sorted_indices[-5:]]}")
            print(f"    Mean similarity: {np.mean(similarities):.6f}")
            print(f"    Std similarity: {np.std(similarities):.6f}")
            
            # Check if true document has reasonable similarity
            true_doc_idx = query_indices[sample_idx]
            true_similarity = similarities[true_doc_idx]
            print(f"    Similarity to true document: {true_similarity:.6f}")
            
            # Rank of true document
            rank = np.where(sorted_indices == true_doc_idx)[0][0] + 1
            print(f"    Rank of true document: {rank}/{len(similarities)}")
    
    def _analyze_similarity_scores(self, sim_matrix, label, query_indices):
        """Analyze similarity score patterns"""
        
        # Check for flat similarity distributions (all documents similar)
        score_std = np.std(sim_matrix.flatten())
        score_mean = np.mean(sim_matrix.flatten())
        
        print(f"\n  Similarity score analysis:")
        print(f"    Mean score: {score_mean:.6f}")
        print(f"    Std score: {score_std:.6f}")
        
        if score_std < 0.001:
            print(f"    ⚠️  VERY LOW VARIANCE - all documents look similar!")
        
        # Check query-specific patterns
        problematic_queries = []
        for i, true_doc_idx in enumerate(query_indices):
            scores = sim_matrix[i]
            true_score = scores[true_doc_idx]
            max_score = np.max(scores)
            
            if true_score < max_score * 0.5:  # True doc scores below half of the best match
                problematic_queries.append(i)
        
        if problematic_queries:
            print(f"    ⚠️  {len(problematic_queries)} queries where true doc scores below 50% of the top match")
    
    def _process_retrieval_results(self, sim_matrix, query_indices, corpus_size, top_k_values):
        """Process retrieval results with debugging"""
        
        results = {
            'ranks': [],
            'similarities': [],
            'precision_at_k': {k: [] for k in top_k_values},
            'recall_at_k': {k: [] for k in top_k_values},
            'ndcg_at_k': {k: [] for k in top_k_values},
            'query_details': []
        }
        
        # Process each query
        for i, true_doc_idx in enumerate(query_indices):
            scores = sim_matrix[i]
            
            # Get ranked list
            sorted_indices = np.argsort(scores)[::-1]
            
            # Find rank of true document
            rank_positions = np.where(sorted_indices == true_doc_idx)[0]
            rank = rank_positions[0] + 1 if len(rank_positions) > 0 else corpus_size
            
            results['ranks'].append(rank)
            results['similarities'].append(scores[true_doc_idx])
            
            # Store detailed information for debugging
            query_detail = {
                'query_idx': i,
                'true_doc_idx': true_doc_idx,
                'rank': rank,
                'true_similarity': scores[true_doc_idx],
                'max_similarity': np.max(scores),
                'top_5_indices': sorted_indices[:5].tolist(),
                'top_5_scores': scores[sorted_indices[:5]].tolist()
            }
            results['query_details'].append(query_detail)
            
            # Calculate metrics
            relevance = np.zeros(corpus_size)
            relevance[true_doc_idx] = 1
            
            for k in top_k_values:
                if k <= corpus_size:
                    # Precision@K
                    relevant_in_top_k = np.sum(relevance[sorted_indices[:k]])
                    precision = relevant_in_top_k / k
                    results['precision_at_k'][k].append(precision)
                    
                    # Recall@K
                    total_relevant = 1
                    recall = relevant_in_top_k / total_relevant
                    results['recall_at_k'][k].append(recall)
                    
                    # NDCG@K
                    ndcg = ndcg_score([relevance], [scores], k=k)
                    results['ndcg_at_k'][k].append(ndcg)
        
        # Calculate aggregate metrics
        results['mean_rank'] = np.mean(results['ranks'])
        results['median_rank'] = np.median(results['ranks'])
        results['std_rank'] = np.std(results['ranks'])
        results['min_rank'] = np.min(results['ranks'])
        results['max_rank'] = np.max(results['ranks'])
        results['mean_similarity'] = np.mean(results['similarities'])
        
        for k in top_k_values:
            if k <= corpus_size:
                results[f'mean_precision@{k}'] = np.nanmean(results['precision_at_k'][k])
                results[f'mean_recall@{k}'] = np.nanmean(results['recall_at_k'][k])
                results[f'mean_ndcg@{k}'] = np.nanmean(results['ndcg_at_k'][k])
        
        # Calculate success rates
        results['success_rate'] = {}
        for k in top_k_values:
            if k <= corpus_size:
                results['success_rate'][k] = np.mean([1 if r <= k else 0 for r in results['ranks']])
        
        return results
    
    def run_comprehensive_evaluation(self, processed_dir, output_dir):
        """Run comprehensive evaluation with debugging"""
        
        # Files to analyze
        files_to_analyze = [
            ('Original', 'original_segmented_sentences.txt'),
            ('No_Prepositions', 'no_Prepositions___Relational_Words.txt'),  # Problematic one first
            ('No_Auxiliary_Verbs', 'no_Auxiliary_Verbs___Aspect_Markers.txt'),
            ('No_Conjunctions', 'no_Conjunctions.txt'),
            ('No_Determiners', 'no_Determiners_and_Quantifiers.txt'),
            ('No_Function_Nouns', 'no_Function_Nouns.txt'),
            ('No_Numbers', 'no_Numbers_and_Time_Expressions.txt'),
            ('No_Particles', 'no_Particles_and_Discourse_Markers.txt'),
            ('No_Politeness', 'no_Politeness_and_Honorifics.txt'),
            ('No_Pronouns', 'no_Pronouns.txt'),
            ('No_Questions', 'no_Question_and_Negation_Words.txt'),
            ('No_All_Stopwords', 'no_all_stopwords.txt')
        ]
        
        # Create output directory
        os.makedirs(output_dir, exist_ok=True)
        
        print("="*80)
        print("COMPREHENSIVE IR EVALUATION WITH DEBUGGING")
        print("="*80)
        
        # Load and preprocess all corpora
        print("\n📁 LOADING CORPORA:")
        
        corpora = {}
        for label, filename in files_to_analyze:
            filepath = os.path.join(processed_dir, filename)
            corpus = self.load_and_preprocess_corpus(filepath, label, debug=True)
            
            if corpus is not None:
                corpora[label] = corpus
        
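        # Determine query set
        if not corpora:
            print("  ❌ No corpora could be loaded; nothing to evaluate.")
            return
        
        min_corpus_size = min(len(corpus) for corpus in corpora.values())
        # Reduced for debugging; keep at least one query for very small corpora
        NUM_QUERIES = max(1, min(50, min_corpus_size - 5))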
        
        print(f"\n📝 Using {NUM_QUERIES} queries")
        
        # Set random seed
        RANDOM_SEED = 42
        np.random.seed(RANDOM_SEED)
        random.seed(RANDOM_SEED)
        
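        # Create query indices.
        # NOTE: the same positional indices are reused for every corpus. Each corpus is
        # filtered independently, so index i may refer to different sentences across corpora.
        query_indices = random.sample(range(min_corpus_size), NUM_QUERIES)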
        
        # Run evaluation for each corpus
        print("\n" + "="*80)
        print("RUNNING EVALUATIONS")
        print("="*80)
        
        comparison_results = {}
        
        for label, corpus in corpora.items():
            print(f"\n{'='*60}")
            print(f"EVALUATING: {label}")
            print(f"{'='*60}")
            
            results = self.evaluate_with_debugging(corpus, query_indices, label)
            
            if results is not None:
                comparison_results[label] = results
                
                # Print summary
                print(f"\n📊 SUMMARY FOR {label}:")
                print(f"  Mean Rank: {results['mean_rank']:.2f}")
                print(f"  Median Rank: {results['median_rank']:.1f}")
                print(f"  Success Rate @10: {results.get('success_rate', {}).get(10, 0):.3f}")
                print(f"  Corpus Density: {results.get('corpus_density', 0):.6f}")
                
                if results['mean_rank'] > 100:
                    print(f"  ⚠️  VERY POOR PERFORMANCE - Mean rank > 100")
                    print(f"     This suggests the corpus has serious issues!")
            
            # Save detailed results
            if results and 'query_details' in results:
                details_file = os.path.join(output_dir, f"{label}_query_details.json")
                # Convert all numpy types to native Python types for JSON serialization
                def convert_types(obj):
                    if isinstance(obj, np.integer):
                        return int(obj)
                    elif isinstance(obj, np.floating):
                        return float(obj)
                    elif isinstance(obj, np.ndarray):
                        return obj.tolist()
                    elif isinstance(obj, (list, tuple)):
                        return [convert_types(i) for i in obj]
                    elif isinstance(obj, dict):
                        return {k: convert_types(v) for k, v in obj.items()}
                    else:
                        return obj

                safe_query_details = [convert_types(qd) for qd in results['query_details']]
                with open(details_file, 'w', encoding='utf-8') as f:
                    json.dump(safe_query_details, f, ensure_ascii=False, indent=2)
        
        # Generate comparative analysis
        self._generate_comparative_analysis(comparison_results, output_dir)
        
        # Generate recommendations based on findings
        self._generate_recommendations(comparison_results, output_dir)
        
        print("\n" + "="*80)
        print("EVALUATION COMPLETE")
        print("="*80)
    
    def _generate_comparative_analysis(self, comparison_results, output_dir):
        """Generate comparative analysis"""
        
        if len(comparison_results) < 2:
            return
        
        # Create comparison DataFrame
        metrics_data = []
        
        for label, results in comparison_results.items():
            row = {
                'Corpus': label,
                'Documents': results.get('doc_count', 0),
                'Vocab_Size': results.get('vocab_size', 0),
                'Density': results.get('corpus_density', 0),
                'Mean_Rank': results.get('mean_rank', np.nan),
                'Median_Rank': results.get('median_rank', np.nan),
                'Std_Rank': results.get('std_rank', np.nan),
                'Mean_Similarity': results.get('mean_similarity', np.nan)
            }
            
            for k in [1, 5, 10, 20]:
                if f'mean_recall@{k}' in results:
                    row[f'Recall@{k}'] = results[f'mean_recall@{k}']
                    row[f'Success_Rate@{k}'] = results.get('success_rate', {}).get(k, np.nan)
            
            metrics_data.append(row)
        
        df = pd.DataFrame(metrics_data)
        
        # Sort by performance
        df_sorted = df.sort_values('Mean_Rank')
        
        print("\n📋 PERFORMANCE COMPARISON:")
        print(df_sorted[['Corpus', 'Documents', 'Vocab_Size', 'Mean_Rank', 'Recall@10']].to_string(index=False))
        
        # Save to CSV
        csv_path = os.path.join(output_dir, 'performance_comparison_detailed.csv')
        df.to_csv(csv_path, index=False, encoding='utf-8-sig')
        print(f"\n✓ Detailed comparison saved to: {csv_path}")
        
        # Create visualizations
        self._create_diagnostic_visualizations(df, output_dir)
        
        # Identify problematic cases
        problematic = df[df['Mean_Rank'] > 100]
        if not problematic.empty:
            print("\n⚠️  PROBLEMATIC CORPORA (Mean Rank > 100):")
            for _, row in problematic.iterrows():
                print(f"  - {row['Corpus']}: Mean Rank = {row['Mean_Rank']:.1f}")
                print(f"    Documents: {row['Documents']}, Vocab Size: {row['Vocab_Size']}")
                print(f"    Density: {row['Density']:.6f}")
        
        # Identify best performers
        excellent = df[df['Mean_Rank'] <= 5]
        if not excellent.empty:
            print("\n🏆 EXCELLENT PERFORMERS (Mean Rank <= 5):")
            for _, row in excellent.iterrows():
                print(f"  - {row['Corpus']}: Mean Rank = {row['Mean_Rank']:.1f}")
    
    def _create_diagnostic_visualizations(self, df, output_dir):
        """Create diagnostic visualizations"""
        
        # Filter out extreme outliers for better visualization
        plot_df = df.copy()
        if len(plot_df) > 0:
            # Cap mean rank for visualization
            max_rank_for_viz = 50
            plot_df['Mean_Rank_Viz'] = plot_df['Mean_Rank'].clip(upper=max_rank_for_viz)
            
            fig, axes = plt.subplots(2, 2, figsize=(16, 12))
            
            # Plot 1: Mean Rank comparison
            ax1 = axes[0, 0]
            sorted_df = plot_df.sort_values('Mean_Rank')
            colors = ['red' if rank > 100 else 'green' for rank in sorted_df['Mean_Rank']]
            bars = ax1.barh(range(len(sorted_df)), sorted_df['Mean_Rank_Viz'], color=colors)
            ax1.set_yticks(range(len(sorted_df)))
            ax1.set_yticklabels(sorted_df['Corpus'])
            ax1.set_xlabel(f'Mean Rank (capped at {max_rank_for_viz})')
            ax1.set_title('Retrieval Performance by Corpus')
            ax1.invert_yaxis()
            
            # Add actual values for problematic cases
            for bar, rank in zip(bars, sorted_df['Mean_Rank']):
                if rank > 100:
                    ax1.text(bar.get_width() + 1, bar.get_y() + bar.get_height()/2,
                            f'({rank:.0f})', va='center')
            
            # Plot 2: Vocabulary size vs Performance
            ax2 = axes[0, 1]
            scatter = ax2.scatter(plot_df['Vocab_Size'], plot_df['Mean_Rank'], 
                                 s=plot_df['Documents']/10, alpha=0.7,
                                 c=plot_df['Mean_Rank'], cmap='RdYlGn_r')
            ax2.set_xlabel('Vocabulary Size')
            ax2.set_ylabel('Mean Rank')
            ax2.set_title('Vocabulary Size vs Performance')
            ax2.grid(True, alpha=0.3)
            
            # Plot 3: Density vs Performance
            ax3 = axes[1, 0]
            scatter = ax3.scatter(plot_df['Density'], plot_df['Mean_Rank'], 
                                 s=plot_df['Vocab_Size']/100, alpha=0.7,
                                 c=plot_df['Mean_Rank'], cmap='RdYlGn_r')
            ax3.set_xlabel('Matrix Density')
            ax3.set_ylabel('Mean Rank')
            ax3.set_title('Matrix Density vs Performance')
            ax3.grid(True, alpha=0.3)
            
            # Plot 4: Recall@10 comparison
            ax4 = axes[1, 1]
            if 'Recall@10' in plot_df.columns:
                sorted_df = plot_df.sort_values('Recall@10', ascending=False)
                bars = ax4.bar(range(len(sorted_df)), sorted_df['Recall@10'])
                ax4.set_xticks(range(len(sorted_df)))
                ax4.set_xticklabels(sorted_df['Corpus'], rotation=45, ha='right')
                ax4.set_ylabel('Recall@10')
                ax4.set_title('Recall@10 by Corpus')
            
            plt.tight_layout()
            plt.savefig(os.path.join(output_dir, 'diagnostic_visualizations.png'), 
                       dpi=300, bbox_inches='tight')
            plt.close()
    
    def _generate_recommendations(self, comparison_results, output_dir):
        """Generate recommendations based on evaluation results"""
        
        recommendations = []
        
        if 'No_Prepositions' in comparison_results:
            prepos_results = comparison_results['No_Prepositions']
            if prepos_results.get('mean_rank', 0) > 100:
                recommendations.append({
                    'issue': 'No_Prepositions has very poor performance',
                    'analysis': 'Prepositions appear to be crucial for Khmer IR',
                    'recommendation': 'DO NOT remove prepositions from Khmer text',
                    'severity': 'HIGH'
                })
        
        # Check for vocabulary size issues
        for label, results in comparison_results.items():
            vocab_size = results.get('vocab_size', 0)
            doc_count = results.get('doc_count', 0)
            
            if vocab_size > 0 and doc_count > 0:
                ratio = vocab_size / doc_count
                if ratio > 10:
                    recommendations.append({
                        'issue': f'{label} has high vocabulary-to-document ratio',
                        'analysis': f'Ratio: {ratio:.1f} (vocab={vocab_size}, docs={doc_count})',
                        'recommendation': 'Consider reducing max_features or increasing min_df',
                        'severity': 'MEDIUM'
                    })
        
        # Check for density issues
        for label, results in comparison_results.items():
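            density = results.get('corpus_density')
            # Skip corpora with no recorded density rather than treating a missing value as 0
            if density is not None and density < 0.0001: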
                recommendations.append({
                    'issue': f'{label} has very sparse TF-IDF matrix',
                    'analysis': f'Density: {density:.6f}',
                    'recommendation': 'Adjust TF-IDF parameters or use different features',
                    'severity': 'HIGH'
                })
        
        # Save recommendations
        if recommendations:
            rec_file = os.path.join(output_dir, 'recommendations.json')
            with open(rec_file, 'w', encoding='utf-8') as f:
                json.dump(recommendations, f, ensure_ascii=False, indent=2)
            
            print("\n📋 RECOMMENDATIONS:")
            for rec in recommendations:
                print(f"\n[{rec['severity']}] {rec['issue']}")
                print(f"  Analysis: {rec['analysis']}")
                print(f"  Recommendation: {rec['recommendation']}")

# ============================================================================
# MAIN EXECUTION
# ============================================================================

if __name__ == "__main__":
    # Configuration
    processed_dir = r'D:\Year 5\S1\Information Retrieval\StopwordProject\khmer_stopword_project\data\processed'
    output_dir = r'D:\Year 5\S1\Information Retrieval\StopwordProject\khmer_stopword_project\data\ir_evaluation_debug'
    
    # Initialize evaluator
    evaluator = DebugKhmerIREvaluator()
    
    # Run comprehensive evaluation
    evaluator.run_comprehensive_evaluation(processed_dir, output_dir)
    
    print(f"\n✅ All results saved to: {output_dir}")

COMPREHENSIVE IR EVALUATION WITH DEBUGGING

📁 LOADING CORPORA:

📁 Loading: Original
  Raw documents: 58,409
  After cleaning: 58,398
  Removed 16 very short documents (<3 tokens)

🔍 DIAGNOSING CORPUS: Original
  Document count: 58,382
  Avg document length: 29.0 tokens
  Median document length: 25.0 tokens
  Min/Max length: 3 / 535 tokens

  Token statistics:
  Total tokens: 1,691,794
  Unique tokens: 45,453

  Token length distribution:
    2 chars: 197,607 (11.7%)
    3 chars: 443,103 (26.2%)
    4 chars: 191,029 (11.3%)
    5 chars: 269,176 (15.9%)
    6 chars: 202,084 (11.9%)
    7 chars: 142,929 (8.4%)
    8 chars: 86,120 (5.1%)
    9 chars: 54,400 (3.2%)
    10 chars: 40,655 (2.4%)
    11 chars: 26,927 (1.6%)

  Top 20 most common tokens:
    ✓ 'បាន' (len=3): 39,983
    ✓ 'និង' (len=3): 36,396
    ✓ 'ដែល' (len=3): 29,783
    ✓ 'មាន' (len=3): 27,127
    ✓ 'ជា' (len=2): 22,253
    ✓ 'នៅ' (len=2): 21,807
    ✓ 'ថា' (len=2): 