# IR Evaluation: Impact of Stopword Removal

This notebook implements the Information Retrieval (IR) evaluation as part of the Stop Word Project.
We compare the retrieval performance of a search engine on two versions of the same corpus:
1. **Full Text**: Documents with all words (stopwords included).
2. **Cleaned Text**: Documents with Khmer stopwords removed.

## Objectives:
- Prepare the two datasets (segmentation and filtering).
- Implement a TF-IDF based Vector Space Model.
- Evaluate retrieval quality using a **Known-Item Retrieval** task (simulating search ranking).
- Metrics: Mean Rank, Recall@K.


In [None]:
# Install necessary libraries if not present
!pip install khmer-nltk scikit-learn pandas matplotlib

In [None]:
import csv
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

try:
    from khmernltk import word_tokenize
    print("Khmer NLTK loaded successfully.")
except ImportError:
    print("Khmer NLTK not found. Please run the installation cell above.")


## 1. Load Resources and Data
We load the comprehensive Stopword list and the raw text corpus.
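
As a minimal sketch of the filtering rule used by the loader in this section, the snippet below parses an in-memory CSV with the two columns the code relies on (`term` and `linguistic_group`). The sample rows, the group labels, and the helper name `stopwords_from_csv_text` are hypothetical and are not taken from the project's stopword file.

```python
import csv
import io

# Hypothetical sample rows; the real file's group labels may differ.
SAMPLE_CSV = """term,linguistic_group
និង,Conjunction
នៃ,Preposition
សាលា,Content Words
"""

def stopwords_from_csv_text(csv_text):
    """Collect every term whose group is not a content-word group."""
    stopwords = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        group = (row.get("linguistic_group") or "").lower()
        if "content word" not in group:
            stopwords.add(row["term"].strip())
    return stopwords

sample_stopwords = stopwords_from_csv_text(SAMPLE_CSV)
print(sorted(sample_stopwords))
```

Only the two function words end up in the set; the row labelled as a content word is kept out of the stopword list.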

In [None]:
def load_custom_stopwords(csv_path):
    stopwords = set()
    if not os.path.exists(csv_path):
        print(f"Warning: Stopword file not found at {csv_path}")
        return stopwords
        
    with open(csv_path, encoding="utf-8-sig") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Treat every term outside the 'Content Words' linguistic group
            # (i.e. function words) as a stopword.
            if "content word" not in (row.get("linguistic_group") or "").lower():
                stopwords.add(row["term"].strip())
    return stopwords

STOPWORDS_PATH = "../stopwords/FIle_Stopwords.csv"
KHMER_STOPWORDS = load_custom_stopwords(STOPWORDS_PATH)
print(f"Loaded {len(KHMER_STOPWORDS)} Khmer stopwords.")

In [None]:
def load_and_process_corpus(filepath, limit=5000):
    """
    Reads the raw file, tokenizes it, and creates two versions:
    1. segmented_text (with stopwords)
    2. filtered_text (without stopwords)
    """
    raw_docs = []
    corpus_sw = []
    corpus_no_sw = []
    
    if not os.path.exists(filepath):
        print(f"Error: Data file not found at {filepath}")
        return [], [], []

    with open(filepath, 'r', encoding='utf-8') as f:
        count = 0
        for line in f:
            line = line.strip()
            if not line: 
                continue
                
            try:
                tokens = word_tokenize(line)
                if not tokens: continue
                
                # Join for TF-IDF (space separated)
                text_sw = " ".join(tokens)
                
                # Remove stopwords
                tokens_filtered = [t for t in tokens if t not in KHMER_STOPWORDS]
                text_no_sw = " ".join(tokens_filtered)
                
                raw_docs.append(line)
                corpus_sw.append(text_sw)
                corpus_no_sw.append(text_no_sw)
                
                count += 1
                if count >= limit:
                    break
            except Exception:
                # Skip lines the tokenizer cannot handle
                continue
                
    print(f"Processed {len(corpus_sw)} documents.")
    return raw_docs, corpus_sw, corpus_no_sw

# Load a sample of 3000 documents for evaluation speed
DATA_PATH = "../data/raw/news_text_file_150k.txt"
raw_docs, docs_with_sw, docs_without_sw = load_and_process_corpus(DATA_PATH, limit=3000)

## 2. IR System Implementation
We use TF-IDF weighting and Cosine Similarity.
We define a function `evaluate_retrieval` that takes a corpus and a set of query document indices.
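
To make the scoring concrete, here is a small pure-Python sketch of TF-IDF weighting and cosine similarity on a toy, pre-tokenized corpus. It uses the smoothed IDF `log((1 + N) / (1 + df)) + 1`, which is the formula scikit-learn applies by default, but it is an illustration rather than a reimplementation of `TfidfVectorizer`. The toy documents and the helper names (`tfidf_vectors`, `cosine`, `rank_of`) are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_of(doc_index, query_vec, doc_vecs):
    """1-based rank of doc_index when documents are sorted by similarity."""
    scores = [cosine(query_vec, v) for v in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(doc_index) + 1

# Hypothetical toy corpus (English tokens for readability)
toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices rose today",
]
vecs = tfidf_vectors(toy_docs)
toy_rank = rank_of(1, vecs[1], vecs)  # query is document 1 itself
print(toy_rank)
```

Because the query here is the full text of the target document, its similarity to itself is the maximum possible, which is why a known-item setup of this kind expects the target at or near rank 1.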

In [None]:
def evaluate_retrieval(corpus, query_indices, top_k=10):
    """
    Evaluates retrieval performance using Known-Item Retrieval.
    For each document in query_indices, we try to retrieve it from the corpus.
    Ideally, it should be Rank 1.
    """
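    # 1. Build Index
    # Documents are already tokenized and space-joined, so split on whitespace
    # instead of using the default token pattern, which can split Khmer words.
    vectorizer = TfidfVectorizer(analyzer=str.split)
    X_corpus = vectorizer.fit_transform(corpus)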
    
    # 2. Process Queries
    # The queries are the documents themselves
    queries = [corpus[i] for i in query_indices]
    X_queries = vectorizer.transform(queries)
    
    # 3. Compute Similarity
    # Shape: (n_queries, n_corpus)
    sim_matrix = cosine_similarity(X_queries, X_corpus)
    
    ranks = []
    hits_at_k = 0
    
    for i, true_doc_idx in enumerate(query_indices):
        scores = sim_matrix[i]
        
        # Sort indices by score descending
        sorted_indices = np.argsort(scores)[::-1]
        
        # Find where the true document is in the ranked list
        # np.where returns a tuple; [0] is the array of matching positions
        rank_positions = np.where(sorted_indices == true_doc_idx)[0]
        
        if len(rank_positions) > 0:
            rank = rank_positions[0] + 1 # 1-based rank
        else:
            rank = len(corpus) # Should not happen if query is in corpus
            
        ranks.append(rank)
        if rank <= top_k:
            hits_at_k += 1
            
    mean_rank = np.mean(ranks)
    recall_at_k = hits_at_k / len(query_indices)
    
    return mean_rank, recall_at_k, ranks


## 3. Run Experiments
We select 50 random documents as "queries" and test retrieval on both datasets.

In [None]:
random.seed(42)
NUM_QUERIES = 50
if len(docs_with_sw) > NUM_QUERIES:
    query_indices = random.sample(range(len(docs_with_sw)), NUM_QUERIES)
else:
    query_indices = list(range(len(docs_with_sw)))

print(f"Selected {len(query_indices)} random documents as queries.")

# Experiment 1: With Stopwords
print("\n--- Evaluating WITH Stopwords ---")
mr_sw, r_k_sw, ranks_sw = evaluate_retrieval(docs_with_sw, query_indices)
print(f"Mean Rank: {mr_sw:.2f}")
print(f"Recall@10: {r_k_sw:.2f}")

# Experiment 2: Without Stopwords
print("\n--- Evaluating WITHOUT Stopwords ---")
mr_now, r_k_now, ranks_now = evaluate_retrieval(docs_without_sw, query_indices)
print(f"Mean Rank: {mr_now:.2f}")
print(f"Recall@10: {r_k_now:.2f}")


## 4. Analysis and Visualization
We compare the rank distribution to see if removing stopwords helps the correct document appear higher (closer to rank 1).

In [None]:
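plt.figure(figsize=(10, 6))
# Use shared bin edges so the two histograms are directly comparable
bins = np.linspace(1, max(max(ranks_sw), max(ranks_now)) + 1, 21)
plt.hist(ranks_sw, alpha=0.5, label='With Stopwords', bins=bins)
plt.hist(ranks_now, alpha=0.5, label='Without Stopwords', bins=bins)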
plt.xlabel('Rank of Relevant Document')
plt.ylabel('Frequency')
plt.title('Distribution of Ranks (Lower is Better)')
plt.legend()
plt.grid(True)
plt.show()

print(f"Difference in Mean Rank (with - without): {mr_sw - mr_now:.2f}")
if mr_now < mr_sw:
    print("Conclusion: Removing stopwords IMPROVED retrieval performance.")
else:
    print("Conclusion: Removing stopwords DID NOT improve retrieval performance (Mean Rank unchanged or worse).")

### Interpretation
- **Mean Rank**: The average position of the correct document. Lower is better (1.0 is perfect).
- **Recall@10**: Fraction of queries for which the correct document appeared in the top 10 results (1.0 is perfect).
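
As a quick worked example of the two metrics, the sketch below computes them from a list of 1-based ranks. The rank values and the helper names `mean_rank` and `recall_at_k` are hypothetical and are not results from this experiment.

```python
def mean_rank(ranks):
    """Average 1-based rank of the correct document (lower is better)."""
    return sum(ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose correct document is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks for five queries
example_ranks = [1, 1, 3, 12, 1]

print(mean_rank(example_ranks))        # 3.6
print(recall_at_k(example_ranks, 10))  # 0.8
```

A single badly ranked query (rank 12 here) pulls the Mean Rank up noticeably, while Recall@10 only records that one query out of five missed the top 10.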

In a highly specific retrieval task such as this known-item setup, stopwords can sometimes help by providing phrase specificity, whereas in general topic retrieval they tend to add noise. A lower Mean Rank after removal suggests that the custom stopword list is effective for this task.