# Linguistic Analysis & Recommendations

This final notebook analyzes the linguistic composition of the Khmer stopword list. 
We aim to understand **which types** of words (e.g., Prepositions, Particles, Pronouns) contribute most to the "noise" in Khmer text.

## Key Differences from Previous Steps
- **Notebook 02**: Analyzed frequency at the *word* level (TF-IDF).
- **Notebook 03**: Evaluated the *system* performance (IR).
- **Notebook 04 (This)**: Analyzes at the **Linguistic Category** level to form standard recommendations.

## Goal
Provide a standardized recommendation for Khmer Stopword Removal: **Which categories should always be removed?**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import os

# Ensure plots look nice
sns.set_style("whitegrid")
plt.rcParams['font.family'] = 'sans-serif' # Fallback for non-Khmer fonts in charts if needed

## 1. Analyze the Stopword List Structure
First, we load the annotated stopword file to see the distribution of categories.

In [None]:
STOPWORDS_PATH = "../stopwords/FIle_Stopwords.csv"

if not os.path.exists(STOPWORDS_PATH):
    print("Stopword file not found! Please check the path.")
else:
    # Load CSV
    df_sw = pd.read_csv(STOPWORDS_PATH)
    
    # Clean column names (strip spaces)
    df_sw.columns = [c.strip() for c in df_sw.columns]
    
    # Standardize Group Names
    df_sw['linguistic_group'] = df_sw['linguistic_group'].str.strip().str.title()
    
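    # len(df_sw) counts rows, so report unique terms separately in case of duplicates
    print(f"Total terms loaded: {len(df_sw)} ({df_sw['term'].nunique()} unique)")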
    print("Columns:", df_sw.columns.tolist())
    display(df_sw.head())

In [None]:
# Count terms per category
category_counts = df_sw['linguistic_group'].value_counts()

plt.figure(figsize=(10, 6))
sns.barplot(x=category_counts.values, y=category_counts.index, palette="viridis")
plt.title("Number of Stopwords per Linguistic Category")
plt.xlabel("Count of Unique Terms")
plt.show()

## 2. Analyze Usage Frequency in Corpus
Counting how many *words* a category contains is not enough; what matters is **how often** those words appear in real text.
A category with only 5 words (like 'Particles') might account for 20% of the total word count in a document!

In [None]:
# Load Corpus (Sample)
DATA_PATH = "../data/raw/news_text_file_150k.txt"

def get_corpus_word_counts(filepath, limit=5000):
    total_word_counts = Counter()
    
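    if not os.path.exists(filepath):
        # Report the problem instead of silently returning an empty Counter
        print(f"Corpus file not found: {filepath}")
        return total_word_counts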
    
    try:
        from khmernltk import word_tokenize
    except ImportError:
        print("Please install khmer-nltk first!")
        return total_word_counts

    with open(filepath, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            line = line.strip()
            if line:
                tokens = word_tokenize(line)
                total_word_counts.update(tokens)
    return total_word_counts

# Get counts from sample
corpus_counts = get_corpus_word_counts(DATA_PATH, limit=2000)
print(f"Analyzed sample corpus. Found {len(corpus_counts)} unique tokens.")

In [None]:
# Map each stopword to its total frequency in the corpus
df_sw['corpus_frequency'] = df_sw['term'].apply(lambda x: corpus_counts.get(x, 0))

# Group by Category to see total noise contribution
category_impact = df_sw.groupby('linguistic_group')['corpus_frequency'].sum().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=category_impact.values, y=category_impact.index, palette="magma")
plt.title("Total Frequency in Corpus by Category (Noise Contribution)")
plt.xlabel("Total Occurrences in Corpus Sample")
plt.grid(axis='x')
plt.show()
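
The chart above shows raw counts per category. To compare categories on the same scale as the "20% of the total word count" example, the counts can be normalized by the total number of tokens in the sample. The following is a minimal sketch on toy data: `toy_sw`, `toy_total_tokens`, and `category_share` are hypothetical names introduced for illustration, and the numbers are made up rather than taken from the corpus.

```python
import pandas as pd

# Toy stand-in for df_sw after 'corpus_frequency' has been added.
# All values are illustrative, not measured from the news corpus.
toy_sw = pd.DataFrame({
    "term": ["a", "b", "c", "d"],
    "linguistic_group": ["Particle", "Particle", "Preposition", "Pronoun"],
    "corpus_frequency": [120, 80, 150, 50],
})
toy_total_tokens = 1000  # hypothetical size of the corpus sample


def category_share(df, total_tokens):
    """Return each category's share of all corpus tokens, in percent."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    totals = df.groupby("linguistic_group")["corpus_frequency"].sum()
    return (totals / total_tokens * 100).sort_values(ascending=False)


shares = category_share(toy_sw, toy_total_tokens)
print(shares.round(1))
```

With the real data, the same function could be called as `category_share(df_sw, sum(corpus_counts.values()))`, assuming the sample is not empty.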

## 3. Findings & Recommendations

Based on the data above, we categorize stopword groups into **Tiers of Importance** for removal.

### Tier 1: Critical to Remove (High Frequency / Low Meaning)
These are purely functional and appear constantly.
- **Particles (Ex: ក៏, នូវ, នៃ)**: These often have the highest frequency and zero semantic value in search.
- **Prepositions (Ex: នៅ, ក្នុង, ពី)**: Necessary for grammar but noise for topic modeling.
- **Conjunctions (Ex: និង, ហើយ, ប៉ុន្តែ)**: Connectors that don't change the topic.

### Tier 2: Recommended to Remove (Medium Frequency)
- **Pronouns (Ex: ខ្ញុំ, គេ, ឯង)**: Usually safe to remove, unless doing "Author Identification" or specific Entity Extraction.
- **Determiners/Quantifiers (Ex: នេះ, នោះ, ខ្លះ)**: Very frequent, low information content.

### Tier 3: Context Dependent (Keep for some tasks)
- **Auxiliary Verbs (Ex: បាន, កំពុង)**: Temporal markers. Useful for Sentiment Analysis (tense matters) but not for Keyword Search.
- **Numbers**: Often removed, but crucial for financial/technical documents.

### Tier 4: Do Not Remove (Content Words)
- If any nouns or specific verbs ended up in the list, they should be filtered out (as we did in Notebook 03).
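
The tiers above can be turned into a configurable filter. The sketch below is one possible way to do it, not the method used in the earlier notebooks: `TIER_BY_GROUP`, `build_stopword_set`, and `remove_stopwords` are hypothetical names, and the group labels in the mapping are assumptions that would need to match the actual values in the `linguistic_group` column.

```python
# Hypothetical tier assignment; keys must match the 'linguistic_group' values
# in the stopword file (assumed labels, adjust to the real ones).
TIER_BY_GROUP = {
    "Particle": 1, "Preposition": 1, "Conjunction": 1,
    "Pronoun": 2, "Determiner": 2,
    "Auxiliary Verb": 3, "Number": 3,
}


def build_stopword_set(term_groups, max_tier=2):
    """Collect terms whose group belongs to a tier <= max_tier.

    term_groups is an iterable of (term, group) pairs. Groups missing from
    TIER_BY_GROUP are treated as Tier 4 and are never removed.
    """
    return {
        term for term, group in term_groups
        if TIER_BY_GROUP.get(group, 4) <= max_tier
    }


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set, preserving order."""
    return [t for t in tokens if t not in stopwords]


# Example terms taken from the tier lists above
pairs = [
    ("និង", "Conjunction"),
    ("នៅ", "Preposition"),
    ("ខ្ញុំ", "Pronoun"),
    ("បាន", "Auxiliary Verb"),
]
stop_t1 = build_stopword_set(pairs, max_tier=1)
stop_t2 = build_stopword_set(pairs, max_tier=2)

# Illustrative token sequence (already tokenized)
tokens = ["ខ្ញុំ", "បាន", "នៅ", "ផ្ទះ", "និង", "សាលា"]
print(remove_stopwords(tokens, stop_t1))
print(remove_stopwords(tokens, stop_t2))
```

With the real stopword file, the pairs could come from `zip(df_sw['term'], df_sw['linguistic_group'])`. Raising `max_tier` to 3 would also drop auxiliary verbs and numbers, which fits keyword search but not tasks where temporal markers or figures matter.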

In [None]:
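# Calculate the percentage of tokens removed if we drop every term in the
# stopword list (all categories, not only Tier 1 & 2)
total_corpus_tokens = sum(corpus_counts.values())
stopwords_total_tokens = df_sw['corpus_frequency'].sum()

print(f"Total Tokens in Sample: {total_corpus_tokens}")
print(f"Tokens identified as Stopwords: {stopwords_total_tokens}")
if total_corpus_tokens > 0:
    print(f"Percentage of Text Reduced: {stopwords_total_tokens / total_corpus_tokens * 100:.2f}%")
else:
    # Avoid a ZeroDivisionError when the corpus file or tokenizer was unavailable
    print("Corpus sample is empty; cannot compute the percentage.")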

print("\nThis massive reduction (~30-50% usually) explains why IR performance improves: ")
print("The model can focus on the remaining content words that actually carry meaning.")

# Linguistic Analysis