# 🧪 Project Trend Hunter: Analysis Playground

Welcome to the interactive test bench! Here you can run the entire trend detection pipeline step-by-step, toggle different methods, and visualize the results immediately.

### 🎯 Objectives:
1.  **Compare Methods**: Semantic (Google Trends) vs. Hybrid (Cluster-First).
2.  **Verify Reranking**: See how much Cross-Encoder reranking changes the results.
3.  **Inspect Data**: View raw posts, clusters, and sentiment scores.

---

In [None]:
# 1. Setup & Imports
import sys
import os
import glob
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from rich.console import Console

# Ensure project root is in path
sys.path.append(os.path.abspath('..'))

from crawlers.analyze_trends import (
    find_matches,
    find_matches_hybrid,
    load_social_data,
    load_news_data,
    load_google_trends,
)
from crawlers.alias_normalizer import build_alias_dictionary

console = Console()
pd.set_option('display.max_colwidth', 100)
%matplotlib inline

## ⚙️ Configuration
Adjust these parameters to control the experiment.

In [None]:
LIMIT_POSTS = 500  # Set to None for full run (~4600 posts), 500 for testing
RERANK = True      # Enable Cross-Encoder Reranking (Slower but precise)
USE_PHOBERT = True # Use PhoBERT for sentiment
THRESHOLD = 0.5    # Similarity threshold

## 📂 1. Load Data

In [None]:
# Load Trends
trend_files = glob.glob("../crawlers/trendings/*.csv")
trends = load_google_trends(trend_files)
print(f"Loaded {len(trends)} trends.")

# Load Social & News
fb_files = glob.glob("../crawlers/facebook/*.json")
news_files = glob.glob("../crawlers/news/**/*.csv", recursive=True)
posts = load_social_data(fb_files) + load_news_data(news_files)

# Only truncate when a limit is set; None means "use every post"
if LIMIT_POSTS is not None:
    posts = posts[:LIMIT_POSTS]

# Helper: Extract contents
post_contents = [p['processed_content'] for p in posts]
print(f"Loaded {len(posts)} posts for analysis.")

## 🔬 2. Run Semantic Analysis (Baseline)
Standard Bi-Encoder matching: posts and trends are embedded independently and compared by similarity (fast, but fuzzy).
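
As a rough illustration of what threshold-based matching does, here is a minimal pure-Python sketch. It is not the code inside `find_matches`: the helper names (`cosine`, `match_posts`) and the toy 2-D vectors are hypothetical, and the output keys only mirror the `trend` and `is_matched` columns used later in this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_posts(post_vecs, trend_vecs, trend_names, threshold=0.5):
    """Pick the most similar trend for each post and flag it if it clears the threshold."""
    rows = []
    for vec in post_vecs:
        scores = [cosine(vec, t) for t in trend_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        rows.append({
            "trend": trend_names[best],
            "score": scores[best],
            "is_matched": scores[best] >= threshold,
        })
    return rows

# Toy 2-D "embeddings" (hypothetical; real ones would come from the sentence encoder).
toy_trends = [[1.0, 0.0], [0.0, 1.0]]
toy_names = ["trend_a", "trend_b"]
toy_posts = [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]]
toy_rows = match_posts(toy_posts, toy_trends, toy_names, threshold=0.5)
for row in toy_rows:
    print(row)
```

Raising `THRESHOLD` in the configuration cell should shrink the set of matched posts, and lowering it should admit looser matches.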

In [None]:
print("Running Semantic Matching...")
matches_semantic = find_matches(
    posts, trends, 
    threshold=THRESHOLD, 
    model_name="paraphrase-multilingual-mpnet-base-v2",
    saving_mode=True  # Return list directly
)
df_sem = pd.DataFrame(matches_semantic)
print("Semantic Match Count:", len(df_sem[df_sem['is_matched'] == True]))
df_sem.head(3)

## 🚀 3. Run Hybrid Analysis (Cluster-First)
This clusters posts with HDBSCAN first, then matches the clusters against trends, reranking with a Cross-Encoder when `RERANK` is enabled.
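
The cluster-first idea can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the logic of `find_matches_hybrid`: it takes cluster labels as given (HDBSCAN marks noise points with `-1`), compares each cluster centroid to the trends, and assigns a topic type. `Discovery` and `Noise` are the labels used in this notebook. `Trending`, the `cluster_<id>` naming, and the function `label_clusters` are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_clusters(post_vecs, cluster_labels, trend_vecs, trend_names, threshold=0.5):
    """Assign a topic to every post based on its cluster's centroid."""
    # Group vectors by cluster id, skipping noise points (label -1).
    groups = defaultdict(list)
    for vec, label in zip(post_vecs, cluster_labels):
        if label != -1:
            groups[label].append(vec)

    topics = {}
    for label, vecs in groups.items():
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        scores = [cosine(centroid, t) for t in trend_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            # "Trending" is a hypothetical name for clusters that match a trend.
            topics[label] = {"final_topic": trend_names[best], "topic_type": "Trending"}
        else:
            topics[label] = {"final_topic": f"cluster_{label}", "topic_type": "Discovery"}

    noise = {"final_topic": None, "topic_type": "Noise"}
    return [dict(topics.get(label, noise)) for label in cluster_labels]

# Toy data: two clusters plus one noise point, and a single trend.
toy_vecs = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0], [5.0, 5.0]]
toy_labels = [0, 0, 1, 1, -1]
toy_result = label_clusters(toy_vecs, toy_labels, [[1.0, 0.0]], ["trend_a"])
print([r["topic_type"] for r in toy_result])
```

Matching at the cluster level is what allows unmatched clusters to surface as discoveries instead of being dropped.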

In [None]:
print(f"Running Hybrid Analysis (Rerank={RERANK})...")
matches_hybrid = find_matches_hybrid(
    posts, trends, 
    threshold=THRESHOLD, 
    model_name="paraphrase-multilingual-mpnet-base-v2",
    rerank=RERANK
)
df_hyb = pd.DataFrame(matches_hybrid)
print("Hybrid Topics Found:", df_hyb['final_topic'].nunique())
df_hyb[['final_topic', 'topic_type', 'score', 'post_content']].head(5)

## 📊 4. Comparison & Visualization
Let's see the metrics side-by-side.

In [None]:
# Comparison Data
sem_matched = df_sem[df_sem['is_matched'] == True]        # Semantic rows matched to a trend
hyb_clustered = df_hyb[df_hyb['topic_type'] != 'Noise']   # Hybrid rows not labelled Noise
stats = {
    'Method': ['Semantic', 'Hybrid'],
    'Total Matched/Clustered': [len(sem_matched), len(hyb_clustered)],
    'Unique Topics': [
        sem_matched['trend'].nunique(),
        hyb_clustered['final_topic'].nunique()
    ]
}
df_stats = pd.DataFrame(stats)

fig, ax = plt.subplots(1, 2, figsize=(12, 5))
sns.barplot(data=df_stats, x='Method', y='Total Matched/Clustered', ax=ax[0], palette='viridis')
ax[0].set_title("Coverage (Total Posts)")

sns.barplot(data=df_stats, x='Method', y='Unique Topics', ax=ax[1], palette='magma')
ax[1].set_title("Diversity (Unique Topics)")
plt.tight_layout()
plt.show()
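
The charts compare Semantic against Hybrid, but not reranking on against reranking off. One way to check the reranking objective is to run `find_matches_hybrid` twice, once with `rerank=False` and once with `rerank=True`, and list the posts whose topic changed. The helper below is a hypothetical sketch: it assumes both runs return one row per post in the same order, with `post_content` and `final_topic` keys as in the hybrid output above.

```python
def topic_changes(rows_plain, rows_reranked):
    """List posts whose final_topic differs between two runs over the same posts."""
    changes = []
    for plain, reranked in zip(rows_plain, rows_reranked):
        if plain["final_topic"] != reranked["final_topic"]:
            changes.append({
                "post_content": plain["post_content"],
                "without_rerank": plain["final_topic"],
                "with_rerank": reranked["final_topic"],
            })
    return changes

# Toy rows (hypothetical) standing in for two hybrid runs.
toy_plain = [
    {"post_content": "post one", "final_topic": "trend_a"},
    {"post_content": "post two", "final_topic": "trend_a"},
    {"post_content": "post three", "final_topic": "trend_b"},
]
toy_reranked = [
    {"post_content": "post one", "final_topic": "trend_a"},
    {"post_content": "post two", "final_topic": "trend_b"},
    {"post_content": "post three", "final_topic": "trend_b"},
]
toy_changes = topic_changes(toy_plain, toy_reranked)
print(f"{len(toy_changes)} of {len(toy_plain)} posts changed topic")
```

If the two runs do not align row by row, the rows would need to be joined on a stable post identifier first.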

## 🌟 5. Discovery Viewer
Let's look at the **new discoveries** found by the Hybrid method: clusters that did not match any trend.

In [None]:
discoveries = df_hyb[df_hyb['topic_type'] == 'Discovery']
top_discoveries = discoveries['final_topic'].value_counts().head(10)

print("Top 10 New Discoveries:")
print(top_discoveries)

# Show samples
if not top_discoveries.empty:
    top_topic = top_discoveries.index[0]
    print(f"\nSample posts for top discovery '{top_topic}':")
    print(discoveries[discoveries['final_topic'] == top_topic]['post_content'].head(3).values)