# RAG Sport Articles - Semantic Search

## Objectives:
1. Convert raw text into dense vector representations
2. Build and query a vector index (FAISS)
3. Evaluate semantic similarity between queries and documents

## Pipeline:
`Text → Encoder → Vectors → ANN Index → Query → Ranked Results`

## Step 1: Load and Prepare Data

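Load the cleaned articles from `scraping_result/detik_sport_articles_cleaned.json`, check that the required columns are present, and drop articles whose content is 50 characters or shorter.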

In [1]:
import json
import os
import numpy as np
import pandas as pd
from tqdm import tqdm

# Load cleaned articles
DATA_PATH = "scraping_result/detik_sport_articles_cleaned.json"

with open(DATA_PATH, "r", encoding="utf-8") as f:
    articles = json.load(f)

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(articles)
print(f"Loaded {len(df)} articles")
print(f"\nColumns: {df.columns.tolist()}")

# Validate required columns
required_columns = ['title', 'content', 'date', 'url']
missing_columns = [col for col in required_columns if col not in df.columns]

if missing_columns:
    print(f"⚠️ Missing columns: {missing_columns}")
    # Create placeholder columns if missing
    for col in missing_columns:
        if col == 'title':
            df['title'] = df['content'].str[:50] + "..."
        elif col == 'date':
            df['date'] = "Unknown"
        elif col == 'url':
            df['url'] = ""
else:
    print("✅ All required columns present")

# Remove rows with empty or very short content
initial_count = len(df)
df = df[df['content'].str.len() > 50].reset_index(drop=True)
print(f"Removed {initial_count - len(df)} articles with very short content")
print(f"Final dataset: {len(df)} articles")

print(f"\nSample article:")
print(f"Title: {df.iloc[0]['title']}")
print(f"Content preview: {df.iloc[0]['content'][:300]}...")

Loaded 397 articles

Columns: ['url', 'title', 'date', 'author', 'content']
✅ All required columns present
Removed 1 articles with very short content
Final dataset: 396 articles

Sample article:
Title: Deco: Rashford Menderita di MU
Content preview: Jakarta - Performa Marcus Rashford membaik di Barcelona . Direktur Barcelona Deco mengungkap penyebab Rashford kesulitan di Manchester United . Rashford dipinjamkan MU ke Barcelona pada musim panas lalu. Penyerang Inggris itu dilepas usai tidak masuk ke rencana Ruben Amorim. Ini jadi kali kedua seca...


## Step 2: Dense Encoding with Multilingual E5

Using `intfloat/multilingual-e5-base` because:
- ✅ Supports 100+ languages including **Indonesian**
- ✅ Strong performance on multilingual semantic search
- ✅ Good balance between accuracy and speed

**Note:** E5 models require a prefix for the input:
- For documents/passages: `"passage: "` + text
- For queries: `"query: "` + text
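
As a minimal sketch of the prefix convention above, the two helpers below add the prefixes before text is passed to the encoder. The names `as_passage` and `as_query` are hypothetical and do not appear in the notebook, which builds the same strings inline.

```python
# Hypothetical helpers (not part of the notebook) that apply the E5 input prefixes.
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "


def as_passage(text: str) -> str:
    """Prefix a document so the model encodes it as a passage."""
    return PASSAGE_PREFIX + text.strip()


def as_query(text: str) -> str:
    """Prefix a search string so the model encodes it as a query."""
    return QUERY_PREFIX + text.strip()


print(as_passage("Deco: Rashford Menderita di MU"))
print(as_query("Ronaldo Piala Dunia 2026"))
```

Keeping the prefixes in one place makes it harder to index documents with one prefix and search with the wrong one.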

In [2]:
# Disable tqdm widget to avoid ipywidget rendering issues in VS Code
os.environ["TQDM_DISABLE"] = "0"  # Keep tqdm but use text mode

from sentence_transformers import SentenceTransformer
import logging

# Suppress unnecessary warnings
logging.getLogger("sentence_transformers").setLevel(logging.WARNING)

# Load the multilingual E5 model
MODEL_NAME = "intfloat/multilingual-e5-base"
print(f"Loading model: {MODEL_NAME}")

model = SentenceTransformer(MODEL_NAME, device="cpu")  # Explicitly set device
print(f"Model loaded successfully!")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

Loading model: intfloat/multilingual-e5-base
Model loaded successfully!
Embedding dimension: 768


In [3]:
# Prepare documents for encoding
# E5 models require "passage: " prefix for documents

def prepare_documents(df: pd.DataFrame) -> list[str]:
    """
    Prepare documents for E5 encoding.
    Combines title and content with the required prefix.
    """
    documents = []
    for _, row in df.iterrows():
        # Combine title and content for richer representation
        text = f"{row['title']}. {row['content']}"
        # Add E5 passage prefix
        documents.append(f"passage: {text}")
    return documents

# Prepare all documents
documents = prepare_documents(df)
print(f"Prepared {len(documents)} documents for encoding")
print(f"\nSample document (truncated):")
print(documents[0][:500] + "...")

Prepared 396 documents for encoding

Sample document (truncated):
passage: Deco: Rashford Menderita di MU. Jakarta - Performa Marcus Rashford membaik di Barcelona . Direktur Barcelona Deco mengungkap penyebab Rashford kesulitan di Manchester United . Rashford dipinjamkan MU ke Barcelona pada musim panas lalu. Penyerang Inggris itu dilepas usai tidak masuk ke rencana Ruben Amorim. Ini jadi kali kedua secara beruntun Rashford dipinjamkan oleh MU. Sebelumnya, pesepakbola berusia 28 tahun itu dipinjamkna ke Aston Villa pada paruh kedua musim 2024/2025. Bersama Bar...


In [4]:
# Encode all documents into dense vectors
print("Encoding documents... (this may take a few minutes)")

# Encode with text-based progress bar (not widget)
embeddings = model.encode(
    documents,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True  # Normalize for cosine similarity
)

print(f"\nEncoding complete!")
print(f"Embeddings shape: {embeddings.shape}")
print(f"Each document is represented by a {embeddings.shape[1]}-dimensional vector")

Encoding documents... (this may take a few minutes)


Encoding complete!
Embeddings shape: (396, 768)
Each document is represented by a 768-dimensional vector


In [5]:
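# Save embeddings for later use
EMBEDDINGS_PATH = "embedding/article_embeddings.npy"

# Create the output directory first; np.save does not create missing folders
os.makedirs(os.path.dirname(EMBEDDINGS_PATH), exist_ok=True)
np.save(EMBEDDINGS_PATH, embeddings)
print(f"Embeddings saved to {EMBEDDINGS_PATH}")
print(f"File size: {os.path.getsize(EMBEDDINGS_PATH) / 1024 / 1024:.2f} MB")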

Embeddings saved to embedding/article_embeddings.npy
File size: 1.16 MB
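
The reported size can be checked by hand from the embedding shape. The sketch below assumes the embeddings are stored as 32-bit floats (4 bytes per value); that dtype is an assumption here, since the notebook does not print it.

```python
import numpy as np

n_docs, dim = 396, 768  # shape reported by the encoding cell
bytes_per_value = np.dtype(np.float32).itemsize  # assumption: float32 embeddings

# Raw array size: documents x dimensions x bytes per value, converted to MB
size_mb = n_docs * dim * bytes_per_value / 1024 / 1024
print(f"{size_mb:.2f} MB")
```

The same arithmetic gives a quick estimate of how storage grows with more articles: each additional document adds `dim * 4` bytes under this assumption.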


## Step 3: Building FAISS Vector Index

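FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. This notebook uses the exact `IndexFlatIP` index; because the embeddings are L2-normalized, its inner-product scores equal cosine similarity.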
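
To illustrate why an inner-product index is enough for cosine similarity once vectors are L2-normalized, here is a small NumPy check. It uses random toy vectors rather than the article embeddings, and `l2_normalize` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8)).astype("float32")  # toy stand-ins for document embeddings
query = rng.normal(size=(8,)).astype("float32")   # toy stand-in for a query embedding


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each vector to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Cosine similarity computed directly from the raw vectors
cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Plain inner product after normalizing both sides
inner_product = l2_normalize(docs) @ l2_normalize(query)

print(np.allclose(cosine, inner_product, atol=1e-6))
```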

In [6]:
import faiss

# Get embedding dimension
dimension = embeddings.shape[1]

# Create FAISS index
# Using IndexFlatIP for Inner Product (cosine similarity with normalized vectors)
index = faiss.IndexFlatIP(dimension)

# Add embeddings to the index
index.add(embeddings.astype('float32'))

print(f"FAISS index created!")
print(f"Index type: Flat Inner Product (exact search)")
print(f"Dimension: {dimension}")
print(f"Total vectors indexed: {index.ntotal}")

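# Save FAISS index for later use
INDEX_PATH = "faiss/faiss_index.bin"
# Create the output directory first; write_index does not create missing folders
os.makedirs(os.path.dirname(INDEX_PATH), exist_ok=True)
faiss.write_index(index, INDEX_PATH)
print(f"\n💾 FAISS index saved to {INDEX_PATH}")
print(f"File size: {os.path.getsize(INDEX_PATH) / 1024 / 1024:.2f} MB")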

FAISS index created!
Index type: Flat Inner Product (exact search)
Dimension: 768
Total vectors indexed: 396

💾 FAISS index saved to faiss/faiss_index.bin
File size: 1.16 MB


### Optional: Load Existing Embeddings and Index

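Run this cell instead of the Step 2 and Step 3 cells if you already have saved embeddings and a saved index. Step 1 must still be run first, because the search function reads titles and content from `df`.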
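
The load-or-recompute decision can also be written as a small reusable helper. The sketch below is one possible shape for it, not the notebook's method: `load_or_compute`, `fake_encode`, and the demo path are hypothetical, and it caches only a NumPy array, whereas the notebook also reloads the FAISS index.

```python
import os
import tempfile

import numpy as np


def load_or_compute(path, compute_fn):
    """Return (array, cache_hit): load `path` if present, else compute and save it."""
    if os.path.exists(path):
        return np.load(path), True
    array = compute_fn()
    # Create the parent folder before saving, since np.save will not
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    np.save(path, array)
    return array, False


def fake_encode():
    """Stand-in for the expensive encoding step."""
    return np.ones((3, 4), dtype="float32")


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "embedding", "demo_embeddings.npy")
    first, hit_first = load_or_compute(cache_path, fake_encode)    # computes and saves
    second, hit_second = load_or_compute(cache_path, fake_encode)  # loads from disk
    print(hit_first, hit_second)
```

A cache like this is only valid while the saved rows still line up with the rows of `df`, so it should be rebuilt whenever the article file changes.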

In [7]:
# Skip encoding if embeddings and index already exist
# Run this cell INSTEAD of Step 2 encoding cells if you want to reload existing data

import faiss
from sentence_transformers import SentenceTransformer

EMBEDDINGS_PATH = "embedding/article_embeddings.npy"
INDEX_PATH = "faiss/faiss_index.bin"
MODEL_NAME = "intfloat/multilingual-e5-base"

if os.path.exists(EMBEDDINGS_PATH) and os.path.exists(INDEX_PATH):
    print("📂 Found existing embeddings and index. Loading...")
    embeddings = np.load(EMBEDDINGS_PATH)
    index = faiss.read_index(INDEX_PATH)

    # Load model for query encoding
    print(f"Loading model: {MODEL_NAME}")
    model = SentenceTransformer(MODEL_NAME)

    print(f"✅ Loaded embeddings: {embeddings.shape}")
    print(f"✅ Loaded FAISS index: {index.ntotal} vectors")
    print(f"✅ Model loaded for query encoding")
else:
    print("❌ No existing data found. Please run Step 2 & 3 cells to encode documents.")

📂 Found existing embeddings and index. Loading...
Loading model: intfloat/multilingual-e5-base
✅ Loaded embeddings: (396, 768)
✅ Loaded FAISS index: 396 vectors
✅ Model loaded for query encoding


## Step 4: Query and Retrieval

Create a search function that:
1. Encodes the query using the same model
2. Uses FAISS to find top-k similar documents
3. Returns the rank, similarity score, title, date, content preview, and URL of each match
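
The retrieval step can be pictured without FAISS. The NumPy sketch below performs the same kind of exact top-k inner-product search on three toy vectors; `top_k_inner_product` and the toy data are hypothetical and only stand in for the FAISS call used in the notebook.

```python
import numpy as np


def top_k_inner_product(query_vec, doc_matrix, k):
    """Exact top-k search: score every document, then keep the k best."""
    k = min(k, doc_matrix.shape[0])   # never ask for more results than documents
    scores = doc_matrix @ query_vec   # one inner product per document
    order = np.argsort(-scores)[:k]   # indices sorted by descending score
    return scores[order], order


# Toy unit-length "embeddings" and their titles
docs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]], dtype="float32")
titles = ["doc A", "doc B", "doc C"]
query = np.array([0.0, 1.0], dtype="float32")

scores, indices = top_k_inner_product(query, docs, k=2)
for rank, (score, idx) in enumerate(zip(scores, indices), start=1):
    print(rank, titles[idx], round(float(score), 2))
```

Mapping the returned indices back to `titles` mirrors how the notebook maps FAISS row indices back to rows of `df`.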

In [17]:
def search(query: str, top_k: int = 5) -> pd.DataFrame:
    """
    Search for semantically similar documents.

    Args:
        query: Search query in Indonesian or English
        top_k: Number of results to return

    Returns:
        DataFrame with search results
    """
    # E5 requires "query: " prefix for queries
    query_with_prefix = f"query: {query}"

    # Encode the query
    query_embedding = model.encode(
        [query_with_prefix],
        normalize_embeddings=True,
        convert_to_numpy=True
    )

    # Search in FAISS index
    scores, indices = index.search(query_embedding.astype('float32'), top_k)

    # Build results DataFrame
    results = []
    for i, (score, idx) in enumerate(zip(scores[0], indices[0])):
        # FAISS pads with -1 when fewer than top_k vectors are available
        if idx < 0:
            continue
        row = df.iloc[idx]
        results.append({
            'rank': i + 1,
            'score': float(score),
            'title': row['title'],
            'date': row['date'],
            'content_preview': row['content'][:200] + "...",
            'url': row['url']
        })

    return pd.DataFrame(results)


def display_results(query: str, top_k: int = 5):
    """Pretty print search results"""
    print(f"🔍 Query: \"{query}\"")
    print("=" * 80)

    results = search(query, top_k)

    for _, row in results.iterrows():
        print(f"\n#{row['rank']} | Score: {row['score']:.4f}")
        print(f"📰 {row['title']}")
        print(f"📅 {row['date']}")
        print(f"📝 {row['content_preview']}")
        print("-" * 80)

    return results

## Step 5: Qualitative Evaluation

Test the semantic search with several queries, mostly in Indonesian plus one in English, and check whether the top results are semantically related to each query.

In [18]:
# Test Query 1: About a specific football player
results1 = display_results("Ronaldo Piala Dunia 2026", top_k=5)

🔍 Query: "Ronaldo Piala Dunia 2026"

#1 | Score: 0.8853
📰 Ronaldo Lolos Sanksi, Bisa Main di Fase Grup Piala Dunia 2026
📅 Rabu, 26 Nov 2025 07:00 WIB
📝 Zurich - Timnas Portugal bisa bernapas lega. Sebab Cristiano Ronaldo lolos dari sanksi kartu merah dan bisa main di Piala Dunia 2026 . Sebelumnya, Ronaldo dikartumerah saat Portugal tumbang 0-2 di kan...
--------------------------------------------------------------------------------

#2 | Score: 0.8749
📰 Netizen 'Ngamuk' Usai FIFA Ringankan Hukuman Ronaldo di Piala Dunia 2026
📅 Rabu, 26 Nov 2025 10:08 WIB
📝 Jakarta - Cristiano Ronaldo mendapat keringanan sanksi kartu merah, yang membuatnya bisa tampil di dua laga awal Piala Dunia 2026 . FIFA langsung diamuk netizen. Ronaldo mendapat keringanan dari FIFA ...
--------------------------------------------------------------------------------

#3 | Score: 0.8699
📰 Pot Drawing Piala Dunia 2026: Potensi Haaland Vs Messi atau Ronaldo
📅 Rabu, 26 Nov 2025 14:00 WIB
📝

In [10]:
# Test Query 2: About a specific league
results2 = display_results("Klasemen Liga Inggris terbaru", top_k=5)

🔍 Query: "Klasemen Liga Inggris terbaru"

#1 | Score: 0.8500
📰 Jadwal Liga Inggris Tengah Pekan Ini: Man City Main Nanti Malam
📅 Selasa, 02 Des 2025 09:40 WIB
📝 Jakarta - Premier League akan menggelar pertandingan pekan ke-14 pada tengah pekan ini. Main lebih dulu, Manchester City berpeluang merapatkan jarak dengan Arsenal . City akan tandang ke markas Fulham...
--------------------------------------------------------------------------------

#2 | Score: 0.8383
📰 Liverpool Mau Stabil Dulu, Belum Pikirkan Klasemen
📅 Selasa, 02 Des 2025 11:00 WIB
📝 Jakarta - Kemenangan atas West Ham United memberi kesempatan Liverpool menata diri. Si Merah bertekad menemukan stabilitas dulu untuk saat ini. Setelah tiga kekalahan telak beruntun, Liverpool meraih ...
--------------------------------------------------------------------------------

#3 | Score: 0.8279
📰 Pemain dengan Assist Terbanyak di Liga Inggris
📅 Selasa, 02 Des 2025 14:40 WIB
📝 Daftar Isi Daftar Pemain deng

In [11]:
# Test Query 3: About the Indonesian national team (Timnas Indonesia)
results3 = display_results("Pelatih baru Timnas Indonesia", top_k=5)

🔍 Query: "Pelatih baru Timnas Indonesia"

#1 | Score: 0.8482
📰 Gabung Navbahor, Kapadze Dipastikan Tidak Latih Timnas Indonesia
📅 Selasa, 02 Des 2025 00:00 WIB
📝 Jakarta - Timur Kapadze dipastikan tidak akan menjadi pelatih Timnas Indonesia . Dia sudah menerima pinangan klub Navbahor Namangan. "Navbahor hari ini secara resmi memperkenalkan Timur Kapadze sebaga...
--------------------------------------------------------------------------------

#2 | Score: 0.8391
📰 SEA Games 2025: Timnas Basket Coret 3 Nama
📅 Selasa, 14 Okt 2025 18:30 WIB
📝 Jakarta - Timnas basket Indonesia akan kembali mencoret tiga nama. Itu sebagai bagian dari persiapan mengikuti rangkaian uji coba di Australia menuju SEA Games 2025. Diketahui, Timnas telah menjalani ...
--------------------------------------------------------------------------------

#3 | Score: 0.8345
📰 BTN Panggil 24 Pebasket Ikuti TC SEA Games 2025
📅 Senin, 08 Sep 2025 18:30 WIB
📝 Jakarta - Sebanyak 24 pebasket putra

In [12]:
# Test Query 4: About player transfers
results4 = display_results("Barcelona beli pemain baru", top_k=5)

🔍 Query: "Barcelona beli pemain baru"

#1 | Score: 0.8257
📰 Hari Sempurna buat Barcelona: Main Lagi di Camp Nou, Menang, Clean Sheet
📅 Minggu, 23 Nov 2025 11:00 WIB
📝 Jakarta - Barcelona menandai comeback-nya ke Camp Nou dengan kemenangan telak atas Athletic Bilbao . Segalanya terasa sempurna untuk Blaugrana. Barcelona menjamu Bilbao di Camp Nou dalam lanjutan LaLi...
--------------------------------------------------------------------------------

#2 | Score: 0.8250
📰 Barcelona Vs Bilbao: Saatnya Barca Rebut Puncak Klasemen di Camp Nou
📅 Sabtu, 22 Nov 2025 15:00 WIB
📝 Barcelona - Barcelona kembali berlaga di Camp Nou usai nyaris tiga tahun. Kemenangan atas Athletic Bilbao akan melambungkan Barca ke puncak klasemen, setidaknya untuk sementara. Stadion Camp Nou resmi...
--------------------------------------------------------------------------------

#3 | Score: 0.8145
📰 Usia Hanya Sekadar Angka untuk Lewandowski
📅 Minggu, 23 Nov 2025 19:30 WIB
📝 Barcelona

In [13]:
# Test Query 5: Query in English (tests multilingual capability)
results5 = display_results("Manchester United manager problems", top_k=5)

🔍 Query: "Manchester United manager problems"

#1 | Score: 0.8233
📰 Sir Beckham: Amorim Pelan-pelan Bawa MU Tampil Sip!
📅 Senin, 01 Des 2025 20:08 WIB
📝 Jakarta - Manchester United dinilai mulai membaik di bawah asuhan Ruben Amorim. Sir David Beckham yang mengungkap hal itu. Di klasemen Liga Inggris saat ini, MU ada di posisi ketujuh. The Red Devils m...
--------------------------------------------------------------------------------

#2 | Score: 0.8160
📰 Ruben Amorim Sadar Patrick Dorgu Cemas Tiap Kuasai Bola
📅 Senin, 01 Des 2025 19:17 WIB
📝 Jakarta - Patrick Dorgu mendapat kritik saat mengawal sisi kiri pertahanan Manchester United. Pemain asal Denmark itu dinilai terlalu cemas saat menguasai bola. Dorgu menjadi sorotan saat Man United k...
--------------------------------------------------------------------------------

#3 | Score: 0.8103
📰 Deco: Rashford Menderita di MU
📅 Selasa, 02 Des 2025 12:00 WIB
📝 Jakarta - Performa Marcus Rashford membaik di Bar

## Step 5b: Quantitative Evaluation

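Evaluate search quality with predefined test cases using a keyword-based proxy for **Precision@K**: a result counts as a hit when its title or content preview contains at least one expected keyword, and the score is the share of hits among the top K results.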
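
As a worked example of the metric, the sketch below computes precision@5 for one made-up hit pattern and then averages the per-query precision@5 values reported in this notebook's results table. `precision_at_k` and the `flags` list are hypothetical illustrations, not the notebook's evaluation code.

```python
def precision_at_k(hit_flags, k):
    """Share of the top-k results that are flagged as hits."""
    return sum(bool(flag) for flag in hit_flags[:k]) / k


# Hypothetical hit pattern for one query: 3 of the top 5 results match a keyword
flags = [True, True, False, True, False]
p_at_5 = precision_at_k(flags, k=5)

# Per-query precision@5 values as reported in the notebook's results table
reported = [1.0, 1.0, 0.8, 1.0, 1.0, 0.6, 1.0]
mean_precision = sum(reported) / len(reported)

print(f"precision@5 = {p_at_5:.1f}")
print(f"mean precision@5 = {mean_precision:.2%}")
```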

In [19]:
def evaluate_search_quality(test_queries: list[dict], top_k: int = 5) -> pd.DataFrame:
    """
    Evaluate search quality with predefined test cases.

    Args:
        test_queries: List of dicts with 'query' and 'expected_keywords'
        top_k: Number of results to evaluate

    Returns:
        DataFrame with evaluation metrics
    """
    results = []

    for test in test_queries:
        query = test['query']
        expected_keywords = test['expected_keywords']

        search_results = search(query, top_k=top_k)

        # Calculate hits: how many results contain at least one expected keyword
        hits = 0
        for _, row in search_results.iterrows():
            title_content = (row['title'] + " " + row['content_preview']).lower()
            if any(kw.lower() in title_content for kw in expected_keywords):
                hits += 1

        precision_at_k = hits / top_k

        results.append({
            'query': query,
            'expected_keywords': ', '.join(expected_keywords[:3]) + '...',
            f'hits@{top_k}': hits,
            f'precision@{top_k}': precision_at_k,
            'top_score': search_results.iloc[0]['score'],
            'avg_score': search_results['score'].mean()
        })

    return pd.DataFrame(results)
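
One caveat with plain substring matching is that short keywords can match inside longer words; for instance, `mu` also occurs inside the Indonesian word `musim`. The sketch below shows a whole-word variant based on regular expressions. `contains_keyword` is a hypothetical helper offered as a possible refinement, not the check used for the reported results.

```python
import re


def contains_keyword(text, keywords, whole_word=True):
    """Return True if any keyword occurs in text, optionally only as a whole word."""
    text = text.lower()
    for kw in keywords:
        kw = kw.lower()
        if whole_word:
            # \b anchors the match at word boundaries on both sides
            if re.search(rf"\b{re.escape(kw)}\b", text):
                return True
        elif kw in text:
            return True
    return False


preview = "Rashford dipinjamkan ke Barcelona pada musim panas lalu"
print(contains_keyword(preview, ["mu"], whole_word=False))  # substring match inside "musim"
print(contains_keyword(preview, ["mu"], whole_word=True))   # no standalone "mu" in the text
```

Whole-word matching makes the hit count stricter, so scores computed this way may come out lower than with substring matching.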

In [20]:
# Define test cases with expected keywords
test_queries = [
    {
        'query': 'Ronaldo gol Piala Dunia',
        'expected_keywords': ['ronaldo', 'cristiano', 'cr7', 'gol', 'piala dunia']
    },
    {
        'query': 'Liga Inggris klasemen',
        'expected_keywords': ['premier league', 'liga inggris', 'klasemen', 'epl', 'inggris']
    },
    {
        'query': 'Timnas Indonesia pelatih',
        'expected_keywords': ['timnas', 'indonesia', 'pelatih', 'garuda', 'pssi']
    },
    {
        'query': 'Barcelona transfer pemain',
        'expected_keywords': ['barcelona', 'barca', 'transfer', 'beli', 'blaugrana']
    },
    {
        'query': 'MotoGP balapan juara',
        'expected_keywords': ['motogp', 'motor', 'juara', 'race', 'gp', 'balapan']
    },
    {
        'query': 'Liga Italia Serie A',
        'expected_keywords': ['serie a', 'liga italia', 'italia', 'inter', 'milan', 'juventus']
    },
    {
        'query': 'Manchester United masalah',
        'expected_keywords': ['manchester', 'united', 'mu', 'man utd', 'old trafford']
    }
]

# Run evaluation
eval_results = evaluate_search_quality(test_queries, top_k=5)

# Display results
print("📊 Quantitative Evaluation Results")
print("=" * 90)
print(eval_results.to_string(index=False))
print("\n" + "=" * 90)
print(f"📈 Average Precision@5: {eval_results['precision@5'].mean():.2%}")
print(f"📈 Average Top Score: {eval_results['top_score'].mean():.4f}")
print(f"📈 Average Score: {eval_results['avg_score'].mean():.4f}")

📊 Quantitative Evaluation Results
                    query                         expected_keywords  hits@5  precision@5  top_score  avg_score
  Ronaldo gol Piala Dunia                ronaldo, cristiano, cr7...       5          1.0   0.856434   0.844120
    Liga Inggris klasemen premier league, liga inggris, klasemen...       5          1.0   0.851795   0.826338
 Timnas Indonesia pelatih             timnas, indonesia, pelatih...       4          0.8   0.836510   0.827543
Barcelona transfer pemain             barcelona, barca, transfer...       5          1.0   0.822722   0.821603
     MotoGP balapan juara                   motogp, motor, juara...       5          1.0   0.838247   0.834071
      Liga Italia Serie A           serie a, liga italia, italia...       3          0.6   0.828959   0.809120
Manchester United masalah                 manchester, united, mu...       5          1.0   0.836243   0.830848

📈 Average Precision@5: 91.43%
📈 Average Top Score: 0.8387
📈 Avera

## Summary Statistics

Overview of the semantic search system.

In [21]:
# Reuse the paths the index and embeddings were saved to in Steps 2 and 3
INDEX_PATH = "faiss/faiss_index.bin"
EMBEDDINGS_PATH = "embedding/article_embeddings.npy"

print("📈 Semantic Search System Statistics")
print("=" * 50)
print(f"📚 Total documents indexed: {len(df)}")
print(f"📐 Embedding dimension: {embeddings.shape[1]}")
print(f"🤖 Model: intfloat/multilingual-e5-base")
print(f"🔍 Index type: FAISS Flat Inner Product")
print(f"\n💾 Storage:")
print(f"   - Embeddings: {embeddings.nbytes / 1024 / 1024:.2f} MB")
if os.path.exists(INDEX_PATH):
    print(f"   - FAISS index: {os.path.getsize(INDEX_PATH) / 1024 / 1024:.2f} MB")
print(f"\n📊 Evaluation:")
print(f"   - Average Precision@5: {eval_results['precision@5'].mean():.2%}")
print(f"   - Best performing query: {eval_results.loc[eval_results['precision@5'].idxmax(), 'query']}")
print(f"   - Worst performing query: {eval_results.loc[eval_results['precision@5'].idxmin(), 'query']}")

📈 Semantic Search System Statistics
📚 Total documents indexed: 396
📐 Embedding dimension: 768
🤖 Model: intfloat/multilingual-e5-base
🔍 Index type: FAISS Flat Inner Product

💾 Storage:
   - Embeddings: 1.16 MB

📊 Evaluation:
   - Average Precision@5: 91.43%
   - Best performing query: Ronaldo gol Piala Dunia
   - Worst performing query: Liga Italia Serie A
