# BM25 Search Engine - Vietnamese Football News

## 📋 Purpose
This notebook implements the BM25 ranking algorithm to search and rank documents from a Vietnamese football news dataset.

## 🎯 Main features:
1. Load and preprocess documents from JSON files
2. Build a BM25 index with Vietnamese text processing
3. Run search queries with ranking scores
4. Display detailed results (title, snippet, score, document info)
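
As a rough illustration of what BM25 computes, here is a minimal from-scratch sketch. It is not the code used by this notebook, which relies on `rank_bm25`. The function name `bm25_scores`, the toy corpus, and the non-negative IDF variant are assumptions made for the example; `rank_bm25` treats IDF somewhat differently, so absolute scores will not match.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption of this sketch)
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates with k1; b controls length normalization
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical, already-tokenized toy corpus
toy_corpus = [
    ["đội_tuyển", "việt_nam", "thắng"],
    ["v_league", "khai_mạc"],
    ["việt_nam", "việt_nam", "vô_địch"],
]
scores = bm25_scores(["việt_nam"], toy_corpus)
print(scores)
```

The third document mentions the query term twice and has the same length as the first, so it ranks highest. The second document does not contain the term and scores zero.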

## 1. Import Libraries

In [1]:
import os
import json
import re
from collections import Counter
from tqdm import tqdm
import pandas as pd

# BM25
try:
    from rank_bm25 import BM25Okapi
    print("✓ rank_bm25 is installed")
except ImportError:
    print("✗ rank_bm25 is missing. Install it with: pip install rank-bm25")

# PyVi for Vietnamese tokenization
try:
    from pyvi import ViTokenizer
    PYVI_AVAILABLE = True
    print("✓ PyVi is installed")
except ImportError:
    print("✗ PyVi is not installed. Install it with: pip install pyvi")
    PYVI_AVAILABLE = False

✓ rank_bm25 is installed
✓ PyVi is installed


## 2. Vietnamese Text Processor Class

In [2]:
class VietnameseTextProcessor:
    """Vietnamese text processing for BM25"""
    
    def __init__(self):
        # Vietnamese stopwords
        self.stop_words = set([
            'và', 'của', 'trong', 'với', 'là', 'có', 'được', 'cho', 'từ', 'một', 'các',
            'để', 'không', 'sẽ', 'đã', 'về', 'hay', 'theo', 'như', 'cũng', 'này', 'đó',
            'khi', 'những', 'tại', 'sau', 'bị', 'giữa', 'trên', 'dưới', 'ngoài',
            'thì', 'nhưng', 'mà', 'hoặc', 'nếu', 'vì', 'do', 'nên', 'rồi', 'còn', 'đều',
            'chỉ', 'việc', 'người', 'lại', 'đây', 'đấy', 'ở', 'ra', 'vào', 'lên', 'xuống'
        ])
    
    def clean_text(self, text):
        """Clean and normalize Vietnamese text"""
        if not text:
            return ""
        
        # Remove special characters, keep Vietnamese letters
        text = re.sub(r'\s+', ' ', text)
        text = re.sub(r'[^\w\sàáảãạăắằẳẵặâấầẩẫậèéẻẽẹêếềểễệìíỉĩịòóỏõọôốồổỗộơớờởỡợùúủũụưứừửữựỳýỷỹỵđĐ]', ' ', text)
        text = text.lower()
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    
    def tokenize(self, text):
        """Vietnamese word segmentation"""
        if PYVI_AVAILABLE:
            try:
                return ViTokenizer.tokenize(text).split()
            except Exception:
                # Fall back to whitespace splitting if PyVi fails
                pass
        return text.split()
    
    def remove_stopwords(self, tokens):
        """Remove stopwords and single-character tokens"""
        return [token for token in tokens if token not in self.stop_words and len(token) > 1]
    
    def preprocess(self, text):
        """Full processing pipeline: clean, tokenize, filter"""
        cleaned = self.clean_text(text)
        tokens = self.tokenize(cleaned)
        filtered = self.remove_stopwords(tokens)
        return filtered

print("✓ VietnameseTextProcessor class is defined")

✓ VietnameseTextProcessor class is defined
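
To see what the pipeline does to a single sentence, here is a standalone sketch of the same three steps (clean, tokenize, filter) using only whitespace splitting, which is the fallback path when PyVi is unavailable. The function `preprocess_sketch`, the three-word stopword subset, and the sample sentence are hypothetical. The NFC normalization step is an addition that the class above does not perform; it is included because decomposed accents would otherwise be split off by the `[^\w\s]` filter.

```python
import re
import unicodedata

# Tiny illustrative subset of the stopword list (hypothetical selection)
STOP_WORDS = {unicodedata.normalize("NFC", w) for w in ("và", "của", "là")}

def preprocess_sketch(text):
    """Clean, whitespace-tokenize, and filter a Vietnamese string."""
    # Assumption: compose accents first so each letter is a single code point
    text = unicodedata.normalize("NFC", text)
    # Replace punctuation with spaces, then lowercase
    text = re.sub(r"[^\w\s]", " ", text).lower()
    tokens = text.split()
    # Drop stopwords and single-character tokens, as remove_stopwords does
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

tokens = preprocess_sketch("V-League 2024: Quang Hải là cầu thủ của Hà Nội FC!")
print(tokens)
# ['league', '2024', 'quang', 'hải', 'cầu', 'thủ', 'hà', 'nội', 'fc']
```

Note that the hyphen in "V-League" becomes a space and the lone "v" is then dropped by the length filter, so a query such as "v-league 2024" is effectively matched on "league" and "2024". With PyVi enabled, multi-syllable words are joined with underscores instead (for example `cầu_thủ`), so the token list would differ.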


## 3. BM25 Search Engine Class

In [None]:
class BM25SearchEngine:
    """BM25 search engine for Vietnamese football news"""
    
    def __init__(self):
        self.processor = VietnameseTextProcessor()
        self.bm25 = None
        self.documents = []
        self.corpus_tokens = []
        
    def load_data(self, json_files=None):
        """Load documents from JSON files"""
        if json_files is None:
            json_files = [
                "vnexpressT_bongda_part1.json",
                "vnexpressT_bongda_part2.json",
                "vnexpressT_bongda_part3.json",
                "vnexpressT_bongda_part4.json"
            ]
        
        print("📥 Loading documents from JSON files...")
        for file_path in json_files:
            if os.path.exists(file_path):
                with open(file_path, 'r', encoding='utf-8') as f:
                    try:
                        data = json.load(f)
                        if isinstance(data, list):
                            self.documents.extend(data)
                            print(f"  ✓ Loaded {file_path}: {len(data)} documents")
                        else:
                            # Only a top-level JSON list is treated as documents
                            print(f"  ✗ Skipped {file_path}: expected a JSON list")
                    except Exception as e:
                        print(f"  ✗ Error reading {file_path}: {e}")
            else:
                print(f"  ✗ File not found: {file_path}")
        
        print(f"\n✓ Total documents loaded: {len(self.documents)}")
        return self.documents
    
    def build_index(self):
        """Build the BM25 index from the loaded documents"""
        if not self.documents:
            print("✗ No documents to index. Load data first!")
            return
        
        print("\n Building BM25 index...")
        self.corpus_tokens = []
        
        for doc in tqdm(self.documents, desc="Processing documents"):
            # Combine title and content
            title = doc.get('title', '')
            content = doc.get('content', '')
            full_text = f"{title} {content}"
            
            # Tokenize and preprocess
            tokens = self.processor.preprocess(full_text)
            self.corpus_tokens.append(tokens)
        
        # Create BM25 index
        self.bm25 = BM25Okapi(self.corpus_tokens)
        print(f"✓ BM25 index built with {len(self.corpus_tokens)} documents\n")
    
    def search(self, query, top_k=10):
        """Search with BM25 ranking"""
        if self.bm25 is None:
            print("✗ BM25 index not built. Run build_index() first!")
            return []
        
        # Preprocess query
        query_tokens = self.processor.preprocess(query)
        
        # Get BM25 scores
        scores = self.bm25.get_scores(query_tokens)
        
        # Get top-k results
        top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        
        results = []
        for idx in top_indices:
            results.append({
                'rank': len(results) + 1,
                'doc_index': idx,
                'score': scores[idx],
                'document': self.documents[idx]
            })
        
        return results
    
    def display_results(self, results, show_content=True, max_content_length=300):
        """Display search results"""
        if not results:
            print("❌ No results found!")
            return
        
        print(f"\n{'='*100}")
        print(f"FOUND {len(results)} RESULTS")
        print(f"{'='*100}\n")
        
        for result in results:
            rank = result['rank']
            score = result['score']
            doc = result['document']
            
            title = doc.get('title', 'No title')
            url = doc.get('url', 'No URL')
            date = doc.get('date', 'No date')
            author = doc.get('author', 'Unknown')
            content = doc.get('content', '')
            
            # Create content snippet
            if content and len(content) > max_content_length:
                snippet = content[:max_content_length] + "..."
            else:
                snippet = content
            
            print(f"[{rank}]  SCORE: {score:.4f}")
            print(f" Title: {title}")
            print(f" Date: {date} | ✍️ Author: {author}")
            print(f" URL: {url}")
            
            if show_content:
                print(f" Content: {snippet}")
            
            print(f"{'-'*100}\n")
    
    def search_and_display(self, query, top_k=10, show_content=True, max_content_length=300):
        """Run a search and display the results immediately"""
        print(f"\n SEARCHING FOR: \"{query}\"\n")
        results = self.search(query, top_k)
        self.display_results(results, show_content, max_content_length)
        return results
    
    def get_statistics(self):
        """Print corpus statistics"""
        if not self.documents:
            print("No documents loaded!")
            return
        
        total_docs = len(self.documents)
        avg_tokens = sum(len(tokens) for tokens in self.corpus_tokens) / total_docs if self.corpus_tokens else 0
        
        print(f"\n CORPUS STATISTICS")
        print(f"{'='*50}")
        print(f"Total documents: {total_docs}")
        print(f"Average tokens per document: {avg_tokens:.1f}")
        
        if self.corpus_tokens:
            all_tokens = [token for tokens in self.corpus_tokens for token in tokens]
            vocab_size = len(set(all_tokens))
            print(f"Vocabulary size: {vocab_size}")
            
            # Top 10 most common words
            counter = Counter(all_tokens)
            print(f"\nTop 10 most common words:")
            for word, count in counter.most_common(10):
                print(f"  {word}: {count}")

print("✓ BM25SearchEngine class is defined")

✓ BM25SearchEngine class is defined
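
One property of `search` worth noting: it sorts every score and slices the first `top_k`, so documents with a score of zero can be returned when fewer than `top_k` documents match, and `display_results` never reaches its "No results found" branch. A possible alternative, sketched below with the standard-library `heapq` module, keeps only positive scores and avoids sorting the whole list. The helper name `top_k_nonzero` is hypothetical and not part of the class above.

```python
import heapq

def top_k_nonzero(scores, k=10):
    """Return (index, score) pairs for the k best documents with score > 0."""
    candidates = ((i, s) for i, s in enumerate(scores) if s > 0)
    # nlargest keeps only k items in memory instead of sorting everything
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

print(top_k_nonzero([0.0, 2.5, 0.0, 4.1, 1.2], k=3))  # [(3, 4.1), (1, 2.5), (4, 1.2)]
print(top_k_nonzero([0.0, 0.0], k=3))                 # []
```

Whether zero-score documents should be hidden is a design choice; returning them keeps the result count constant, while filtering them makes an empty result meaningful.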


## 4. Initialize & Load Data

In [4]:
# Initialize the BM25 search engine
bm25_engine = BM25SearchEngine()

# Load data from JSON files
documents = bm25_engine.load_data()

print(f"\n✅ Loaded {len(documents)} documents successfully!")

📥 Loading documents from JSON files...
  ✓ Loaded vnexpressT_bongda_part1.json: 473 documents
  ✓ Loaded vnexpressT_bongda_part2.json: 488 documents
  ✓ Loaded vnexpressT_bongda_part3.json: 487 documents
  ✓ Loaded vnexpressT_bongda_part4.json: 308 documents

✓ Total documents loaded: 1756

✅ Loaded 1756 documents successfully!
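
The loader assumes each file holds a top-level JSON list of article objects, and the rest of the notebook reads the keys `title`, `content`, `url`, `date`, and `author` from each one. The sketch below shows that shape with a throwaway file; the helper `load_documents`, the file name, and the sample record are hypothetical and only illustrate the expected layout.

```python
import json
import os
import tempfile

def load_documents(path):
    """Return the article dicts stored in a JSON file, or [] if the shape is unexpected."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        return []
    # Keep only dict entries so later doc.get(...) calls are safe
    return [doc for doc in data if isinstance(doc, dict)]

# Hypothetical record with the keys the notebook reads
sample = [{
    "title": "Sample title",
    "content": "Sample content",
    "url": "https://example.com/sample.html",
    "date": "01/01/2024",
    "author": "Sample author",
}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_part.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    docs = load_documents(path)

print(len(docs), docs[0]["title"])
```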


## 5. Build BM25 Index

In [5]:
# Build BM25 index
bm25_engine.build_index()

# Show statistics
bm25_engine.get_statistics()


🏗️ Building BM25 index...

Processing documents: 100%|██████████| 1756/1756 [01:03<00:00, 27.76it/s]

✓ BM25 index built with 1756 documents


📊 CORPUS STATISTICS
Total documents: 1756
Average tokens per document: 424.8
Vocabulary size: 15607

Top 10 most common words:
  nam: 8293
  đội: 7750
  trận: 7581
  hai: 6792
  việt: 6762
  cầu_thủ: 6677
  bóng: 5719
  hlv: 5612
  league: 5192
  năm: 5190
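
Very frequent tokens such as `nam` or `việt` carry little weight in BM25 because the inverse document frequency (IDF) shrinks as a term appears in more documents. The sketch below illustrates this for a corpus of 1756 documents. The document frequencies are hypothetical (the statistics above report total occurrences, not document counts), and the IDF formula is a common non-negative variant that may differ from the one `rank_bm25` applies.

```python
import math

N_DOCS = 1756  # corpus size reported above

def idf(doc_freq, n_docs=N_DOCS):
    """Non-negative BM25-style IDF (assumed variant for illustration)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

# Hypothetical document frequencies, from very common to rare
for doc_freq in (1500, 500, 50, 5):
    print(f"df={doc_freq:>4}  idf={idf(doc_freq):.3f}")
```

A query made of common words therefore tends to produce lower absolute scores than a query containing rarer terms, so scores are only comparable within a single query.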


## 6. Search Queries

### 6.1. Example Search Query 1: "Bóng đá Việt Nam"

In [6]:
# Search query about Vietnamese football
results = bm25_engine.search_and_display(
    query="bóng đá việt nam",
    top_k=5,
    show_content=True,
    max_content_length=200
)


🔍 SEARCHING FOR: "bóng đá việt nam"


FOUND 5 RESULTS

[1] 🏆 SCORE: 3.8253
📰 Title: Honda tặng xe cho hai đội tuyển bóng đá Việt Nam
📅 Date: Thứ năm, 12/12/2019, 15:45 (GMT+7) | ✍️ Author: Tuấn Vũ
🔗 URL: https://vnexpress.net/honda-tang-xe-cho-hai-doi-tuyen-bong-da-viet-nam-4026276.html
📄 Content: Tối 11/12 vừa qua, Honda Việt Nam đã trao phần thưởng mừng chiến thắng của đội tuyển bóng đá nữ quốc gia và đội tuyển bóng đá nam U22 Việt Nam, tại Văn phòng Chính phủ. Thủ tướng Nguyễn Xuân Phúc cùng...
----------------------------------------------------------------------------------------------------

[2] 🏆 SCORE: 3.6570
📰 Title: Báo chí Hàn Quốc: ‘Đây là bóng đá Việt Nam sao?’
📅 Date: Thứ ba, 23/1/2018, 19:32 (GMT+7) | ✍️ Author: Xuân Bình
🔗 URL: https://vnexpress.net/bao-chi-han-quoc-day-la-bong-da-viet-nam-sao-3702883.html
📄 Content: Với ti

### 6.2. Example Search Query 2: "Quang Hải cầu thủ"

In [7]:
# Search query about the player Quang Hải
results = bm25_engine.search_and_display(
    query="quang hải cầu thủ",
    top_k=5,
    show_content=True,
    max_content_length=200
)


🔍 SEARCHING FOR: "quang hải cầu thủ"


FOUND 5 RESULTS

[1] 🏆 SCORE: 5.4361
📰 Title: Quang Hải: ‘Mong đến ngày có cầu thủ Việt Nam đá ở Ngoại hạng Anh’
📅 Date: Chủ nhật, 16/9/2018, 16:42 (GMT+7) | ✍️ Author: Lâm Thỏa
🔗 URL: https://vnexpress.net/quang-hai-mong-den-ngay-co-cau-thu-viet-nam-da-o-ngoai-hang-anh-3809841.html
📄 Content: Quang Hải cùng cựu hậu vệ Joleon Lescott của Man City và Cup vô địch Ngoại hạng Anh, League Cup. “Tôi may mắn là cầu thủ Việt Nam đầu tiên được chạm tay vào chiếc Cup vô địch Ngoại hạng Anh và League ...
----------------------------------------------------------------------------------------------------

[2] 🏆 SCORE: 5.2683
📰 Title: Khi Lee Nguyễn vẫn là ngôi sao V-League
📅 Date: Thứ ba, 23/3/2021, 12:33 (GMT+7) | ✍️ Author: Song Việt
🔗 URL: https://vnexpress.net/khi-lee-nguyen-van-la-ngoi-sao-v-league-4252605.html
📄

### 6.3. Example Search Query 3: "V-League 2024"

In [8]:
# Search query about V-League
results = bm25_engine.search_and_display(
    query="v-league 2024",
    top_k=5,
    show_content=True,
    max_content_length=200
)


🔍 SEARCHING FOR: "v-league 2024"


FOUND 5 RESULTS

[1] 🏆 SCORE: 4.4433
📰 Title: V-League 2024-2025 chỉ tạm dừng vì FIFA days và AFF Cup
📅 Date: Thứ năm, 11/4/2024, 06:07 (GMT+7) | ✍️ Author: Hiếu Lương
🔗 URL: https://vnexpress.net/v-league-2024-2025-chi-tam-dung-vi-fifa-days-va-aff-cup-4732862.html
📄 Content: Đội tuyển Việt Nam sẽ có năm đợt FIFA days, gồm từ ngày 2/9 đến 10/9, ngày 7/10 đến 15/10, ngày 11/11 đến 19/11, ngày 17/3/2025 đến 25/3/2025 và ngày 2/6/2025 đến 10/6/2025. Trong khi đó, AFF Cup 2024...
----------------------------------------------------------------------------------------------------

[2] 🏆 SCORE: 4.4364
📰 Title: V-League ra tối hậu thư cho Quảng Nam
📅 Date: Thứ tư, 23/7/2025, 14:56 (GMT+7) | ✍️ Author: Hiếu Lương
🔗 URL: https://vnexpress.net/v-league-ra-toi-hau-thu-cho-quang-nam-4917967.html
📄 Content: Liên tiếp trong hai ngày 22/7 và 23/7, đ

### 6.4. Example Search Query 4: "HLV Park Hang-seo"

In [9]:
# Search query about coach (HLV) Park Hang-seo
results = bm25_engine.search_and_display(
    query="hlv park hang seo",
    top_k=5,
    show_content=True,
    max_content_length=200
)


🔍 SEARCHING FOR: "hlv park hang seo"


FOUND 5 RESULTS

[1] 🏆 SCORE: 11.5520
📰 Title: HLV Park Hang-seo nói gì với các cầu thủ Việt Nam sau trận thắng Iraq
📅 Date: Chủ nhật, 21/1/2018, 20:22 (GMT+7) | ✍️ Author: Lâm Thỏa
🔗 URL: https://vnexpress.net/hlv-park-hang-seo-noi-gi-voi-cac-cau-thu-viet-nam-sau-tran-thang-iraq-3701712.html
📄 Content: HLV Park Hang-seo cùng U23 Việt Nam viết chuyện cổ tích ở vòng chung kết U23 châu Á. Ảnh: Anh Khoa “Một suất ở bán kết đã thuộc về các bạn. Các bạn xứng đáng. Không có may mắn nào lập được kỳ tích đâu...
----------------------------------------------------------------------------------------------------

[2] 🏆 SCORE: 11.5289
📰 Title: Cầu thủ Việt Nam tìm lại nụ cười sau trận thua Iraq
📅 Date: Thứ tư, 9/1/2019, 16:51 (GMT+7) | ✍️ Author: Ảnh:Anh Khoa
🔗 URL: https://vnexpress.net/cau-thu-viet-nam-tim-lai-nu-c

### 6.5. Custom Search - Try Your Own Query

In [10]:
# Try a custom query
custom_query = "đội tuyển việt nam"  # Change this query as you like

results = bm25_engine.search_and_display(
    query=custom_query,
    top_k=10,  # Number of results to display
    show_content=True,  # True to show the content snippet, False to hide it
    max_content_length=300  # Maximum length of the content snippet
)


🔍 SEARCHING FOR: "đội tuyển việt nam"


FOUND 10 RESULTS

[1] 🏆 SCORE: 4.0826
📰 Title: Honda tặng xe cho hai đội tuyển bóng đá Việt Nam
📅 Date: Thứ năm, 12/12/2019, 15:45 (GMT+7) | ✍️ Author: Tuấn Vũ
🔗 URL: https://vnexpress.net/honda-tang-xe-cho-hai-doi-tuyen-bong-da-viet-nam-4026276.html
📄 Content: Tối 11/12 vừa qua, Honda Việt Nam đã trao phần thưởng mừng chiến thắng của đội tuyển bóng đá nữ quốc gia và đội tuyển bóng đá nam U22 Việt Nam, tại Văn phòng Chính phủ. Thủ tướng Nguyễn Xuân Phúc cùng đại diện Honda Việt Nam trao tặng dàn xe Honda Lead cho đội tuyển bóng đá nữ. Theo đó, đội tuyển bó...
----------------------------------------------------------------------------------------------------

[2] 🏆 SCORE: 3.8101
📰 Title: HLV Park: 'Tôi muốn giúp bóng đá Việt Nam mạnh hơn'
📅 Date: Thứ ba, 5/11/2019, 18:00 (GMT+7) | ✍️ A

## 7. Advanced: Compare Multiple Queries

In [11]:
# Compare results across multiple queries
test_queries = [
    "công phượng bàn thắng",
    "asean cup 2024",
    "thủ môn đặng văn lâm",
    "hà nội fc vô địch"
]

print("\n" + "="*100)
print("COMPARING MULTIPLE QUERIES")
print("="*100)

for query in test_queries:
    results = bm25_engine.search(query, top_k=3)
    print(f"\n🔍 Query: \"{query}\"")
    print(f"Top 3 results:")
    for r in results:
        print(f"  [{r['rank']}] Score: {r['score']:.4f} - {r['document'].get('title', 'No title')[:80]}")
    print("-" * 100)


COMPARING MULTIPLE QUERIES

🔍 Query: "công phượng bàn thắng"
Top 3 results:
  [1] Score: 11.4124 - HAGL ghi 36% số bàn phạt đền ở V-League 2021
  [2] Score: 11.3702 - Vòng 6 V-League và sự khác biệt từ Công Phượng
  [3] Score: 10.6776 - Công Phượng rời tuyển Việt Nam
----------------------------------------------------------------------------------------------------

🔍 Query: "asean cup 2024"
Top 3 results:
  [1] Score: 4.7703 - Việt Nam thắng dễ trận ra quân vòng loại Asian Cup
  [2] Score: 4.7473 - 5 cầu thủ đáng xem nhất bảng B ASEAN Cup 2024
  [3] Score: 4.6883 - ASEAN Cup có thể đá vào giữa năm
----------------------------------------------------------------------------------------------------

🔍 Query: "thủ môn đặng văn lâm"
Top 3 results:
  [1] Score: 10.9399 - Đặng Văn Lâm làm đội trưởng ở V-League 2023
  [2] Score: 10.6954 - Đặng Văn Lâm: 'Gặp tuyển Việt Nam không

## 8. Export Results to DataFrame

In [None]:
# Export search results to a DataFrame for easier analysis
def results_to_dataframe(results):
    """Convert search results to pandas DataFrame"""
    data = []
    for r in results:
        doc = r['document']
        data.append({
            'Rank': r['rank'],
            'Score': round(r['score'], 4),
            'Title': doc.get('title', 'No title'),
            'Date': doc.get('date', 'No date'),
            'Author': doc.get('author', 'Unknown'),
            'URL': doc.get('url', 'No URL')
        })
    return pd.DataFrame(data)

query = "bóng đá việt nam"
results = bm25_engine.search(query, top_k=10)
df_results = results_to_dataframe(results)

print(f"\n🔍 Query: \"{query}\"\n")
print(df_results.to_string(index=False))



🔍 Query: "bóng đá việt nam"

 Rank  Score                                                   Title                               Date               Author                                                                                  URL
    1 3.8253        Honda tặng xe cho hai đội tuyển bóng đá Việt Nam Thứ năm, 12/12/2019, 15:45 (GMT+7)              Tuấn Vũ https://vnexpress.net/honda-tang-xe-cho-hai-doi-tuyen-bong-da-viet-nam-4026276.html
    2 3.6570        Báo chí Hàn Quốc: ‘Đây là bóng đá Việt Nam sao?’   Thứ ba, 23/1/2018, 19:32 (GMT+7)            Xuân Bình      https://vnexpress.net/bao-chi-han-quoc-day-la-bong-da-viet-nam-sao-3702883.html
    3 3.6058 Nhật xuất bản truyện tranh đầu tiên về bóng đá Việt Nam   Thứ tư, 30/3/2022, 13:56 (GMT+7) Hồng Hạnh(TheoTTXVN) https://vnexpress.net/nhat-xuat-ban-truyen-tranh-dau-tien-ve-bong-da

## 9. Performance Analysis

In [13]:
import time

# Measure search time
test_queries = [
    "bóng đá việt nam",
    "quang hải",
    "v-league",
    "đội tuyển",
    "park hang seo"
]

print("\n⏱️ PERFORMANCE ANALYSIS")
print("="*70)

total_time = 0
for query in test_queries:
    # perf_counter() is a monotonic, high-resolution clock intended for timing
    start_time = time.perf_counter()
    results = bm25_engine.search(query, top_k=10)
    elapsed_time = time.perf_counter() - start_time
    total_time += elapsed_time
    
    print(f"Query: '{query:30s}' | Time: {elapsed_time*1000:.2f}ms | Results: {len(results)}")

avg_time = total_time / len(test_queries)
print("="*70)
print(f"Average search time: {avg_time*1000:.2f}ms")
print(f"Total queries tested: {len(test_queries)}")


⏱️ PERFORMANCE ANALYSIS
Query: 'bóng đá việt nam              ' | Time: 10.65ms | Results: 10
Query: 'quang hải                     ' | Time: 6.00ms | Results: 10
Query: 'v-league                      ' | Time: 5.00ms | Results: 10
Query: 'đội tuyển                     ' | Time: 4.22ms | Results: 10
Query: 'park hang seo                 ' | Time: 3.19ms | Results: 10
Average search time: 5.81ms
Total queries tested: 5
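
Each query above is timed once, and the first one is noticeably slower than the rest (10.65ms against 3 to 6ms), which may reflect warm-up effects rather than the query itself. A more stable measurement repeats each call and reports the median. The sketch below shows the idea with a stand-in function; `time_call` and `dummy_search` are hypothetical names, and in the notebook `dummy_search` would be replaced by `bm25_engine.search`.

```python
import statistics
import time

def time_call(fn, *args, repeats=20, **kwargs):
    """Return the median wall-clock time of fn(*args, **kwargs) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

def dummy_search(query, top_k=10):
    """Stand-in for a real search call: sorts 1000 pseudo-scores and slices top_k."""
    return sorted(range(1000), key=lambda i: (i * 7919) % 1000, reverse=True)[:top_k]

median_ms = time_call(dummy_search, "quang hải", top_k=10)
print(f"Median over 20 runs: {median_ms:.3f}ms")
```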


## 11. DEMO - Find a Keyword in Documents

### Usage: run the cell and type the `KEYWORD` when prompted!

In [None]:
#  DEMO: FIND A KEYWORD IN DOCUMENTS
# ========================================

#  ENTER THE KEYWORD YOU WANT TO FIND:
KEYWORD = input("")  # Type the keyword when prompted; re-run the cell to change it

#  Number of results to show:
TOP_N = 5

# ========================================

print(f"\n{'='*100}")
print(f" Searching for keyword: '{KEYWORD}'")
print(f"{'='*100}\n")

keyword_lower = KEYWORD.lower()

# Find documents containing the keyword (case-insensitive substring match)
found_docs = []
for idx, doc in enumerate(bm25_engine.documents):
    title = doc.get('title', '')
    content = doc.get('content', '')
    full_text = f"{title} {content}".lower()
    
    if keyword_lower in full_text:
        count = full_text.count(keyword_lower)
        found_docs.append({
            'index': idx,
            'count': count,
            'doc': doc
        })

# Sort by number of occurrences
found_docs.sort(key=lambda x: x['count'], reverse=True)

# Display results
if found_docs:
    print(f"Found '{KEYWORD}' in {len(found_docs)} documents!\n")
    print(f"{'='*100}")
    print(f"TOP {min(TOP_N, len(found_docs))} DOCUMENTS CONTAINING '{KEYWORD}'")
    print(f"{'='*100}\n")
    
    for i, item in enumerate(found_docs[:TOP_N], 1):
        doc = item['doc']
        count = item['count']
        
        print(f"[{i}] Occurrences: {count}")
        print(f"Title: {doc.get('title', 'No title')}")
        print(f"Date: {doc.get('date', 'No date')}")
        
        # Show the text around the first occurrence of the keyword
        full_text = f"{doc.get('title', '')} {doc.get('content', '')}"
        pos = full_text.lower().find(keyword_lower)
        if pos != -1:
            start = max(0, pos - 80)
            end = min(len(full_text), pos + len(KEYWORD) + 80)
            snippet = full_text[start:end]
            if start > 0:
                snippet = "..." + snippet
            if end < len(full_text):
                snippet = snippet + "..."
            print(f"Snippet: {snippet}")
        
        print(f"{'-'*100}\n")
    
    # Summary
    total = sum(item['count'] for item in found_docs)
    print(f"SUMMARY:")
    print(f"  ✓ Found in {len(found_docs)} documents")
    print(f"  ✓ Total occurrences: {total}")
    print(f"  ✓ Average: {total/len(found_docs):.1f} occurrences/document")
else:
    print(f" '{KEYWORD}' was not found in any document!")

print(f"\n{'='*100}")


🔍 Searching for keyword: 'Mỹ Đình'

Found 'Mỹ Đình' in 157 documents!

TOP 5 DOCUMENTS CONTAINING 'Mỹ Đình'

[1] Occurrences: 10
Title: Thế khó của Thể Công với sân Mỹ Đình
Date: Thứ ba, 1/4/2025, 21:02 (GMT+7)
Snippet: Thế khó của Thể Công với sân Mỹ Đình Thể Công từng thuê sân Hàng Đẫy làm sân nhà từ năm 2019, cùng Hà Nội FC. Tới mù...
----------------------------------------------------------------------------------------------------

[2] Occurrences: 8
Title: CĐV khuấy động trước trận Việt Nam - Indonesia
Date: Thứ ba, 26/3/2024, 17:56 (GMT+7)
Snippet: ...sao vàng... ... và chiêng, trống chở người hâm mộ đi trên các tuyến phố gần sân Mỹ Đình. Bất mãn sau những màn trình diễn gần đây của thầy trò Philippe Troussier, lượn...
----------------------------------------------------------------------------------------------------

[3] 