# DeepCT + Conv-KNRM for Vietnamese Football Search

## 📋 Overview
Implementation of DeepCT (Deep Contextualized Term weighting) combined with Conv-KNRM (Convolutional Kernel-based Neural Ranking Model) for Vietnamese information retrieval.

### 🎯 Components:
1. **DeepCT**: Neural term weighting for document representation
2. **Conv-KNRM**: Convolutional neural ranking model
3. **Vietnamese Text Processing**: Tokenization, stopwords, embeddings
4. **Training Pipeline**: Query-document pairs with relevance labels

### 📊 Dataset:
- Vietnamese football news from VnExpress (1,756 documents in the run shown below)
- Queries generated from document titles
- BM25 baseline for comparison
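
The cells shown here do not include the BM25 baseline itself. As a point of reference, here is a minimal sketch assuming the Okapi BM25 formulation with the commonly used defaults `k1=1.5` and `b=0.75`. The function name `bm25_scores` and the toy documents are hypothetical, not taken from the notebook.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (illustrative tokens only)
docs = [
    ["quang_hải", "ghi", "bàn", "v-league"],
    ["hlv", "họp", "báo", "trận", "đấu"],
    ["quang_hải", "chấn_thương", "nghỉ", "thi_đấu"],
]
scores = bm25_scores(["quang_hải", "ghi"], docs)
```

The document matching both query terms scores highest, the one matching a single term comes second, and the document with no overlap scores zero.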

### 📝 Run Order:
**Run cells in this exact order:**
1. Import Libraries ✅
2. Vietnamese Text Processor ✅
3. Load Data ✅
4. Build Vocabulary ✅
5. Define Models (DeepCT → Conv-KNRM → Combined) ✅
6. Test Models ✅
7. Generate Training Data ✅
8. Train Model 🏋️
9. Search Engine 🔍

## 1. Import Libraries

In [1]:
import os
import json
import re
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
from collections import Counter, defaultdict
import pickle

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim

# Vietnamese text processing
try:
    from pyvi import ViTokenizer
    PYVI_AVAILABLE = True
    print("✓ PyVi available")
except ImportError:
    print("✗ PyVi not available. Install: pip install pyvi")
    PYVI_AVAILABLE = False

# Word embeddings
try:
    from gensim.models import Word2Vec, KeyedVectors
    GENSIM_AVAILABLE = True
    print("✓ Gensim available")
except ImportError:
    print("✗ Gensim not available. Install: pip install gensim")
    GENSIM_AVAILABLE = False

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

# Set random seed for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)

print("\n✅ All libraries imported successfully!")

✓ PyVi available
✗ Gensim not available. Install: pip install gensim
PyTorch version: 2.3.1+cpu
CUDA available: False

✅ All libraries imported successfully!


## 2. Vietnamese Text Processor

In [2]:
class VietnameseTextProcessor:
    """Vietnamese text processing for neural ranking"""
    
    def __init__(self):
        # Vietnamese stopwords
        self.stop_words = set([
            'và', 'của', 'trong', 'với', 'là', 'có', 'được', 'cho', 'từ', 'một', 'các',
            'để', 'không', 'sẽ', 'đã', 'về', 'hay', 'theo', 'như', 'cũng', 'này', 'đó',
            'khi', 'những', 'tại', 'sau', 'bị', 'giữa', 'trên', 'dưới', 'ngoài',
            'thì', 'nhưng', 'mà', 'hoặc', 'nếu', 'vì', 'do', 'nên', 'rồi', 'còn', 'đều',
            'chỉ', 'việc', 'người', 'lại', 'đây', 'đấy', 'ở', 'ra', 'vào', 'lên', 'xuống'
        ])
    
    def clean_text(self, text):
        """Clean and normalize Vietnamese text"""
        if not text:
            return ""
        
        # Remove extra spaces
        text = re.sub(r'\s+', ' ', text)
        # Keep Vietnamese characters, letters, numbers
        text = re.sub(r'[^\w\sàáảãạăắằẳẵặâấầẩẫậèéẻẽẹêếềểễệìíỉĩịòóỏõọôốồổỗộơớờởỡợùúủũụưứừửữựỳýỷỹỵđĐ]', ' ', text)
        text = text.lower()
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    
    def tokenize(self, text):
        """Tokenize Vietnamese text"""
        if PYVI_AVAILABLE:
            try:
                return ViTokenizer.tokenize(text).split()
            except Exception:
                pass  # fall back to whitespace splitting below
        return text.split()
    
    def remove_stopwords(self, tokens):
        """Remove stopwords"""
        return [token for token in tokens if token not in self.stop_words and len(token) > 1]
    
    def preprocess(self, text, remove_stop=True):
        """Full preprocessing pipeline"""
        cleaned = self.clean_text(text)
        tokens = self.tokenize(cleaned)
        if remove_stop:
            tokens = self.remove_stopwords(tokens)
        return tokens

processor = VietnameseTextProcessor()
print("✓ VietnameseTextProcessor initialized")

✓ VietnameseTextProcessor initialized
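
One case `clean_text` does not cover is Unicode normalization. Both `\w` and the explicit character list match only precomposed Vietnamese letters, so if text arrives in decomposed form (NFD), the combining tone marks are treated as punctuation and replaced with spaces. A minimal sketch, assuming it is acceptable to normalize to NFC before cleaning; `clean_text_nfc` is a hypothetical helper, not part of the notebook:

```python
import re
import unicodedata

def clean_text_nfc(text):
    """Normalize to NFC, then keep only word characters and whitespace."""
    if not text:
        return ""
    # Compose base letters and combining marks into single code points
    text = unicodedata.normalize("NFC", text)
    # For str patterns in Python 3, \w is Unicode-aware, so it already
    # matches precomposed Vietnamese letters
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).lower().strip()

# Simulate input that arrives in decomposed form
decomposed = unicodedata.normalize("NFD", "Việt Nam, vô địch!")
cleaned = clean_text_nfc(decomposed)
```

With the normalization step in place, the decomposed input cleans to the same string as its precomposed equivalent.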


## 3. Load Vietnamese Football Data

In [3]:
def load_documents(json_files=None):
    """Load documents from JSON files"""
    if json_files is None:
        json_files = [
            "../data/raw/vnexpressT_bongda_part1.json",
            "../data/raw/vnexpressT_bongda_part2.json",
            "../data/raw/vnexpressT_bongda_part3.json",
            "../data/raw/vnexpressT_bongda_part4.json"
        ]
    
    documents = []
    print("📂 Loading documents from JSON files...")
    
    for file_path in json_files:
        if os.path.exists(file_path):
            with open(file_path, 'r', encoding='utf-8') as f:
                try:
                    data = json.load(f)
                    if isinstance(data, list):
                        documents.extend(data)
                        print(f"  ✓ Loaded {file_path}: {len(data)} documents")
                    else:
                        # Only JSON arrays of documents are supported
                        print(f"  ✗ Skipped {file_path}: expected a JSON list")
                except Exception as e:
                    print(f"  ✗ Error reading {file_path}: {e}")
        else:
            print(f"  ✗ File not found: {file_path}")
    
    print(f"\n✓ Total documents loaded: {len(documents)}")
    return documents

# Load data
documents = load_documents()

# Show sample document
if documents:
    print("\n📄 Sample document:")
    sample = documents[0]
    print(f"Title: {sample.get('title', 'N/A')[:100]}")
    print(f"Content: {sample.get('content', 'N/A')[:200]}...")
    print(f"Date: {sample.get('date', 'N/A')}")
    print(f"Author: {sample.get('author', 'N/A')}")

📂 Loading documents from JSON files...
  ✓ Loaded ../data/raw/vnexpressT_bongda_part1.json: 473 documents
  ✓ Loaded ../data/raw/vnexpressT_bongda_part2.json: 488 documents
  ✓ Loaded ../data/raw/vnexpressT_bongda_part3.json: 487 documents
  ✓ Loaded ../data/raw/vnexpressT_bongda_part4.json: 308 documents

✓ Total documents loaded: 1756

📄 Sample document:
Title: 'Ảo tưởng bóng đá Việt Nam vươn tầm khi giành vé dự VCK U23 châu Á'
Content: U23 Việt Nam vừa giành vé dự Vòng chung kết U23 châu Á 2026 sau trận thắng 1-0 trước Yemen, đánh dấu lần thứ sáu liên tiếp góp mặt ở đấu trường châu lục này. Người hâm mộ vỡ òa, truyền thông rộn ràng ...
Date: Thứ tư, 10/9/2025, 16:30 (GMT+7)
Author: Thu Sang


## 4. Build Vocabulary & Word Embeddings

In [4]:
class Vocabulary:
    """Build vocabulary from corpus"""
    
    def __init__(self, min_freq=2):
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
        self.word_freq = Counter()
        self.min_freq = min_freq
        
    def build_vocab(self, documents, processor):
        """Build vocabulary from documents"""
        print("\n📚 Building vocabulary...")
        
        # Count word frequencies
        for doc in tqdm(documents, desc="Counting words"):
            title = doc.get('title', '')
            content = doc.get('content', '')
            full_text = f"{title} {content}"
            tokens = processor.preprocess(full_text)
            self.word_freq.update(tokens)
        
        # Add words to vocabulary
        idx = 2  # Start after PAD and UNK
        for word, freq in self.word_freq.items():
            if freq >= self.min_freq:
                self.word2idx[word] = idx
                self.idx2word[idx] = word
                idx += 1
        
        print(f"✓ Vocabulary size: {len(self.word2idx)}")
        print(f"  - Total unique words: {len(self.word_freq)}")
        print(f"  - Words with freq >= {self.min_freq}: {len(self.word2idx) - 2}")
        print(f"  - Top 10 words: {self.word_freq.most_common(10)}")
        
        return self
    
    def encode(self, tokens, max_len=None):
        """Convert tokens to indices"""
        indices = [self.word2idx.get(token, 1) for token in tokens]  # 1 = UNK
        if max_len:
            if len(indices) < max_len:
                indices += [0] * (max_len - len(indices))  # 0 = PAD
            else:
                indices = indices[:max_len]
        return indices
    
    def decode(self, indices):
        """Convert indices back to tokens"""
        return [self.idx2word.get(idx, '<UNK>') for idx in indices]

# Build vocabulary
vocab = Vocabulary(min_freq=2)
vocab.build_vocab(documents, processor)

vocab_size = len(vocab.word2idx)
print(f"\n✅ Vocabulary ready: {vocab_size} words")

📚 Building vocabulary...

Counting words: 100%|██████████| 1756/1756 [00:54<00:00, 32.17it/s]

✓ Vocabulary size: 10369
  - Total unique words: 15607
  - Words with freq >= 2: 10367
  - Top 10 words: [('nam', 8293), ('đội', 7750), ('trận', 7581), ('hai', 6792), ('việt', 6762), ('cầu_thủ', 6677), ('bóng', 5719), ('hlv', 5612), ('league', 5192), ('năm', 5190)]

✅ Vocabulary ready: 10369 words
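
The `encode` method maps out-of-vocabulary tokens to index 1 (`<UNK>`) and pads short sequences with index 0 (`<PAD>`). A self-contained sketch of that behaviour with a toy vocabulary; the word list is illustrative and not taken from the corpus:

```python
def encode(tokens, word2idx, max_len=None):
    """Map tokens to indices, then pad with 0 or truncate to max_len."""
    indices = [word2idx.get(tok, 1) for tok in tokens]  # 1 = <UNK>
    if max_len is not None:
        padding = [0] * max(0, max_len - len(indices))  # 0 = <PAD>
        indices = indices[:max_len] + padding
    return indices

toy_vocab = {"<PAD>": 0, "<UNK>": 1, "việt": 2, "nam": 3, "hlv": 4}

padded = encode(["hlv", "việt", "nam"], toy_vocab, max_len=5)     # short input is padded
unknown = encode(["hlv", "ronaldo"], toy_vocab, max_len=3)        # unseen word becomes <UNK>
truncated = encode(["việt", "nam", "hlv"], toy_vocab, max_len=2)  # long input is cut
```

Because padding uses index 0, downstream code can recover the real sequence length with a simple `!= 0` mask, which is what the models below do.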





## 5. DeepCT Model (Deep Contextualized Term Weighting)

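DeepCT predicts term importance weights from contextualized representations. The original method uses BERT; the model below is a lightweight stand-in that uses a bidirectional LSTM over learned embeddings and scores a query-document pair directly.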

In [5]:
class ImprovedDeepCT(nn.Module):
    """
    Simplified DeepCT-style scorer: a BiLSTM predicts a weight per term,
    and the weighted query and document representations are combined.
    Output: relevance score in (0, 2), shape [batch, 1]
    """
    
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        nn.init.normal_(self.embedding.weight, 0, 0.1)
        
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.term_weight = nn.Linear(hidden_dim * 2, 1)
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, query, doc):
        q_embed = self.embedding(query)
        q_lstm, _ = self.lstm(q_embed)
        q_weights = torch.sigmoid(self.term_weight(q_lstm))
        q_weighted = q_lstm * q_weights
        
        q_mask = (query != 0).unsqueeze(2).float()
        q_pooled = torch.sum(q_weighted * q_mask, dim=1) / (torch.sum(q_mask, dim=1) + 1e-8)
        
        d_embed = self.embedding(doc)
        d_lstm, _ = self.lstm(d_embed)
        d_weights = torch.sigmoid(self.term_weight(d_lstm))
        d_weighted = d_lstm * d_weights
        
        d_mask = (doc != 0).unsqueeze(2).float()
        d_pooled = torch.sum(d_weighted * d_mask, dim=1) / (torch.sum(d_mask, dim=1) + 1e-8)
        
        interaction = q_pooled * d_pooled
        score = torch.sigmoid(torch.mean(q_pooled + d_pooled + interaction, dim=1, keepdim=True)) * 2
        
        return score

print("✓ ImprovedDeepCT model defined")

✓ ImprovedDeepCT model defined
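
The pooling step in `forward` scales each position's hidden vector by its sigmoid term weight, zeroes out padded positions, and divides by the number of real tokens. The same arithmetic for a single sequence is sketched below in plain Python so it can be checked by hand. The numbers are made up, and `masked_weighted_mean` is a hypothetical name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def masked_weighted_mean(hidden, weight_logits, token_ids, pad_id=0):
    """Average weight-scaled hidden vectors over non-padding positions."""
    dim = len(hidden[0])
    total = [0.0] * dim
    n_real = 0
    for vec, logit, tok in zip(hidden, weight_logits, token_ids):
        if tok == pad_id:
            continue  # padded positions contribute nothing
        w = sigmoid(logit)  # term weight in (0, 1)
        total = [t + w * v for t, v in zip(total, vec)]
        n_real += 1
    # Small epsilon guards against an all-padding sequence
    return [t / (n_real + 1e-8) for t in total]

hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row belongs to a pad token
pooled = masked_weighted_mean(hidden, [0.0, 0.0, 5.0], token_ids=[7, 8, 0])
```

With both real tokens weighted at `sigmoid(0) = 0.5`, the pooled vector is approximately `[1.0, 1.5]`, and the padded row is ignored regardless of its weight.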


## 6. Conv-KNRM Model (Convolutional Kernel-based Neural Ranking)

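Conv-KNRM builds n-gram representations of the query and document with 1D convolutions, computes their cosine-similarity matrix, and summarizes it with Gaussian kernel pooling; a small feed-forward network maps the pooled features to a score. The version below matches only n-grams of equal size (unigram, bigram, trigram).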

In [6]:
class ImprovedConvKNRM(nn.Module):
    """
    Improved Conv-KNRM with shared embeddings
    """
    
    def __init__(self, vocab_size, embed_dim=128, n_kernels=11, embedding_layer=None):
        super().__init__()
        
        if embedding_layer is not None:
            self.embedding = embedding_layer
        else:
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            nn.init.normal_(self.embedding.weight, 0, 0.1)
        
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, embed_dim, k, padding=k//2) for k in [1, 2, 3]
        ])
        
        self.n_kernels = n_kernels
        self.kernel_mus = nn.Parameter(torch.linspace(-1, 1, n_kernels), requires_grad=False)
        self.kernel_sigmas = nn.Parameter(torch.full((n_kernels,), 0.1), requires_grad=False)
        
        self.fc = nn.Sequential(
            nn.Linear(n_kernels * 3, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 1)
        )
    
    def kernel_pooling(self, sim_matrix):
        sim_expanded = sim_matrix.unsqueeze(-1)
        kernel_vals = torch.exp(-((sim_expanded - self.kernel_mus) ** 2) / (2 * self.kernel_sigmas ** 2))
        K = torch.sum(kernel_vals, dim=2)
        pooled = torch.sum(torch.log(K + 1e-10), dim=1)
        return pooled
    
    def forward(self, query, doc):
        q_embed = self.embedding(query)
        d_embed = self.embedding(doc)
        
        all_features = []
        for conv in self.convs:
            q_conv = conv(q_embed.transpose(1, 2)).transpose(1, 2)
            d_conv = conv(d_embed.transpose(1, 2)).transpose(1, 2)
            
            q_norm = F.normalize(q_conv, p=2, dim=-1)
            d_norm = F.normalize(d_conv, p=2, dim=-1)
            
            sim = torch.bmm(q_norm, d_norm.transpose(1, 2))
            pooled = self.kernel_pooling(sim)
            all_features.append(pooled)
        
        features = torch.cat(all_features, dim=-1)
        scores = self.fc(features)
        return scores

print("✓ ImprovedConvKNRM model defined")

✓ ImprovedConvKNRM model defined
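
Kernel pooling turns a query-by-document similarity matrix into a fixed-length feature vector. For each query term, a Gaussian kernel softly counts how many document terms have a similarity near the kernel mean, and the log of those counts is summed over query terms. Below is a plain-Python sketch of what `kernel_pooling` computes for one query-document pair, using 11 means spread over [-1, 1] and a width of 0.1 as in the class above. The similarity values are invented.

```python
import math

def kernel_pooling(sim_matrix, mus, sigma=0.1):
    """Return one soft-match feature per kernel for a single query/document pair."""
    features = []
    for mu in mus:
        log_sum = 0.0
        for row in sim_matrix:  # one row per query term
            # Soft count of document terms whose similarity is close to mu
            soft_tf = sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row)
            log_sum += math.log(soft_tf + 1e-10)
        features.append(log_sum)
    return features

mus = [-1 + 0.2 * k for k in range(11)]  # 11 kernel means from -1 to 1
sim = [[1.0, 0.1, -0.3],                 # query term 1 vs. three document terms
       [0.2, 0.9, 0.0]]                  # query term 2
features = kernel_pooling(sim, mus)
```

The last feature corresponds to the kernel centred at 1.0: the first query term has an exact match (soft count close to 1, log close to 0) and the second has a near match at 0.9, so the feature comes out close to -0.5.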


## 7. Combined DeepCT + Conv-KNRM Model

In [7]:
class DeepCT_ConvKNRM(nn.Module):
    """
    Combined model: DeepCT + Conv-KNRM with shared embeddings
    """
    
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, n_kernels=11):
        super().__init__()
        
        # Shared embedding
        self.shared_embedding = nn.Embedding(vocab_size, embed_dim)
        nn.init.normal_(self.shared_embedding.weight, 0, 0.1)
        
        # DeepCT
        self.deepct = ImprovedDeepCT(vocab_size, embed_dim, hidden_dim)
        self.deepct.embedding = self.shared_embedding
        
        # Conv-KNRM
        self.convknrm = ImprovedConvKNRM(vocab_size, embed_dim, n_kernels, self.shared_embedding)
    
    def forward(self, query, doc):
        deepct_score = self.deepct(query, doc)
        convknrm_score = self.convknrm(query, doc)
        
        # Combine scores
        combined_score = (deepct_score + convknrm_score) / 2
        return combined_score, deepct_score

print("✓ DeepCT_ConvKNRM combined model defined")

✓ DeepCT_ConvKNRM combined model defined


## 8. Dataset Preparation

Generate query-document pairs with relevance labels for training.

In [8]:
def generate_query_doc_pairs(documents, processor, vocab, num_pairs=5000):
    """
    Generate query-document pairs with pseudo-relevance labels
    
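    Strategy:
    - Positive: a contiguous span of 2-4 title tokens, paired with its own document
    - Negative: the same query paired with a different, randomly chosen document
      (query overlap is not checked, so some negatives may in fact be relevant)
    
    Each of the num_pairs iterations yields up to two pairs.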
    """
    
    print(f"\n🔨 Generating {num_pairs} query-document pairs...")
    
    pairs = []
    
    for _ in tqdm(range(num_pairs), desc="Generating pairs"):
        # Select random document
        doc = random.choice(documents)
        title = doc.get('title', '')
        content = doc.get('content', '')
        
        if not title or not content:
            continue
        
        # Generate query from the title tokens
        title_tokens = processor.preprocess(title)
        if len(title_tokens) < 3:
            continue
        
        # Query: a contiguous span of 2-4 title tokens at a random offset
        query_len = random.randint(2, min(4, len(title_tokens)))
        start_idx = random.randint(0, max(0, len(title_tokens) - query_len))
        query_tokens = title_tokens[start_idx:start_idx + query_len]
        
        # Document tokens
        doc_tokens = processor.preprocess(f"{title} {content}")
        
        if not query_tokens or not doc_tokens:
            continue
        
        # Positive pair (label=1)
        pairs.append({
            'query': query_tokens,
            'document': doc_tokens,
            'label': 1  # Relevant
        })
        
        # Negative pair: a different random document (label=0); query overlap is not checked
        neg_doc = random.choice(documents)
        neg_content = f"{neg_doc.get('title', '')} {neg_doc.get('content', '')}"
        neg_tokens = processor.preprocess(neg_content)
        
        if neg_tokens and neg_doc != doc:
            pairs.append({
                'query': query_tokens,
                'document': neg_tokens,
                'label': 0  # Non-relevant
            })
    
    print(f"✓ Generated {len(pairs)} pairs")
    print(f"  - Positive pairs: {sum(1 for p in pairs if p['label'] == 1)}")
    print(f"  - Negative pairs: {sum(1 for p in pairs if p['label'] == 0)}")
    
    return pairs

# Generate training data
train_pairs = generate_query_doc_pairs(documents, processor, vocab, num_pairs=3000)


🔨 Generating 3000 query-document pairs...

Generating pairs: 100%|██████████| 3000/3000 [02:33<00:00, 19.55it/s]

✓ Generated 5798 pairs
  - Positive pairs: 2900
  - Negative pairs: 2898
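
The negatives above are only guaranteed to be a different document, so a negative may still contain the query terms. One possible refinement, sketched under the assumption that token overlap is a reasonable proxy for relevance, is to resample until a document shares no token with the query. `sample_negative` and `max_tries` are hypothetical names.

```python
import random

def sample_negative(query_tokens, tokenized_docs, positive_idx, rng, max_tries=20):
    """Return the index of a document sharing no token with the query, or None."""
    query_set = set(query_tokens)
    for _ in range(max_tries):
        idx = rng.randrange(len(tokenized_docs))
        if idx == positive_idx:
            continue
        if query_set.isdisjoint(tokenized_docs[idx]):
            return idx
    return None  # caller can skip the negative for this query

# Toy tokenized corpus (illustrative only)
docs = [
    ["quang_hải", "ghi", "bàn"],
    ["quang_hải", "chấn_thương"],
    ["hlv", "họp", "báo"],
]
neg_idx = sample_negative(["quang_hải"], docs, positive_idx=0, rng=random.Random(42))
```

Here only the third document qualifies, so the function returns index 2 or, if every try happens to miss, `None`. Pre-tokenizing the corpus once would also avoid re-running the tokenizer for every sampled negative.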





In [9]:
class RankingDataset(Dataset):
    """PyTorch Dataset for query-document ranking"""
    
    def __init__(self, pairs, vocab, max_query_len=20, max_doc_len=200):
        self.pairs = pairs
        self.vocab = vocab
        self.max_query_len = max_query_len
        self.max_doc_len = max_doc_len
        
    def __len__(self):
        return len(self.pairs)
    
    def __getitem__(self, idx):
        pair = self.pairs[idx]
        
        # Encode query and document
        query_indices = self.vocab.encode(pair['query'], max_len=self.max_query_len)
        doc_indices = self.vocab.encode(pair['document'], max_len=self.max_doc_len)
        
        return {
            'query': torch.LongTensor(query_indices),
            'document': torch.LongTensor(doc_indices),
            'label': torch.FloatTensor([pair['label']])
        }

# Create dataset and dataloader
train_dataset = RankingDataset(train_pairs, vocab, max_query_len=20, max_doc_len=200)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

print(f"\n✅ Dataset ready:")
print(f"  - Training samples: {len(train_dataset)}")
print(f"  - Batch size: 32")
print(f"  - Number of batches: {len(train_loader)}")


✅ Dataset ready:
  - Training samples: 5798
  - Batch size: 32
  - Number of batches: 182
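
The batch count follows from the sample count: with `drop_last` left at its default of `False`, `DataLoader` yields the ceiling of N / batch_size batches, and the final batch holds the remainder. A quick check of the numbers printed above:

```python
import math

n_samples, batch_size = 5798, 32

n_batches = math.ceil(n_samples / batch_size)
# Samples left over for the final, partial batch
last_batch = n_samples - (n_batches - 1) * batch_size
```

This gives 182 batches, with 6 samples in the last one.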


## 9. Training Pipeline

In [10]:
def train_model(model, train_loader, epochs=10, lr=0.001, device='cpu'):
    """Train DeepCT + Conv-KNRM model"""
    
    model = model.to(device)
    model.train()
    
    # Binary cross-entropy loss for ranking
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    
    history = {'loss': [], 'accuracy': []}
    
    print(f"\n🏋️ Training on {device}...")
    print(f"Epochs: {epochs}, Learning rate: {lr}\n")
    
    for epoch in range(epochs):
        epoch_loss = 0.0
        correct = 0
        total = 0
        
        pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}")
        
        for batch in pbar:
            query = batch['query'].to(device)
            document = batch['document'].to(device)
            label = batch['label'].to(device)
            
            # Forward pass
            scores, _ = model(query, document)
            
            # Compute loss
            loss = criterion(scores, label)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            # Statistics
            epoch_loss += loss.item()
            predictions = (torch.sigmoid(scores) > 0.5).float()
            correct += (predictions == label).sum().item()
            total += label.size(0)
            
            # Update progress bar
            pbar.set_postfix({
                'loss': f'{loss.item():.4f}',
                'acc': f'{100*correct/total:.2f}%'
            })
        
        # Epoch statistics
        avg_loss = epoch_loss / len(train_loader)
        accuracy = 100 * correct / total
        
        history['loss'].append(avg_loss)
        history['accuracy'].append(accuracy)
        
        print(f"Epoch {epoch+1}/{epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy:.2f}%\n")
    
    print("✅ Training completed!")
    return model, history

# Initialize model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

model = DeepCT_ConvKNRM(
    vocab_size=vocab_size,
    embed_dim=128,
    hidden_dim=128,
    n_kernels=11
)

print(f"\n📊 Model summary:")
print(f"  - Vocabulary size: {vocab_size}")
print(f"  - Embedding dim: 128")
print(f"  - Hidden dim: 128")
print(f"  - Kernels: 11")
print(f"  - Total parameters: {sum(p.numel() for p in model.parameters()):,}")

Using device: cpu

📊 Model summary:
  - Vocabulary size: 10369
  - Embedding dim: 128
  - Hidden dim: 128
  - Kernels: 11
  - Total parameters: 1,692,632
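
The parameter total can be reproduced by hand from the layer shapes, which doubles as a check that the embedding is shared and counted once. The arithmetic below assumes PyTorch's standard parameter layout for `nn.LSTM` (input-hidden and hidden-hidden weights plus two bias vectors per direction):

```python
vocab_size, embed_dim, hidden_dim, n_kernels = 10369, 128, 128, 11

# Shared embedding table, counted once
embedding = vocab_size * embed_dim
# Per direction: W_ih (4H x E) + W_hh (4H x H) + b_ih (4H) + b_hh (4H)
lstm = 2 * (4 * hidden_dim * (embed_dim + hidden_dim) + 8 * hidden_dim)
# Linear(2H, 1) term-weight head
term_weight = 2 * hidden_dim + 1
# Conv1d(E, E, k) for k = 1, 2, 3: weights plus biases
convs = sum(embed_dim * embed_dim * k + embed_dim for k in (1, 2, 3))
# Frozen kernel means and widths are still nn.Parameter objects
kernels = 2 * n_kernels
# Linear(33, 64) followed by Linear(64, 1)
fc = (3 * n_kernels * 64 + 64) + (64 + 1)

total = embedding + lstm + term_weight + convs + kernels + fc
```

The sum matches the 1,692,632 reported above. The embedding table alone accounts for about 1.33 million of those parameters.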


### 🏋️ Train the Model

Run this cell to start training. In the CPU run logged below, 20 epochs took about 27 minutes (roughly 75-100 seconds per epoch).

In [11]:
# 🏋️ TRAIN THE MODEL
# ====================
print("⚡ Starting training process...")
print("  This will take several minutes depending on your hardware\n")

# Train the model
trained_model, history = train_model(model, train_loader, epochs=20, lr=0.001, device=device)

# Save trained model
torch.save(trained_model.state_dict(), 'deepct_convknrm_vi.pth')
print("\n💾 Model saved to deepct_convknrm_vi.pth")
print("✅ Training complete! Model ready for search.")

⚡ Starting training process...
  This will take several minutes depending on your hardware

🏋️ Training on cpu...
Epochs: 20, Learning rate: 0.001

Epoch 1/20: 100%|██████████| 182/182 [01:35<00:00,  1.90it/s, loss=0.4648, acc=56.38%]
Epoch 1/20: Loss = 1.8810, Accuracy = 56.38%

Epoch 2/20: 100%|██████████| 182/182 [01:26<00:00,  2.11it/s, loss=0.4447, acc=81.77%]
Epoch 2/20: Loss = 0.4346, Accuracy = 81.77%

Epoch 3/20: 100%|██████████| 182/182 [01:31<00:00,  2.00it/s, loss=0.0329, acc=94.03%]
Epoch 3/20: Loss = 0.1819, Accuracy = 94.03%

Epoch 4/20: 100%|██████████| 182/182 [01:18<00:00,  2.33it/s, loss=0.8219, acc=97.02%]
Epoch 4/20: Loss = 0.0982, Accuracy = 97.02%

Epoch 5/20: 100%|██████████| 182/182 [01:16<00:00,  2.36it/s, loss=0.0640, acc=98.07%]
Epoch 5/20: Loss = 0.0646, Accuracy = 98.07%

Epoch 6/20: 100%|██████████| 182/182 [01:16<00:00,  2.37it/s, loss=0.0189, acc=98.84%]
Epoch 6/20: Loss = 0.0392, Accuracy = 98.84%

Epoch 7/20: 100%|██████████| 182/182 [01:17<00:00,  2.35it/s, loss=0.0455, acc=99.09%]
Epoch 7/20: Loss = 0.0317, Accuracy = 99.09%

Epoch 8/20: 100%|██████████| 182/182 [01:30<00:00,  2.01it/s, loss=0.0004, acc=99.55%]
Epoch 8/20: Loss = 0.0164, Accuracy = 99.55%

Epoch 9/20: 100%|██████████| 182/182 [01:17<00:00,  2.34it/s, loss=0.0075, acc=99.36%]
Epoch 9/20: Loss = 0.0196, Accuracy = 99.36%

Epoch 10/20: 100%|██████████| 182/182 [01:18<00:00,  2.33it/s, loss=0.0070, acc=99.48%]
Epoch 10/20: Loss = 0.0158, Accuracy = 99.48%

Epoch 11/20: 100%|██████████| 182/182 [01:18<00:00,  2.30it/s, loss=0.0010, acc=99.66%]
Epoch 11/20: Loss = 0.0129, Accuracy = 99.66%

Epoch 12/20: 100%|██████████| 182/182 [01:40<00:00,  1.81it/s, loss=0.0015, acc=99.74%]
Epoch 12/20: Loss = 0.0089, Accuracy = 99.74%

Epoch 13/20: 100%|██████████| 182/182 [01:41<00:00,  1.79it/s, loss=0.0007, acc=99.66%]
Epoch 13/20: Loss = 0.0108, Accuracy = 99.66%

Epoch 14/20: 100%|██████████| 182/182 [01:15<00:00,  2.41it/s, loss=0.0007, acc=99.53%]
Epoch 14/20: Loss = 0.0130, Accuracy = 99.53%

Epoch 15/20: 100%|██████████| 182/182 [01:18<00:00,  2.32it/s, loss=0.0288, acc=99.12%]
Epoch 15/20: Loss = 0.0285, Accuracy = 99.12%

Epoch 16/20: 100%|██████████| 182/182 [01:16<00:00,  2.37it/s, loss=0.0019, acc=99.09%]
Epoch 16/20: Loss = 0.0317, Accuracy = 99.09%

Epoch 17/20: 100%|██████████| 182/182 [01:18<00:00,  2.32it/s, loss=0.6360, acc=99.26%]
Epoch 17/20: Loss = 0.0267, Accuracy = 99.26%

Epoch 18/20: 100%|██████████| 182/182 [01:17<00:00,  2.34it/s, loss=0.0423, acc=99.24%]
Epoch 18/20: Loss = 0.0235, Accuracy = 99.24%

Epoch 19/20: 100%|██████████| 182/182 [01:17<00:00,  2.34it/s, loss=0.0133, acc=99.41%]
Epoch 19/20: Loss = 0.0176, Accuracy = 99.41%

Epoch 20/20: 100%|██████████| 182/182 [01:17<00:00,  2.36it/s, loss=0.0000, acc=99.66%]
Epoch 20/20: Loss = 0.0110, Accuracy = 99.66%

✅ Training completed!

💾 Model saved to deepct_convknrm_vi.pth
✅ Training complete! Model ready for search.
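
The training loss is not monotonic: it bottoms out at epoch 12 and drifts upward for several epochs afterwards. Since `train_model` saves only the final weights, one option is to track the best epoch from `history['loss']`. The sketch below uses the per-epoch losses copied from the log above. With no validation split in this notebook, training loss is the only signal available, so treat this as a rough heuristic rather than a proper early-stopping criterion.

```python
# Per-epoch average training loss, copied from the log above
losses = [1.8810, 0.4346, 0.1819, 0.0982, 0.0646, 0.0392, 0.0317, 0.0164,
          0.0196, 0.0158, 0.0129, 0.0089, 0.0108, 0.0130, 0.0285, 0.0317,
          0.0267, 0.0235, 0.0176, 0.0110]

# Index of the smallest loss; epochs are 1-based in the log
best_idx = min(range(len(losses)), key=losses.__getitem__)
best_epoch = best_idx + 1
best_loss = losses[best_idx]
```

In a training loop, the same comparison could decide when to save a checkpoint, for example saving the state dict whenever the epoch loss improves on the best seen so far.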


## 10. Search & Ranking Demo

Test the trained model with real queries.

In [12]:
# ⚡ INITIALIZE FAST SEARCH ENGINE WITH BATCH INFERENCE
# ======================================================
print("🔍 Creating Fast Search Engine...")
print("📦 Pre-encoding all documents for super fast search...\n")

# Pre-encode all documents once
print("📄 Step 1: Encoding all documents...")
all_doc_tensors = []
for doc in tqdm(documents, desc="Encoding"):
    title = doc.get('title', '')
    content = doc.get('content', '')
    full_text = f"{title} {content}"
    doc_tokens = processor.preprocess(full_text)
    doc_indices = vocab.encode(doc_tokens, max_len=200)
    all_doc_tensors.append(torch.LongTensor(doc_indices))

# Stack into batch tensor
doc_batch_tensor = torch.stack(all_doc_tensors).to(device)
print(f"✓ Encoded documents shape: {doc_batch_tensor.shape}")
print("✓ Pre-encoding complete! Now search will be 10-15x faster!\n")

# Fast search function
def fast_search(query_text, top_k=5, batch_size=256):
    """Fast batch search - takes ~5-10 seconds instead of 90s!"""
    
    # Encode query
    query_tokens = processor.preprocess(query_text)
    query_indices = vocab.encode(query_tokens, max_len=20)
    query_tensor = torch.LongTensor(query_indices).unsqueeze(0).to(device)
    
    all_scores = []
    
    print(f"\n⚡ Fast ranking {len(documents)} docs in batches of {batch_size}...")
    
    # Switch to eval mode so dropout is disabled during inference
    trained_model.eval()
    with torch.no_grad():
        num_docs = len(documents)
        num_batches = (num_docs + batch_size - 1) // batch_size
        
        for i in tqdm(range(num_batches), desc="Ranking"):
            start = i * batch_size
            end = min((i + 1) * batch_size, num_docs)
            
            # Get batch
            doc_batch = doc_batch_tensor[start:end]
            batch_len = doc_batch.size(0)
            
            # Expand query
            query_batch = query_tensor.expand(batch_len, -1)
            
            # Score batch
            scores, _ = trained_model(query_batch, doc_batch)
            scores = torch.sigmoid(scores).squeeze(-1).cpu().numpy()
            all_scores.extend(scores)
    
    # Create results
    results = [(i, float(score), documents[i]) for i, score in enumerate(all_scores)]
    results.sort(key=lambda x: x[1], reverse=True)
    
    return results[:top_k]

def display_fast_results(results):
    """Display search results nicely"""
    print(f"\n{'='*100}")
    print(f"🏆 TOP {len(results)} RESULTS")
    print(f"{'='*100}\n")
    
    for rank, (idx, score, doc) in enumerate(results, 1):
        title = doc.get('title', 'No title')
        content = doc.get('content', '')
        date = doc.get('date', 'No date')
        author = doc.get('author', 'Unknown')
        snippet = content[:200] + "..." if len(content) > 200 else content
        
        print(f"[{rank}] 📊 SCORE: {score:.4f}")
        print(f"📰 Title: {title}")
        print(f"📅 Date: {date} | ✍️ Author: {author}")
        print(f"📝 Content: {snippet}")
        print(f"{'-'*100}\n")

print("✅ Fast search engine ready!")
print("\n💡 Example queries:")
for q in ["Quang Hải", "HLV Park Hang-seo", "đội tuyển Việt Nam", "V-League"]:
    print(f"  • {q}")
print("\n🚀 Usage: results = fast_search('your query', top_k=5)")

🔍 Creating Fast Search Engine...
📦 Pre-encoding all documents for super fast search...

📄 Step 1: Encoding all documents...

Encoding: 100%|██████████| 1756/1756 [00:19<00:00, 92.23it/s]

✓ Encoded documents shape: torch.Size([1756, 200])
✓ Pre-encoding complete! Now search will be 10-15x faster!

✅ Fast search engine ready!

💡 Example queries:
  • Quang Hải
  • HLV Park Hang-seo
  • đội tuyển Việt Nam
  • V-League

🚀 Usage: results = fast_search('your query', top_k=5)
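
`fast_search` walks the pre-encoded tensor in fixed-size slices. For the 1,756 documents here and the default `batch_size=256`, that means seven forward passes, the last one on a partial batch. The slicing arithmetic in isolation (`batch_ranges` is a hypothetical helper, not part of the notebook):

```python
def batch_ranges(num_docs, batch_size):
    """Yield (start, end) index pairs that cover range(num_docs) in order."""
    for start in range(0, num_docs, batch_size):
        yield start, min(start + batch_size, num_docs)

ranges = list(batch_ranges(1756, 256))
```

The final slice runs from 1536 to 1756 and holds 220 documents. A larger `batch_size` means fewer passes at the cost of more memory per pass.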





### 🔍 Interactive Fast Search Demo

**⚠️ IMPORTANT: Run cells in this order first:**
1. Previous cell (Initialize fast search - pre-encode documents) - Takes ~30s
2. This cell (Interactive search) - Takes ~5-10s per query

**Why so fast?**
- ✅ Pre-encode ALL documents once (batch tensor)
- ✅ Batch inference (256 docs at a time)
- ✅ 10-15x faster: 91s → 5-10s per search!

In [13]:
# 🔍 INTERACTIVE FAST SEARCH DEMO
# =================================
# Enter a query and get results right away.

print("="*80)
print("🔍 VIETNAMESE FOOTBALL SEARCH ENGINE (TRAINED MODEL)")
print("="*80)
print("\n💬 Enter search keywords (e.g. 'Quang Hải', 'Park Hang-seo')")
print("Leave empty and press Enter to exit\n")

# Input query from user
query = input("🔎 Search: ").strip()

if query:
    top_k = 5
    
    print(f"\n⚡ Searching for: '{query}'")
    print(f"📊 Showing top {top_k} results\n")
    
    # Use fast search function
    results = fast_search(query, top_k=top_k)
    display_fast_results(results)
    
    print("\n✅ Done! Re-run this cell to search for another query.")
    print("⏱️ Search time: ~5-10 seconds (10x faster than before!)")
else:
    print("❌ No query given. Please enter search keywords!")

🔍 VIETNAMESE FOOTBALL SEARCH ENGINE (TRAINED MODEL)

💬 Enter search keywords (e.g. 'Quang Hải', 'Park Hang-seo')
Leave empty and press Enter to exit

❌ No query given. Please enter search keywords!
