# Notebook 06: Full Inference for Submission
**Goals:**
1. Run the prediction pipeline on **ALL** dataset partitions: Train, Validation, and Test.
2. Generate a `pred.json` file for each paper.
3. Arrange the results in the directory structure required for submission (Submission Ready).

**Model:** Decision Tree/XGBoost with **6 selected features**

**Output structure:**
```bash
submission_final/
├── <Student_ID>/
│   ├── <paper_id_1>/
│   │   └── pred.json
│   ├── <paper_id_2>/
│   │   └── pred.json
...
```
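
As a minimal sketch of how one paper's result could be written into this layout, the helper below creates `<root>/<Student_ID>/<paper_id>/pred.json`. The function name `write_prediction` and the demo IDs are hypothetical. The three payload fields (`partition`, `groundtruth`, `prediction`) follow the dictionary this notebook builds before saving. The notebook itself saves through `save_json` from `src.ml`, whose implementation is not shown here, so plain `json.dump` stands in for it.

```python
import json
import os
import tempfile


def write_prediction(root, student_id, paper_id, partition, groundtruth, prediction):
    """Write one paper's pred.json under <root>/<student_id>/<paper_id>/ and return its path."""
    out_dir = os.path.join(root, student_id, str(paper_id).strip())
    os.makedirs(out_dir, exist_ok=True)

    payload = {
        "partition": partition,
        "groundtruth": groundtruth,   # {bib_key: true_arxiv_id}
        "prediction": prediction,     # {bib_key: [ranked candidate ids]}
    }
    path = os.path.join(out_dir, "pred.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)
    return path


# Demo in a throwaway directory; all IDs below are made up for illustration.
with tempfile.TemporaryDirectory() as root:
    path = write_prediction(
        root, "student_0000", "paper_0001", "test",
        groundtruth={"ref_0": "cand_a"},
        prediction={"ref_0": ["cand_a", "cand_b", "cand_c"]},
    )
    print(os.path.relpath(path, root))
```

Using `os.makedirs(..., exist_ok=True)` keeps the helper safe to call again when a partition is re-run.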

In [1]:
import pandas as pd
import numpy as np
import json
import os
import sys
import joblib
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set up the path so the src module can be imported
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..')))

# Import from the src.ml module (reusing the refactored functions)
from src.ml import (
    load_json,
    save_json,
    normalize_text_basic,
    compute_pairwise_features,
    compute_tfidf_cosine_single,
    parse_bibtex_smart,
    transform_to_paper_based
)

pd.set_option('display.max_columns', None)
print("✅ Libraries and src.ml modules imported successfully!")

# --- LIST OF THE 6 SELECTED FEATURES (must match Notebook 03/04) ---
SELECTED_FEATURES = [
    'feat_title_tfidf_cosine',
    'feat_title_len_diff',
    'feat_auth_jaccard',
    'feat_year_match',
    'feat_id_match',
    'feat_first_auth_match',
]
print(f"📝 Using {len(SELECTED_FEATURES)} selected features")

✅ Libraries and src.ml modules imported successfully!
📝 Using 6 selected features


In [2]:
# --- KEY CONFIGURATION ---

# 1. Student info (used to create the submission folder)
STUDENT_ID = "23127011"

# 2. Data paths
DATASET_ROOT = '../../dataset_final'  # Contains labels.json (ground truth)
DATA_FOLDER = '../../data_output_v2'  # Contains the full refs.bib + references.json
PARTITIONS = ['train', 'validation', 'test']

# 3. Model paths
MODEL_PATH = '../../dataset_final/models/best_matcher.pkl'
FEATURE_NAME_PATH = '../../dataset_final/models/feature_names.pkl'

# 4. Output directory
SUBMISSION_DIR = f'submission_final/{STUDENT_ID}'

if not os.path.exists(SUBMISSION_DIR):
    os.makedirs(SUBMISSION_DIR)
    print(f"📁 Created submission directory: {os.path.abspath(SUBMISSION_DIR)}")
else:
    print(f"📁 Saving to existing directory: {os.path.abspath(SUBMISSION_DIR)}")

print(f"📂 Dataset (labels): {os.path.abspath(DATASET_ROOT)}")
print(f"📂 Data (refs+references): {os.path.abspath(DATA_FOLDER)}")


📁 Created submission directory: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\23127011\notebooks\submission_final\23127011
📂 Dataset (labels): d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\dataset_final
📂 Data (refs+references): d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\data_output_v2


In [3]:
# --- OPTIMIZED HELPER FUNCTIONS WITH TIMING ---
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import time

# Global timing stats
TIMING_STATS = {
    'feature_compute': 0.0,
    'tfidf_compute': 0.0,
    'model_predict': 0.0,
    'total_queries': 0
}

def batch_compute_features(pairs_list):
    """Compute features for a batch of pairs, one pair at a time (not vectorized)."""
    if not pairs_list:
        return []

    return [compute_pairwise_features(row) for row in pairs_list]


def rank_paper_fast(queries, candidates_list, model, feature_names):
    """
    Rank all queries of one paper in a single pass.
    
    KEY OPTIMIZATION:
    - Pre-compute the candidates' TF-IDF vectors ONCE per paper
    - Reuse them for every query
    """
    global TIMING_STATS
    
    if not candidates_list:
        return {}
    
    predictions = {}
    
    # Pre-normalize titles once
    cand_titles = [normalize_text_basic(c.get('cand_title', '')) for c in candidates_list]
    cand_ids = [c['cand_id'] for c in candidates_list]
    query_titles = [normalize_text_basic(q.get('bib_title', '')) for q in queries]
    
    # === PRE-COMPUTE TF-IDF FOR THE WHOLE PAPER (ONCE ONLY) ===
    t_tfidf_start = time.time()
    
    # Merge all unique titles (queries + candidates)
    all_titles = list(set(query_titles + cand_titles))
    
    try:
        # Fit the vectorizer once for this paper
        vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4), min_df=1)
        vectorizer.fit(all_titles)
        
        # Transform the candidates once
        cand_vecs = vectorizer.transform(cand_titles)
        
        # Transform the queries once
        query_vecs = vectorizer.transform(query_titles)
        
        # Pre-compute all TF-IDF scores (queries x candidates)
        # Shape: (n_queries, n_candidates)
        tfidf_matrix = cosine_similarity(query_vecs, cand_vecs)
        
    except Exception as e:
        # Fall back to all-zero similarities if vectorization fails (e.g. empty vocabulary)
        print(f"⚠️ TF-IDF failed, using zeros: {e}")
        tfidf_matrix = np.zeros((len(queries), len(candidates_list)))
    
    TIMING_STATS['tfidf_compute'] += time.time() - t_tfidf_start
    
    # === PROCESS EACH QUERY (TF-IDF already available) ===
    for q_idx, query in enumerate(queries):
        bib_key = query['key']
        
        # Build the pairs
        pairs = []
        for cand in candidates_list:
            row = {}
            row.update(query)
            row.update(cand)
            pairs.append(row)
        
        # --- PHASE 1: Feature Compute (WITHOUT TF-IDF) ---
        t0 = time.time()
        feats_list = batch_compute_features(pairs)
        TIMING_STATS['feature_compute'] += time.time() - t0
        
        # Add TF-IDF from the pre-computed matrix
        for c_idx, feats in enumerate(feats_list):
            feats['feat_title_tfidf_cosine'] = tfidf_matrix[q_idx, c_idx]
        
        # Create the DataFrame and predict
        df_feats = pd.DataFrame(feats_list)
        
        # Fill missing cols
        for col in feature_names:
            if col not in df_feats.columns:
                df_feats[col] = 0.0
        
        X_input = df_feats[feature_names].values
        
        # --- PHASE 2: Model Predict ---
        t2 = time.time()
        if model is not None:
            scores = model.predict_proba(X_input)[:, 1]
        else:
            # No model loaded: random scores keep the pipeline running but are meaningless
            scores = np.random.rand(len(pairs))
        TIMING_STATS['model_predict'] += time.time() - t2
        
        # Ranking
        ranked_idx = np.argsort(scores)[::-1][:5]
        top_5 = [cand_ids[i] for i in ranked_idx]
        
        predictions[bib_key] = top_5
        TIMING_STATS['total_queries'] += 1
    
    return predictions


def print_timing_stats():
    """Print the time spent in each phase."""
    total = TIMING_STATS['feature_compute'] + TIMING_STATS['tfidf_compute'] + TIMING_STATS['model_predict']
    n_queries = max(TIMING_STATS['total_queries'], 1)
    
    print("\n" + "="*50)
    print("⏱️  TIMING BREAKDOWN")
    print("="*50)
    print(f"{'Phase':<25} | {'Total (s)':<12} | {'Avg/Query (ms)':<15} | {'%':<8}")
    print("-"*65)
    
    for phase, label in [
        ('feature_compute', 'Feature Extraction'),
        ('tfidf_compute', 'TF-IDF Cosine'),
        ('model_predict', 'Model Prediction')
    ]:
        t = TIMING_STATS[phase]
        pct = (t / total * 100) if total > 0 else 0
        avg_ms = (t / n_queries) * 1000
        print(f"{label:<25} | {t:>10.2f}s | {avg_ms:>12.2f}ms | {pct:>6.1f}%")
    
    print("-"*65)
    print(f"{'TOTAL':<25} | {total:>10.2f}s | {(total/n_queries)*1000:>12.2f}ms | 100.0%")
    print(f"\n📊 Processed {n_queries} queries")
    print("="*50)


def reset_timing_stats():
    """Reset timing stats."""
    global TIMING_STATS
    TIMING_STATS = {
        'feature_compute': 0.0,
        'tfidf_compute': 0.0,
        'model_predict': 0.0,
        'total_queries': 0
    }


print("✅ SUPER Optimized functions ready (TF-IDF pre-computed per paper).")

✅ SUPER Optimized functions ready (TF-IDF pre-computed per paper).
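
The speed-up in `rank_paper_fast` comes from building the candidate-side vectors once per paper and reusing them for every query. The sketch below illustrates that pattern with the standard library only. Note the assumption: it uses unweighted character n-gram counts over the whole padded title as a simplified stand-in for scikit-learn's `TfidfVectorizer(analyzer='char_wb')`, so its scores will differ from the notebook's. The names `char_ngrams`, `cosine`, and `similarity_matrix` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams (n_min..n_max) of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(query_titles, cand_titles):
    """Return a (n_queries x n_candidates) list of lists of similarities."""
    # Candidate vectors are built once and reused for every query.
    cand_vecs = [char_ngrams(t) for t in cand_titles]
    matrix = []
    for q in query_titles:
        q_vec = char_ngrams(q)  # each query is vectorized once as well
        matrix.append([cosine(q_vec, c) for c in cand_vecs])
    return matrix


# Illustrative titles only.
sims = similarity_matrix(
    ["Deep Residual Learning"],
    ["deep residual learning for image recognition", "attention is all you need"],
)
print(sims)
```

With `n_queries` queries and `n_candidates` candidates, this vectorizes `n_queries + n_candidates` titles instead of `2 * n_queries * n_candidates`, which is the saving the docstring describes.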


In [4]:
# --- NEW HELPER FUNCTIONS: Load refs.bib + references.json ---
import bibtexparser

def load_queries_from_bib(bib_path):
    """
    Read refs.bib and return the list of queries (the TARGET entries to match).
    
    Returns:
        List[dict]: Each query has: key, bib_title, bib_authors, bib_id, bib_year
    """
    if not os.path.exists(bib_path):
        return []
    
    try:
        with open(bib_path, 'r', encoding='utf-8') as f:
            bib_content = f.read()
        
        parser = bibtexparser.bparser.BibTexParser(common_strings=True)
        parser.ignore_nonstandard_types = True
        bib_db = bibtexparser.loads(bib_content, parser=parser)
        
        queries = []
        for entry in bib_db.entries:
            # Rebuild the raw text to parse
            raw_text = entry.get('ENTRYTYPE', '') + '{'
            for k, v in entry.items():
                if k not in ['ENTRYTYPE', 'ID']:
                    raw_text += f'{k}={{{v}}}, '
            raw_text += '}'
            
            # Parse with the smart parser helper
            parsed = parse_bibtex_smart(raw_text)
            
            queries.append({
                'key': entry.get('ID', ''),
                'bib_title': parsed.get('title', ''),
                'bib_authors': parsed.get('authors', []),
                'bib_id': parsed.get('extracted_id', ''),
                'bib_year': parsed.get('year', '')
            })
        
        return queries
    except Exception as e:
        print(f"⚠️ Error parsing {bib_path}: {e}")
        return []


def load_candidates_from_references(ref_json_path):
    """
    Read references.json and return the list of candidates (the CANDIDATES pool from the arXiv API).
    
    Returns:
        List[dict]: Each candidate has: cand_id, cand_title, cand_authors, cand_year
    """
    if not os.path.exists(ref_json_path):
        return []
    
    try:
        ref_data = load_json(ref_json_path)
        candidates = []
        
        for arxiv_id, meta in ref_data.items():
            authors = meta.get('authors', [])
            date = meta.get('submission_date', '')
            year = str(date)[:4] if date and len(str(date)) >= 4 else ''
            
            candidates.append({
                'cand_id': arxiv_id,
                'cand_title': meta.get('title', '').lower().strip(),
                'cand_authors': [str(a).lower().strip() for a in authors],
                'cand_year': year
            })
        
        return candidates
    except Exception as e:
        print(f"⚠️ Error loading {ref_json_path}: {e}")
        return []


def extract_groundtruth_from_labels(labels_data):
    """
    Extract the ground truth from labels.json.
    
    Returns:
        dict: {paper_id: {bib_key: true_arxiv_id}}
    """
    from collections import defaultdict
    gt_map = defaultdict(dict)
    
    for item in labels_data:
        paper_id = item.get('source_paper_id')
        key = item.get('key')
        true_id = item.get('ground_truth', {}).get('id')
        
        if paper_id and key and true_id:
            gt_map[paper_id][key] = true_id
    
    return dict(gt_map)


print("✅ Helper functions for refs.bib + references.json loaded!")


✅ Helper functions for refs.bib + references.json loaded!


In [5]:
# --- LOAD MODEL ---
try:
    model = joblib.load(MODEL_PATH)
    feature_names = joblib.load(FEATURE_NAME_PATH)
    print(f"✅ Model loaded: {type(model).__name__}")
    print(f"📝 Expected features ({len(feature_names)}): {feature_names}")
    
    # Verify features match SELECTED_FEATURES
    if set(feature_names) == set(SELECTED_FEATURES):
        print("✅ Features match SELECTED_FEATURES")
    else:
        print("⚠️ WARNING: Features do not match! Using SELECTED_FEATURES")
        feature_names = SELECTED_FEATURES
        
except FileNotFoundError:
    print("❌ CRITICAL ERROR: Model file not found. Cannot proceed.")
    model = None
    feature_names = SELECTED_FEATURES

✅ Model loaded: Pipeline
📝 Expected features (6): ['feat_title_tfidf_cosine', 'feat_title_len_diff', 'feat_auth_jaccard', 'feat_year_match', 'feat_id_match', 'feat_first_auth_match']
✅ Features match SELECTED_FEATURES



## Main Inference Loop
This loop will:
1. Iterate over each partition (Train -> Validation -> Test).
2. Load that partition's `labels.json`.
3. Transform it into a paper-based structure.
4. Run the ranking model.
5. Save each `pred.json` into the folder matching its paper ID.
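
Alongside saving predictions, the loop scores each partition. Because only the top-5 list is searched, the reported value is effectively MRR@5: a query whose true ID is missing from its top 5 contributes 0, and queries without a ground-truth entry are left out of the average. A minimal sketch of that calculation follows. The function name `mrr_at_k` and the sample IDs are hypothetical; the logic mirrors the scoring step in the inference cell.

```python
def mrr_at_k(query_keys, groundtruth, predictions, k=5):
    """Return (mean reciprocal rank over scored queries, number of scored queries)."""
    total = 0.0
    counted = 0
    for key in query_keys:
        true_id = groundtruth.get(key)
        if not true_id:
            continue  # no label for this query: excluded from the average
        top_k = predictions.get(key, [])[:k]
        if true_id in top_k:
            total += 1.0 / (top_k.index(true_id) + 1)  # ranks are 1-based
        counted += 1  # a miss still counts, contributing 0
    return (total / counted if counted else 0.0), counted


# Made-up example: hit at rank 1, hit at rank 2, a miss, and an unlabeled query.
groundtruth = {"a": "x", "b": "y", "c": "z"}
predictions = {"a": ["x", "q"], "b": ["q", "y"], "c": ["q", "r"], "d": ["m"]}
score, n = mrr_at_k(["a", "b", "c", "d"], groundtruth, predictions)
print(score, n)  # (1 + 0.5 + 0) / 3 = 0.5 over 3 scored queries
```

Excluding unlabeled queries explains why the number of scored queries in the final report can be smaller than the number of queries ranked.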


## 🔍 Test & Validation
Check the logic before running full inference:
- Load sample papers from labels.json
- Read refs.bib and references.json
- Display sample pairs

In [6]:
# TEST: Load sample data from the first partition
print("🔍 TESTING DATA LOADING...\n")

test_partition = 'train'
label_file = os.path.join(DATASET_ROOT, test_partition, 'labels.json')

# Defaults so the pairing step below is safe if a file is missing
queries, candidates = [], []

if os.path.exists(label_file):
    # 1. Load labels
    labels_data = load_json(label_file)
    print(f"✅ Loaded {len(labels_data)} entries from labels.json")
    
    # 2. Extract groundtruth
    groundtruth_map = extract_groundtruth_from_labels(labels_data)
    paper_ids = list(groundtruth_map.keys())
    
    print(f"✅ Extracted {len(paper_ids)} unique papers")
    print(f"   Sample paper IDs: {paper_ids[:3]}\n")
    
    # 3. Test with the first paper
    test_paper_id = paper_ids[0]
    print(f"📄 Testing with paper: {test_paper_id}")
    print(f"   Groundtruth entries: {len(groundtruth_map[test_paper_id])}")
    print(f"   Sample groundtruth: {dict(list(groundtruth_map[test_paper_id].items())[:3])}\n")
    
    # 4. Load refs.bib
    paper_dir = os.path.join(DATA_FOLDER, test_paper_id)
    bib_path = os.path.join(paper_dir, 'refs.bib')
    
    if os.path.exists(bib_path):
        queries = load_queries_from_bib(bib_path)
        print(f"✅ Loaded {len(queries)} queries from refs.bib")
        
        if queries:
            print(f"\n📋 Sample Query (TARGET entry to match):")
            sample_q = queries[0]
            print(f"   Key: {sample_q['key']}")
            print(f"   Title: {sample_q['bib_title'][:100]}...")
            print(f"   Authors: {sample_q['bib_authors'][:3]}")
            print(f"   Year: {sample_q['bib_year']}")
            print(f"   ID: {sample_q['bib_id']}")
    else:
        print(f"⚠️ refs.bib not found at {bib_path}")
    
    # 5. Load references.json
    ref_json_path = os.path.join(paper_dir, 'references.json')
    
    if os.path.exists(ref_json_path):
        candidates = load_candidates_from_references(ref_json_path)
        print(f"\n✅ Loaded {len(candidates)} candidates from references.json")
        
        if candidates:
            print(f"\n🎯 Sample Candidate (from the arXiv API):")
            sample_c = candidates[0]
            print(f"   arXiv ID: {sample_c['cand_id']}")
            print(f"   Title: {sample_c['cand_title'][:100]}...")
            print(f"   Authors: {sample_c['cand_authors'][:3]}")
            print(f"   Year: {sample_c['cand_year']}")
    else:
        print(f"⚠️ references.json not found at {ref_json_path}")
    
    # 6. Show pairing example
    if queries and candidates:
        print(f"\n" + "="*60)
        print(f"📊 PAIRING STATISTICS")
        print(f"="*60)
        print(f"Total queries (refs.bib): {len(queries)}")
        print(f"Total candidates (references.json): {len(candidates)}")
        print(f"Total pairs to rank: {len(queries)} × {len(candidates)} = {len(queries) * len(candidates)}")
        print(f"\nTop-5 predictions per query → {len(queries)} × 5 = {len(queries) * 5} predictions")
        print("="*60)
        
        # Show matching example
        print(f"\n🔗 Example Pairing:")
        print(f"Query Key: {queries[0]['key']}")
        print(f"   → Groundtruth: {groundtruth_map[test_paper_id].get(queries[0]['key'], 'N/A')}")
        print(f"   → Will rank against {len(candidates)} candidates")
        print(f"   → Return top-5 arXiv IDs")
        
else:
    print(f"❌ Label file not found: {label_file}")

print("\n✅ Test completed! Ready to run full inference.")


🔍 TESTING DATA LOADING...

✅ Loaded 13348 entries from labels.json
✅ Extracted 604 unique papers
   Sample paper IDs: ['2403-04225', '2403-03951', '2403-01980']

📄 Testing with paper: 2403-04225
   Groundtruth entries: 48
   Sample groundtruth: {'ref_0': '2305-16213', 'ref_1': '2209-14988', 'ref_2': '2211-10440'}

✅ Loaded 54 queries from refs.bib

📋 Sample Query (TARGET entry to match):
   Key: ref_0
   Title: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation...
   Authors: ['Wang, Zhengyi', 'Lu, Cheng', 'Wang, Yikai']
   Year: 2024
   ID: 

✅ Loaded 48 candidates from references.json

🎯 Sample Candidate (from the arXiv API):
   arXiv ID: 2308-10490
   Title: texture generation on 3d meshes with point-uv diffusion...
   Authors: ['xin yu', 'peng dai', 'wenbo li']
   Year: 

📊 PAIRING STATISTICS
Total queries (refs.bib): 54
Total candidates (references.json): 48
Total pairs to rank: 54 × 48 = 2592

Top-5 predicti

In [7]:
# Variables for summary statistics
summary_stats = {
    'train': {'papers': 0, 'mrr': 0, 'queries': 0},
    'validation': {'papers': 0, 'mrr': 0, 'queries': 0},
    'test': {'papers': 0, 'mrr': 0, 'queries': 0}
}

# Reset timing before running
reset_timing_stats()

print("🚀 STARTING NEW INFERENCE PIPELINE (refs.bib + references.json)...\n")

for partition in PARTITIONS:
    print(f"🔵 Processing Partition: [{partition.upper()}]")
    partition_start = time.time()
    
    # 1. Load labels.json to get the ground truth and paper_ids
    t_load = time.time()
    label_file = os.path.join(DATASET_ROOT, partition, 'labels.json')
    if not os.path.exists(label_file):
        print(f"   ⚠️ Warning: File {label_file} not found. Skipping.")
        continue
        
    labels_data = load_json(label_file)
    if not labels_data:
        print("   ⚠️ Empty labels. Skipping.")
        continue
    print(f"   ⏱️ Load labels: {time.time() - t_load:.2f}s")

    # 2. Extract the ground truth and paper_ids
    t_transform = time.time()
    groundtruth_map = extract_groundtruth_from_labels(labels_data)
    paper_ids = list(groundtruth_map.keys())
    
    summary_stats[partition]['papers'] = len(paper_ids)
    print(f"   ⏱️ Extract groundtruth: {time.time() - t_transform:.2f}s")
    print(f"   ↳ Found {len(paper_ids)} papers. Processing...")
    
    # 3. NEW: Loop over each paper and load refs.bib + references.json
    part_mrr_sum = 0
    part_query_count = 0
    
    for paper_id in tqdm(paper_ids, desc=f"   Ranking {partition}"):
        # Resolve the paper folder path inside data_output_v2
        paper_dir = os.path.join(DATA_FOLDER, paper_id)
        
        if not os.path.exists(paper_dir):
            print(f"   ⚠️ Paper folder not found: {paper_id}")
            continue
        
        # 4. Load QUERIES from refs.bib (TARGET entries)
        bib_path = os.path.join(paper_dir, 'refs.bib')
        queries = load_queries_from_bib(bib_path)
        
        if not queries:
            print(f"   ⚠️ No queries in {paper_id}/refs.bib")
            continue
        
        # 5. Load CANDIDATES from references.json (Ground truth pool)
        ref_json_path = os.path.join(paper_dir, 'references.json')
        candidates = load_candidates_from_references(ref_json_path)
        
        if not candidates:
            print(f"   ⚠️ No candidates in {paper_id}/references.json")
            continue
        
        # 6. Run ranking
        predictions = rank_paper_fast(queries, candidates, model, feature_names)
        
        # 7. Build pred.json output
        groundtruth = groundtruth_map.get(paper_id, {})
        
        pred_output = {
            "partition": partition,
            "groundtruth": groundtruth,
            "prediction": predictions
        }
        
        # 8. Calculate MRR
        for query in queries:
            bib_key = query['key']
            true_id = groundtruth.get(bib_key)
            
            if not true_id:
                continue
            
            top_5 = predictions.get(bib_key, [])
            
            if true_id in top_5:
                rank = top_5.index(true_id) + 1
                part_mrr_sum += 1.0 / rank
            
            part_query_count += 1
        
        # 9. Save pred.json
        safe_paper_id = str(paper_id).strip()
        output_dir = os.path.join(SUBMISSION_DIR, safe_paper_id)
        os.makedirs(output_dir, exist_ok=True)
        
        save_path = os.path.join(output_dir, 'pred.json')
        save_json(pred_output, save_path)
    
    # Update Stats
    if part_query_count > 0:
        summary_stats[partition]['mrr'] = part_mrr_sum / part_query_count
        summary_stats[partition]['queries'] = part_query_count
    
    print(f"   ✅ Finished {partition}. MRR: {summary_stats[partition]['mrr']:.4f}")
    print(f"   ⏱️ Partition total: {time.time() - partition_start:.2f}s")
    print("-" * 50)

# Print the timing breakdown after the run finishes
print_timing_stats()


🚀 STARTING NEW INFERENCE PIPELINE (refs.bib + references.json)...

🔵 Processing Partition: [TRAIN]
   ⏱️ Load labels: 0.14s
   ⏱️ Extract groundtruth: 0.01s
   ↳ Found 604 papers. Processing...


   Ranking train: 100%|██████████| 604/604 [1:08:24<00:00,  6.80s/it]


   ✅ Finished train. MRR: 0.9615
   ⏱️ Partition total: 4104.91s
--------------------------------------------------
🔵 Processing Partition: [VALIDATION]
   ⏱️ Load labels: 0.04s
   ⏱️ Extract groundtruth: 0.01s
   ↳ Found 2 papers. Processing...


   Ranking validation: 100%|██████████| 2/2 [00:41<00:00, 20.54s/it]


   ✅ Finished validation. MRR: 0.9657
   ⏱️ Partition total: 41.13s
--------------------------------------------------
🔵 Processing Partition: [TEST]
   ⏱️ Load labels: 0.02s
   ⏱️ Extract groundtruth: 0.00s
   ↳ Found 2 papers. Processing...


   Ranking test: 100%|██████████| 2/2 [00:19<00:00,  9.93s/it]

   ✅ Finished test. MRR: 0.9324
   ⏱️ Partition total: 19.90s
--------------------------------------------------

⏱️  TIMING BREAKDOWN
Phase                     | Total (s)    | Avg/Query (ms)  | %       
-----------------------------------------------------------------
Feature Extraction        |    3823.56s |       117.70ms |   98.8%
TF-IDF Cosine             |      15.27s |         0.47ms |    0.4%
Model Prediction          |      31.21s |         0.96ms |    0.8%
-----------------------------------------------------------------
TOTAL                     |    3870.04s |       119.13ms | 100.0%

📊 Processed 32485 queries





In [8]:
# --- FINAL REPORT ---
print("\n" + "="*40)
print("📊 SUBMISSION GENERATION REPORT")
print("="*40)
print(f"{'Partition':<15} | {'Papers':<10} | {'Queries':<10} | {'MRR':<10}")
print("-" * 50)

for part in PARTITIONS:
    stats = summary_stats[part]
    print(f"{part.upper():<15} | {stats['papers']:<10} | {stats['queries']:<10} | {stats['mrr']:.4f}")

print("="*40)
print(f"\n📁 All pred.json files are ready at: {os.path.abspath(SUBMISSION_DIR)}")


📊 SUBMISSION GENERATION REPORT
Partition       | Papers     | Queries    | MRR       
--------------------------------------------------
TRAIN           | 604        | 13339      | 0.9615
VALIDATION      | 2          | 107        | 0.9657
TEST            | 2          | 85         | 0.9324

📁 All pred.json files are ready at: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\23127011\notebooks\submission_final\23127011


In [11]:
# Variables for summary statistics
PARTITIONS = ['validation', 'test']
summary_stats = {
    'validation': {'papers': 0, 'mrr': 0, 'queries': 0},
    'test': {'papers': 0, 'mrr': 0, 'queries': 0}
}

# Reset timing before running
reset_timing_stats()

print("🚀 STARTING NEW INFERENCE PIPELINE (refs.bib + references.json)...\n")

for partition in PARTITIONS:
    print(f"🔵 Processing Partition: [{partition.upper()}]")
    partition_start = time.time()
    
    # 1. Load labels.json to get the ground truth and paper_ids
    t_load = time.time()
    label_file = os.path.join(DATASET_ROOT, partition, 'labels.json')
    if not os.path.exists(label_file):
        print(f"   ⚠️ Warning: File {label_file} not found. Skipping.")
        continue
        
    labels_data = load_json(label_file)
    if not labels_data:
        print("   ⚠️ Empty labels. Skipping.")
        continue
    print(f"   ⏱️ Load labels: {time.time() - t_load:.2f}s")

    # 2. Extract the ground truth and paper_ids
    t_transform = time.time()
    groundtruth_map = extract_groundtruth_from_labels(labels_data)
    paper_ids = list(groundtruth_map.keys())
    
    summary_stats[partition]['papers'] = len(paper_ids)
    print(f"   ⏱️ Extract groundtruth: {time.time() - t_transform:.2f}s")
    print(f"   ↳ Found {len(paper_ids)} papers. Processing...")
    
    # 3. NEW: Loop over each paper and load refs.bib + references.json
    part_mrr_sum = 0
    part_query_count = 0
    
    for paper_id in tqdm(paper_ids, desc=f"   Ranking {partition}"):
        # Resolve the paper folder path inside data_output_v2
        paper_dir = os.path.join(DATA_FOLDER, paper_id)
        
        if not os.path.exists(paper_dir):
            print(f"   ⚠️ Paper folder not found: {paper_id}")
            continue
        
        # 4. Load QUERIES from refs.bib (TARGET entries)
        bib_path = os.path.join(paper_dir, 'refs.bib')
        queries = load_queries_from_bib(bib_path)
        
        if not queries:
            print(f"   ⚠️ No queries in {paper_id}/refs.bib")
            continue
        
        # 5. Load CANDIDATES from references.json (Ground truth pool)
        ref_json_path = os.path.join(paper_dir, 'references.json')
        candidates = load_candidates_from_references(ref_json_path)
        
        if not candidates:
            print(f"   ⚠️ No candidates in {paper_id}/references.json")
            continue
        
        # 6. Run ranking
        predictions = rank_paper_fast(queries, candidates, model, feature_names)
        
        # 7. Build pred.json output
        groundtruth = groundtruth_map.get(paper_id, {})
        
        pred_output = {
            "partition": partition,
            "groundtruth": groundtruth,
            "prediction": predictions
        }
        
        # 8. Calculate MRR
        for query in queries:
            bib_key = query['key']
            true_id = groundtruth.get(bib_key)
            
            if not true_id:
                print("skip at partition ", partition, bib_key)
                continue
            
            top_5 = predictions.get(bib_key, [])
            
            if true_id in top_5:
                rank = top_5.index(true_id) + 1
                part_mrr_sum += 1.0 / rank
            
            part_query_count += 1
        
        # 9. Save pred.json
        safe_paper_id = str(paper_id).strip()
        output_dir = os.path.join(SUBMISSION_DIR, safe_paper_id)
        os.makedirs(output_dir, exist_ok=True)
        
        save_path = os.path.join(output_dir, 'pred.json')
        save_json(pred_output, save_path)
    
    # Update Stats
    if part_query_count > 0:
        summary_stats[partition]['mrr'] = part_mrr_sum / part_query_count
        summary_stats[partition]['queries'] = part_query_count
    
    print(f"   ✅ Finished {partition}. MRR: {summary_stats[partition]['mrr']:.4f}")
    print(f"   ⏱️ Partition total: {time.time() - partition_start:.2f}s")
    print("-" * 50)

# Print the timing breakdown after the run finishes
print_timing_stats()


🚀 STARTING NEW INFERENCE PIPELINE (refs.bib + references.json)...

🔵 Processing Partition: [VALIDATION]
   ⏱️ Load labels: 0.00s
   ⏱️ Extract groundtruth: 0.00s
   ↳ Found 2 papers. Processing...


   Ranking validation:  50%|█████     | 1/2 [00:10<00:10, 10.19s/it]

skip at partition  validation ref_6
skip at partition  validation ref_8
skip at partition  validation ref_13
skip at partition  validation ref_14
skip at partition  validation ref_17
skip at partition  validation ref_22
skip at partition  validation ref_33
skip at partition  validation ref_35
skip at partition  validation ref_42
skip at partition  validation ref_48
skip at partition  validation ref_49
skip at partition  validation ref_50
skip at partition  validation ref_51
skip at partition  validation ref_52
skip at partition  validation ref_53
skip at partition  validation ref_60
skip at partition  validation ref_61
skip at partition  validation ref_62
skip at partition  validation ref_63
skip at partition  validation ref_75
skip at partition  validation ref_88


   Ranking validation: 100%|██████████| 2/2 [00:15<00:00,  7.69s/it]


skip at partition  validation ref_0
skip at partition  validation ref_1
skip at partition  validation ref_2
skip at partition  validation ref_3
skip at partition  validation ref_4
skip at partition  validation ref_5
skip at partition  validation ref_7
skip at partition  validation ref_8
skip at partition  validation ref_9
skip at partition  validation ref_11
skip at partition  validation ref_12
skip at partition  validation ref_13
skip at partition  validation ref_14
skip at partition  validation ref_15
skip at partition  validation ref_16
skip at partition  validation ref_17
skip at partition  validation ref_18
skip at partition  validation ref_21
skip at partition  validation ref_22
skip at partition  validation ref_23
skip at partition  validation ref_24
skip at partition  validation ref_31
skip at partition  validation ref_34
skip at partition  validation ref_36
skip at partition  validation ref_43
skip at partition  validation ref_46
skip at partition  validation ref_50
skip at pa

   Ranking test:  50%|█████     | 1/2 [00:01<00:01,  1.23s/it]

skip at partition  test ref_9
skip at partition  test ref_10
skip at partition  test ref_11
skip at partition  test ref_13
skip at partition  test ref_14
skip at partition  test ref_15
skip at partition  test ref_17
skip at partition  test ref_26
skip at partition  test ref_29
skip at partition  test ref_30
skip at partition  test ref_32


   Ranking test: 100%|██████████| 2/2 [00:06<00:00,  3.20s/it]

skip at partition  test ref_27
skip at partition  test ref_36
skip at partition  test ref_45
skip at partition  test ref_65
skip at partition  test ref_67
   ✅ Finished test. MRR: 0.9324
   ⏱️ Partition total: 6.41s
--------------------------------------------------

⏱️  TIMING BREAKDOWN
Phase                     | Total (s)    | Avg/Query (ms)  | %       
-----------------------------------------------------------------
Feature Extraction        |      20.67s |        76.86ms |   99.1%
TF-IDF Cosine             |       0.06s |         0.24ms |    0.3%
Model Prediction          |       0.13s |         0.48ms |    0.6%
-----------------------------------------------------------------
TOTAL                     |      20.87s |        77.58ms | 100.0%

📊 Processed 269 queries



