# Notebook 06: Full Inference for Submission
**Goals:**
1. Run the prediction pipeline on **ALL** dataset partitions: Train, Validation, and Test.
2. Generate a `pred.json` file for each paper.
3. Arrange the results in the directory structure required for submission (Submission Ready).

**Model:** Decision Tree/XGBoost with **6 selected features**

**Output structure:**
```bash
submission_final/
├── <Student_ID>/
│   ├── <paper_id_1>/
│   │   └── pred.json
│   ├── <paper_id_2>/
│   │   └── pred.json
...
```
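
The layout above can be checked mechanically once the files have been written. The following is a minimal sketch, not part of the original pipeline; `find_missing_predictions` is a hypothetical helper name, and it assumes every sub-folder of the student directory is a paper folder.

```python
import json
import os


def find_missing_predictions(student_dir):
    """Return the paper folders under student_dir that lack a readable pred.json."""
    missing = []
    for name in sorted(os.listdir(student_dir)):
        paper_dir = os.path.join(student_dir, name)
        if not os.path.isdir(paper_dir):
            continue  # ignore stray files next to the paper folders
        pred_path = os.path.join(paper_dir, "pred.json")
        try:
            with open(pred_path, encoding="utf-8") as fh:
                json.load(fh)  # also catches files that are not valid JSON
        except (OSError, json.JSONDecodeError):
            missing.append(name)
    return missing
```

Calling `find_missing_predictions("submission_final/<Student_ID>")` after the run should return an empty list if every paper folder received a valid `pred.json`.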

In [17]:
import pandas as pd
import numpy as np
import json
import os
import sys
import joblib
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set up the path so the src module can be imported
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..')))

# Import from the src.ml module (reuse the refactored functions)
from src.ml import (
    load_json,
    save_json,
    normalize_text_basic,
    compute_pairwise_features,
    compute_tfidf_cosine_single,
    parse_bibtex_smart,
    transform_to_paper_based
)

pd.set_option('display.max_columns', None)
print("✅ Libraries and src.ml modules imported successfully!")

# --- LIST OF THE 6 SELECTED FEATURES (must match Notebook 03/04) ---
SELECTED_FEATURES = [
    'feat_title_tfidf_cosine',
    'feat_title_len_diff',
    'feat_auth_jaccard',
    'feat_year_match',
    'feat_id_match',
    'feat_first_auth_match',
]
print(f"📝 Using {len(SELECTED_FEATURES)} selected features")

✅ Libraries and src.ml modules imported successfully!
📝 Using 6 selected features


In [18]:
# --- IMPORTANT CONFIGURATION ---

# 1. Student information (used to create the submission folder)
STUDENT_ID = "23127011"

# 2. Dataset paths
DATASET_ROOT = '../../dataset_final'
PARTITIONS = ['train', 'validation', 'test']

# 3. Model paths
MODEL_PATH = '../../dataset_final/models/best_matcher.pkl'
FEATURE_NAME_PATH = '../../dataset_final/models/feature_names.pkl'

# 4. Output directory
SUBMISSION_DIR = f'submission_final/{STUDENT_ID}'

if not os.path.exists(SUBMISSION_DIR):
    os.makedirs(SUBMISSION_DIR)
    print(f"📁 Created submission directory: {os.path.abspath(SUBMISSION_DIR)}")
else:
    print(f"📁 Saving to existing directory: {os.path.abspath(SUBMISSION_DIR)}")


📁 Saving to existing directory: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\23127011\notebooks\submission_final\23127011


In [19]:
# --- OPTIMIZED HELPER FUNCTIONS WITH TIMING ---
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import time

# Global timing stats
TIMING_STATS = {
    'feature_compute': 0.0,
    'tfidf_compute': 0.0,
    'model_predict': 0.0,
    'total_queries': 0
}

def batch_compute_features(pairs_list):
    """Compute pairwise features for each row in a batch (plain per-row loop, not vectorized)."""
    # An empty or None input yields an empty list.
    if not pairs_list:
        return []
    return [compute_pairwise_features(row) for row in pairs_list]


def rank_paper_fast(queries, candidates_list, model, feature_names):
    """
    Rank all queries of one paper in a single pass.

    Key optimization:
    - Pre-compute the TF-IDF vectors for the candidates ONCE
    - Reuse them for every query
    """
    global TIMING_STATS
    
    if not candidates_list:
        return {}
    
    predictions = {}
    
    # Pre-normalize titles once
    cand_titles = [normalize_text_basic(c.get('cand_title', '')) for c in candidates_list]
    cand_ids = [c['cand_id'] for c in candidates_list]
    query_titles = [normalize_text_basic(q.get('bib_title', '')) for q in queries]
    
    # === PRE-COMPUTE TF-IDF FOR THE WHOLE PAPER (EXACTLY ONCE) ===
    t_tfidf_start = time.time()

    # Merge all unique titles (queries + candidates)
    all_titles = list(set(query_titles + cand_titles))

    try:
        # Fit the vectorizer once for this paper
        vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4), min_df=1)
        vectorizer.fit(all_titles)

        # Transform the candidates once
        cand_vecs = vectorizer.transform(cand_titles)

        # Transform the queries once
        query_vecs = vectorizer.transform(query_titles)

        # Pre-compute all TF-IDF scores (queries x candidates)
        # Shape: (n_queries, n_candidates)
        tfidf_matrix = cosine_similarity(query_vecs, cand_vecs)

    except Exception:
        # Fallback on error: use all-zero similarities
        tfidf_matrix = np.zeros((len(queries), len(candidates_list)))

    TIMING_STATS['tfidf_compute'] += time.time() - t_tfidf_start

    # === PROCESS EACH QUERY (TF-IDF scores are already available) ===
    for q_idx, query in enumerate(queries):
        bib_key = query['key']
        
        # Build one merged row per (query, candidate) pair
        pairs = [{**query, **cand} for cand in candidates_list]

        # --- PHASE 1: Feature computation (WITHOUT TF-IDF) ---
        t0 = time.time()
        feats_list = batch_compute_features(pairs)
        TIMING_STATS['feature_compute'] += time.time() - t0

        # Add the TF-IDF score from the pre-computed matrix
        for c_idx, feats in enumerate(feats_list):
            feats['feat_title_tfidf_cosine'] = tfidf_matrix[q_idx, c_idx]

        # Create the DataFrame used for prediction
        df_feats = pd.DataFrame(feats_list)

        # Fill any missing feature columns with 0.0
        for col in feature_names:
            if col not in df_feats.columns:
                df_feats[col] = 0.0

        X_input = df_feats[feature_names].values

        # --- PHASE 2: Model Predict ---
        t2 = time.time()
        if model is not None:
            scores = model.predict_proba(X_input)[:, 1]
        else:
            # No model loaded: fall back to random scores (the ranking is then meaningless)
            scores = np.random.rand(len(pairs))
        TIMING_STATS['model_predict'] += time.time() - t2

        # Rank candidates by descending score and keep the top 5
        ranked_idx = np.argsort(scores)[::-1][:5]
        top_5 = [cand_ids[i] for i in ranked_idx]
        
        predictions[bib_key] = top_5
        TIMING_STATS['total_queries'] += 1
    
    return predictions


def print_timing_stats():
    """Print the timing statistics for each phase."""
    total = TIMING_STATS['feature_compute'] + TIMING_STATS['tfidf_compute'] + TIMING_STATS['model_predict']
    n_queries = max(TIMING_STATS['total_queries'], 1)
    
    print("\n" + "="*50)
    print("⏱️  TIMING BREAKDOWN")
    print("="*50)
    print(f"{'Phase':<25} | {'Total (s)':<12} | {'Avg/Query (ms)':<15} | {'%':<8}")
    print("-"*65)
    
    for phase, label in [
        ('feature_compute', 'Feature Extraction'),
        ('tfidf_compute', 'TF-IDF Cosine'),
        ('model_predict', 'Model Prediction')
    ]:
        t = TIMING_STATS[phase]
        pct = (t / total * 100) if total > 0 else 0
        avg_ms = (t / n_queries) * 1000
        print(f"{label:<25} | {t:>10.2f}s | {avg_ms:>12.2f}ms | {pct:>6.1f}%")
    
    print("-"*65)
    print(f"{'TOTAL':<25} | {total:>10.2f}s | {(total/n_queries)*1000:>12.2f}ms | 100.0%")
    print(f"\n📊 Processed {n_queries} queries")
    print("="*50)


def reset_timing_stats():
    """Reset timing stats."""
    global TIMING_STATS
    TIMING_STATS = {
        'feature_compute': 0.0,
        'tfidf_compute': 0.0,
        'model_predict': 0.0,
        'total_queries': 0
    }


print("✅ SUPER Optimized functions ready (TF-IDF pre-computed per paper).")

✅ SUPER Optimized functions ready (TF-IDF pre-computed per paper).
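
The per-paper pre-computation used by `rank_paper_fast` can be illustrated in isolation. The following is a minimal sketch with made-up titles, assuming scikit-learn is installed; `title_similarity_matrix` is a hypothetical helper name and is not part of `src.ml`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def title_similarity_matrix(query_titles, cand_titles):
    """Return a (n_queries, n_candidates) matrix of TF-IDF cosine similarities."""
    # Fit once on every unique title, then transform each side once.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), min_df=1)
    vectorizer.fit(sorted(set(query_titles + cand_titles)))
    return cosine_similarity(
        vectorizer.transform(query_titles),
        vectorizer.transform(cand_titles),
    )


# Made-up titles, for illustration only.
queries = ["attention is all you need"]
candidates = [
    "attention is all you need",
    "deep residual learning for image recognition",
]
sim = title_similarity_matrix(queries, candidates)
print(sim.shape)        # (1, 2)
print(sim[0].argmax())  # 0
```

Each row of the matrix holds the scores of one query against every candidate, which is how the loop can read `tfidf_matrix[q_idx, c_idx]` without refitting the vectorizer per query.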


In [20]:
# --- LOAD MODEL ---
try:
    model = joblib.load(MODEL_PATH)
    feature_names = joblib.load(FEATURE_NAME_PATH)
    print(f"✅ Model loaded: {type(model).__name__}")
    print(f"📝 Expected features ({len(feature_names)}): {feature_names}")
    
    # Verify features match SELECTED_FEATURES
    if set(feature_names) == set(SELECTED_FEATURES):
        print("✅ Features match SELECTED_FEATURES")
    else:
        print("⚠️ WARNING: Features do not match! Using SELECTED_FEATURES")
        feature_names = SELECTED_FEATURES
        
except FileNotFoundError:
    # With model = None, rank_paper_fast falls back to random scores.
    print("❌ CRITICAL ERROR: Model file not found. Cannot proceed.")
    model = None
    feature_names = SELECTED_FEATURES

✅ Model loaded: Pipeline
📝 Expected features (6): ['feat_title_tfidf_cosine', 'feat_title_len_diff', 'feat_auth_jaccard', 'feat_year_match', 'feat_id_match', 'feat_first_auth_match']
✅ Features match SELECTED_FEATURES
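
The check in the cell above compares sets, so it ignores column order. That is harmless while the loaded `feature_names` list is kept, but the fallback to `SELECTED_FEATURES` relies on that list being in the order the model was trained with, which is assumed rather than verified here. A minimal sketch of an order-aware comparison follows; `compare_feature_names` is a hypothetical helper, not part of the notebook.

```python
def compare_feature_names(loaded, expected):
    """Classify two feature-name lists as 'identical', 'reordered', or 'different'."""
    if list(loaded) == list(expected):
        return "identical"
    if set(loaded) == set(expected):
        return "reordered"  # same names, but the column order differs
    return "different"


expected = ["feat_title_tfidf_cosine", "feat_year_match"]
print(compare_feature_names(["feat_title_tfidf_cosine", "feat_year_match"], expected))  # identical
print(compare_feature_names(["feat_year_match", "feat_title_tfidf_cosine"], expected))  # reordered
print(compare_feature_names(["feat_year_match"], expected))                             # different
```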



## Main Inference Loop
This loop will:
1. Iterate over each partition (Train -> Validation -> Test).
2. Load the `labels.json` of that partition.
3. Transform it into the paper-based structure.
4. Run the ranking model.
5. Save a `pred.json` file in the folder matching each Paper ID.
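
The loop also reports a mean reciprocal rank (MRR) over the top-5 predictions of each partition. The following is a minimal standalone sketch of that metric on a `pred.json`-shaped dictionary; the keys and IDs are made up, and `mrr_at_k` is a hypothetical helper name.

```python
def mrr_at_k(groundtruth, prediction, k=5):
    """Mean reciprocal rank over all groundtruth keys; a miss contributes 0."""
    if not groundtruth:
        return 0.0
    total = 0.0
    for key, true_id in groundtruth.items():
        ranked = prediction.get(key, [])[:k]
        if true_id in ranked:
            total += 1.0 / (ranked.index(true_id) + 1)
    return total / len(groundtruth)


# Made-up example in the same shape as a pred.json file.
pred = {
    "partition": "test",
    "groundtruth": {"ref_a": "id_1", "ref_b": "id_2", "ref_c": "id_3"},
    "prediction": {
        "ref_a": ["id_1", "id_9"],  # true ID at rank 1 -> 1.0
        "ref_b": ["id_7", "id_2"],  # true ID at rank 2 -> 0.5
        "ref_c": ["id_8", "id_9"],  # true ID missing   -> 0.0
    },
}
print(mrr_at_k(pred["groundtruth"], pred["prediction"]))  # 0.5
```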


In [21]:
# Variables for the summary statistics
summary_stats = {
    'train': {'papers': 0, 'mrr': 0, 'queries': 0},
    'validation': {'papers': 0, 'mrr': 0, 'queries': 0},
    'test': {'papers': 0, 'mrr': 0, 'queries': 0}
}

# Reset timing before the run
reset_timing_stats()

print("🚀 STARTING OPTIMIZED INFERENCE PIPELINE...\n")

for partition in PARTITIONS:
    print(f"🔵 Processing Partition: [{partition.upper()}]")
    partition_start = time.time()
    
    # 1. Load Data
    t_load = time.time()
    label_file = os.path.join(DATASET_ROOT, partition, 'labels.json')
    if not os.path.exists(label_file):
        print(f"   ⚠️ Warning: File {label_file} not found. Skipping.")
        continue
        
    raw_data = load_json(label_file)
    if not raw_data:
        print("   ⚠️ Empty data. Skipping.")
        continue
    print(f"   ⏱️ Load data: {time.time() - t_load:.2f}s")

    # 2. Transform Data
    t_transform = time.time()
    papers_db = transform_to_paper_based(raw_data, parse_bibtex_smart)
    summary_stats[partition]['papers'] = len(papers_db)
    print(f"   ⏱️ Transform data: {time.time() - t_transform:.2f}s")
    
    print(f"   ↳ Found {len(papers_db)} papers. Ranking now...")
    
    # 3. OPTIMIZED Ranking Loop
    part_mrr_sum = 0
    part_query_count = 0
    
    for paper_id, data in tqdm(papers_db.items(), desc=f"   Ranking {partition}"):
        queries = data['queries']
        candidates_dict = data['candidates']
        candidates_list = list(candidates_dict.values())
        
        if not candidates_list or not queries:
            continue
        
        # --- FAST RANKING ---
        predictions = rank_paper_fast(queries, candidates_list, model, feature_names)
        
        # Build Output
        pred_output = {
            "partition": partition,
            "groundtruth": {q['key']: q['true_id'] for q in queries},
            "prediction": predictions
        }
        
        # MRR Calculation
        for query in queries:
            bib_key = query['key']
            true_id = query['true_id']
            top_5 = predictions.get(bib_key, [])
            
            if true_id in top_5:
                rank = top_5.index(true_id) + 1
                part_mrr_sum += 1.0 / rank
            part_query_count += 1
        
        # --- SAVE pred.json ---
        safe_paper_id = str(paper_id).strip()
        paper_dir = os.path.join(SUBMISSION_DIR, safe_paper_id)
        os.makedirs(paper_dir, exist_ok=True)
        
        save_path = os.path.join(paper_dir, 'pred.json')
        save_json(pred_output, save_path)
    
    # Update Stats
    if part_query_count > 0:
        summary_stats[partition]['mrr'] = part_mrr_sum / part_query_count
        summary_stats[partition]['queries'] = part_query_count
    
    print(f"   ✅ Finished {partition}. MRR: {summary_stats[partition]['mrr']:.4f}")
    print(f"   ⏱️ Partition total: {time.time() - partition_start:.2f}s")
    print("-" * 50)

# Print the timing breakdown once the run has finished
print_timing_stats()

🚀 STARTING OPTIMIZED INFERENCE PIPELINE...

🔵 Processing Partition: [TRAIN]
   ⏱️ Load data: 0.25s
   ⏱️ Transform data: 290.45s
   ↳ Found 605 papers. Ranking now...

   Ranking train: 100%|██████████| 605/605 [39:16<00:00,  3.90s/it]

   ✅ Finished train. MRR: 0.9973
   ⏱️ Partition total: 2647.71s
--------------------------------------------------
🔵 Processing Partition: [VALIDATION]
   ⏱️ Load data: 0.02s
   ⏱️ Transform data: 1.12s
   ↳ Found 1 papers. Ranking now...

   Ranking validation: 100%|██████████| 1/1 [00:07<00:00,  7.19s/it]

   ✅ Finished validation. MRR: 1.0000
   ⏱️ Partition total: 8.33s
--------------------------------------------------
🔵 Processing Partition: [TEST]
   ⏱️ Load data: 0.00s
   ⏱️ Transform data: 0.14s
   ↳ Found 1 papers. Ranking now...

   Ranking test: 100%|██████████| 1/1 [00:00<00:00,  1.40it/s]

   ✅ Finished test. MRR: 1.0000
   ⏱️ Partition total: 0.85s
--------------------------------------------------

⏱️  TIMING BREAKDOWN
Phase                     | Total (s)    | Avg/Query (ms)  | %       
-----------------------------------------------------------------
Feature Extraction        |    2310.78s |       171.91ms |   99.0%
TF-IDF Cosine             |      10.90s |         0.81ms |    0.5%
Model Prediction          |      12.61s |         0.94ms |    0.5%
-----------------------------------------------------------------
TOTAL                     |    2334.29s |       173.66ms | 100.0%

📊 Processed 13442 queries





In [22]:
# --- FINAL REPORT ---
print("\n" + "="*40)
print("📊 SUBMISSION GENERATION REPORT")
print("="*40)
print(f"{'Partition':<15} | {'Papers':<10} | {'Queries':<10} | {'MRR':<10}")
print("-" * 50)

for part in PARTITIONS:
    stats = summary_stats[part]
    print(f"{part.upper():<15} | {stats['papers']:<10} | {stats['queries']:<10} | {stats['mrr']:.4f}")

print("="*40)
print(f"\n📁 All pred.json files are ready at: {os.path.abspath(SUBMISSION_DIR)}")


📊 SUBMISSION GENERATION REPORT
Partition       | Papers     | Queries    | MRR       
--------------------------------------------------
TRAIN           | 605        | 13348      | 0.9973
VALIDATION      | 1          | 72         | 1.0000
TEST            | 1          | 22         | 1.0000

📁 All pred.json files are ready at: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\23127011\notebooks\submission_final\23127011
