# Milestone 2: Model Evaluation & Ranking Generation
**Objectives:**
1. Read `labels.json` (contains the input BibTeX entries and the ground truth).
2. Restructure the data per paper (paper-based grouping).
3. Use the trained model to rank the reference candidates.
4. Compute the **MRR (Mean Reciprocal Rank)** metric.
5. Export `pred.json` in the required submission format.

**Required inputs:**
* `labels.json`: Test data file.
* `models/best_matcher.pkl`: Trained model.
* `models/feature_names.pkl`: List of the corresponding feature names.
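
For reference, MRR averages the reciprocal of the rank at which the correct candidate appears in each ranked list; a query whose correct candidate is missing from the list contributes 0. A minimal sketch follows. The function name `mean_reciprocal_rank` and the sample IDs are illustrative and not part of `src.ml`.

```python
def mean_reciprocal_rank(ranked_lists, true_ids):
    """Average of 1/rank of the true ID; a miss contributes 0."""
    total = 0.0
    for ranked, true_id in zip(ranked_lists, true_ids):
        if true_id in ranked:
            total += 1.0 / (ranked.index(true_id) + 1)
    return total / len(true_ids) if true_ids else 0.0


# Illustrative data: a hit at rank 1, a hit at rank 2, and a miss.
ranked_lists = [["a", "b"], ["c", "d"], ["e", "f"]]
true_ids = ["a", "d", "z"]
mrr = mean_reciprocal_rank(ranked_lists, true_ids)
print(round(mrr, 4))  # (1 + 0.5 + 0) / 3 = 0.5
```

Because only a truncated list is scored, the value depends on how many candidates are kept per query.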


In [None]:
import pandas as pd
import numpy as np
import json
import os
import sys
import joblib
from tqdm import tqdm
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

# Set up the path so the src module can be imported
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..')))

# Import from the src.ml module
from src.ml import (
    load_json,
    save_json,
    normalize_text_basic,
    get_tokens,
    safe_year_diff,
    compute_pairwise_features,
    compute_tfidf_cosine_single,
    parse_bibtex_smart,
    clean_latex,
    normalize_id,
    transform_to_paper_based
)

pd.set_option('display.max_columns', None)
print("✅ Libraries and src.ml modules imported successfully!")

Libraries imported successfully!


In [65]:
# --- PATH CONFIGURATION ---
# Adjust these paths to match your environment
TEST_FILE_PATH = '../../dataset_final/test/labels.json'           # Test data file
MODEL_PATH = '../../dataset_final/models/best_matcher.pkl'   # Model file (trained in the previous step)
FEATURE_NAME_PATH = '../../dataset_final/models/feature_names.pkl' # Feature list
OUTPUT_DIR = 'submission_output'         # Directory for the pred.json results

if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)
    print(f"Created output directory: {OUTPUT_DIR}")


In [None]:
# The helper functions have been imported from src.ml:
# - normalize_text_basic(text) -> str
# - get_tokens(text_or_list) -> set
# - safe_year_diff(y1, y2) -> int

print("📚 Helper functions imported from src.ml.features:")
print("   - normalize_text_basic")
print("   - get_tokens")  
print("   - safe_year_diff")

In [None]:
# The BibTeX parser has been imported from src.ml.bibtex_parser:
# - parse_bibtex_smart(bib_string) -> dict
# - clean_latex(text) -> str
# - normalize_id(text) -> str

print("📚 BibTeX parser imported from src.ml.bibtex_parser:")
print("   - parse_bibtex_smart: Parser V3 (Title, Authors, Year, ID)")
print("   - clean_latex: Handles LaTeX markup")
print("   - normalize_id: Normalizes DOI/arXiv IDs")

## 1. Feature Engineering
**Important:** The functions below must replicate **exactly** the logic used during the training phase. If they differ, the model will misinterpret the features.
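
One way to guard against a train/test feature mismatch is to build every input row from the saved feature list, so the column order is fixed and missing features are filled explicitly. The sketch below illustrates the idea with plain dictionaries. `align_features` and every feature name except `feat_title_tfidf_cosine` are hypothetical; the real list comes from `feature_names.pkl`.

```python
def align_features(feature_dicts, feature_names, fill_value=0.0):
    """Return rows as lists ordered exactly like feature_names."""
    return [
        [float(feats.get(name, fill_value)) for name in feature_names]
        for feats in feature_dicts
    ]


# Hypothetical feature names, used only for this illustration.
feature_names = ["feat_title_ratio", "feat_year_diff", "feat_title_tfidf_cosine"]
computed = [
    {"feat_title_tfidf_cosine": 0.9, "feat_title_ratio": 0.8},   # year diff missing
    {"feat_year_diff": 1, "feat_title_ratio": 0.2, "extra": 5},  # extra key ignored
]
rows = align_features(computed, feature_names)
print(rows)
```

Filling a missing feature with a default keeps the pipeline running but can hide a real mismatch, so logging which names were filled may be worthwhile.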


In [None]:
# The helper functions are already imported from the module,
# so there is no need to redefine them

def normalize_text(text):
    """Wrapper for normalize_text_basic."""
    return normalize_text_basic(text)

In [None]:
from fuzzywuzzy import fuzz

# Use compute_pairwise_features from the module
# and add the TF-IDF cosine for single pairs

def compute_features_with_tfidf(row):
    """Compute pairwise features, including the TF-IDF cosine."""
    feats = compute_pairwise_features(row)
    
    # Add the TF-IDF cosine for this single pair
    q_tit = normalize_text_basic(row.get('bib_title', ''))
    c_tit = normalize_text_basic(row.get('cand_title', ''))
    feats['feat_title_tfidf_cosine'] = compute_tfidf_cosine_single(q_tit, c_tit)
    
    return feats

print("✅ Feature engineering function ready (using src.ml.features)")

## 2. Load & Transform Data
We need to convert the data from the flat list in `labels.json` into a **paper-based** structure.
* Each `source_paper_id` forms one group.
* The **candidates** of a group are the set of all unique `ground_truth` entries that appear in that paper.
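
The grouping described above can be sketched as follows. This is an illustration only, not the implementation of `transform_to_paper_based`: the function `group_by_paper`, the entry field `bib_key`, and the shape of `ground_truth` (a dict with an `id`) are assumptions.

```python
from collections import defaultdict


def group_by_paper(entries):
    """Group flat entries by paper; candidates are the unique ground truths."""
    papers = defaultdict(lambda: {"queries": [], "candidates": {}})
    for entry in entries:
        group = papers[entry["source_paper_id"]]
        truth = entry["ground_truth"]
        group["queries"].append({"key": entry["bib_key"], "true_id": truth["id"]})
        # Keying by ID deduplicates references cited more than once.
        group["candidates"].setdefault(truth["id"], truth)
    return dict(papers)


# Hypothetical flat entries: two citations of the same reference in p1.
entries = [
    {"source_paper_id": "p1", "bib_key": "k1", "ground_truth": {"id": "x"}},
    {"source_paper_id": "p1", "bib_key": "k2", "ground_truth": {"id": "x"}},
    {"source_paper_id": "p2", "bib_key": "k3", "ground_truth": {"id": "y"}},
]
papers = group_by_paper(entries)
print({pid: (len(g["queries"]), len(g["candidates"])) for pid, g in papers.items()})
```

Every query in a paper is then ranked against that paper's full candidate set, which is why the candidate pool is built per group rather than globally.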


In [None]:
# --- DATA LOADING & TRANSFORMATION ---
try:
    raw_data = load_json(TEST_FILE_PATH)
    print(f"✅ Loaded {len(raw_data)} entries.")
except FileNotFoundError:
    print("❌ Error: File not found.")
    raw_data = []

# Transform into the paper-based structure using the module helper
papers_db = transform_to_paper_based(raw_data, parse_bibtex_smart)

print(f"✅ Data transformed into {len(papers_db)} papers.")

Loaded 812 entries.
✅ Data transformed into 39 papers.


## 3. Load Trained Model
Load the SVM/RandomForest/XGBoost model saved during the training step.


In [71]:
try:
    model = joblib.load(MODEL_PATH)
    feature_names = joblib.load(FEATURE_NAME_PATH)
    print(f"✅ Model loaded: {type(model).__name__}")
    print(f"📝 Expected features: {len(feature_names)}")
except FileNotFoundError:
    print("⚠️ WARNING: Model file not found. Code will run with RANDOM SCORES for demonstration.")
    model = None
    feature_names = []


✅ Model loaded: XGBClassifier
📝 Expected features: 13


## 4. Ranking Pipeline & Prediction
For each paper:
1. Build (query, candidate) pairs for all candidates.
2. Compute the features.
3. Predict the match probability.
4. Take the top 5 candidates with the highest scores.
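
Step 4 amounts to sorting the candidates by predicted score and keeping the first five. A small sketch, assuming the scores have already been computed; `top_k_candidates` and the sample values are illustrative.

```python
def top_k_candidates(candidate_ids, scores, k=5):
    """Return the k candidate IDs with the highest scores, best first."""
    ranked = sorted(zip(candidate_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [cand_id for cand_id, _ in ranked[:k]]


# Illustrative candidates and match probabilities.
candidate_ids = ["c1", "c2", "c3", "c4", "c5", "c6"]
scores = [0.10, 0.95, 0.40, 0.05, 0.70, 0.30]
top_5 = top_k_candidates(candidate_ids, scores)
print(top_5)  # ['c2', 'c5', 'c3', 'c6', 'c1']
```

Python's sort is stable, so candidates with equal scores keep their original relative order.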


In [None]:
global_mrr_sum = 0
global_query_count = 0

print("🚀 Starting Ranking Pipeline...")

for paper_id, data in papers_db.items():
    queries = data['queries']
    candidates_dict = data['candidates']
    candidates_list = list(candidates_dict.values())
    
    if not candidates_list:
        continue

    # Init Output Structure
    submission_data = {
        "partition": "test", 
        "groundtruth": {},
        "prediction": {}
    }
    
    # Loop over each query in the paper
    for query in tqdm(queries, desc=f"Paper {paper_id}", leave=False):
        bib_key = query['key']
        true_id = query['true_id']
        
        # Groundtruth
        submission_data['groundtruth'][bib_key] = true_id
        
        # Pairing & Feature Calc
        pairs = []
        for cand in candidates_list:
            row = {}
            row.update(query)
            row.update(cand)
            pairs.append(row)
            
        # Compute features using the module-based helper
        feats_list = [compute_features_with_tfidf(p) for p in pairs]
        df_feats = pd.DataFrame(feats_list)
        
        # Predict
        scores = []
        if model is not None:
            # Ensure columns exist
            for col in feature_names:
                if col not in df_feats.columns: 
                    df_feats[col] = 0.0
            
            X_input = df_feats[feature_names]
            scores = model.predict_proba(X_input)[:, 1]
        else:
            scores = np.random.rand(len(pairs))
            
        # Ranking
        ranked_candidates = []
        for idx, score in enumerate(scores):
            ranked_candidates.append((candidates_list[idx]['cand_id'], score))
        
        ranked_candidates.sort(key=lambda x: x[1], reverse=True)
        top_5 = [x[0] for x in ranked_candidates[:5]]
        
        # Save Prediction
        submission_data['prediction'][bib_key] = top_5
        
        # Calc MRR
        if true_id in top_5:
            rank = top_5.index(true_id) + 1
            global_mrr_sum += 1.0 / rank
        else:
            global_mrr_sum += 0.0
            
        global_query_count += 1
        
    # Save the JSON output using the module helper
    safe_pid = str(paper_id).replace('/', '_')
    save_path = os.path.join(OUTPUT_DIR, f"{safe_pid}_pred.json")
    save_json(submission_data, save_path)

print("✅ Ranking completed for all papers.")

🚀 Starting Ranking Pipeline...

✅ Ranking completed for all papers.




## 5. Evaluation Report (Remarks)


In [73]:
final_mrr = global_mrr_sum / global_query_count if global_query_count > 0 else 0

print("="*40)
print("📊 FINAL EVALUATION REPORT")
print("="*40)
print(f"Papers Processed: {len(papers_db)}")
print(f"Total Queries:    {global_query_count}")
print(f"Metric MRR:       {final_mrr:.4f}")
print("="*40)

if final_mrr > 0.8:
    print("🌟 Remark: Excellent. The model ranks references very accurately.")
elif final_mrr > 0.5:
    print("👍 Remark: Good. The model frequently places the correct candidate in the top 5.")
else:
    print("⚠️ Remark: Needs improvement. Re-check the features or the model.")

print(f"\n📁 Result files (pred.json) saved at: {os.path.abspath(OUTPUT_DIR)}")


📊 FINAL EVALUATION REPORT
Papers Processed: 39
Total Queries:    812
Metric MRR:       0.8349
🌟 Remark: Excellent. The model ranks references very accurately.

📁 Result files (pred.json) saved at: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\23127011\notebooks\submission_output
