# Notebook 02: Data Augmentation (Negative Sampling)

**Goal:** Generate training data for the Machine Learning model (Pairwise Ranking).

To train the model, we need labeled pairs of the form `(Query, Candidate, Label)`:
* **Query:** Information extracted from BibTeX (already cleaned).
* **Candidate:** Information from the Ground Truth (Metadata).
* **Label:** 1 (Match) or 0 (Non-match).

**Negative sampling strategy:**
1.  **Local Negatives (Hard):** Pick wrong metadata from *within the same paper* (same `paper_id`). For example, BibTeX ref #1 is paired with GT ref #2 of that same paper. These samples are hard because the references share a topic.
2.  **Global Negatives (Easy):** Pick wrong metadata from *any other paper*. This helps the model learn clear-cut differences (different year, different authors, and so on).
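
The difference between the two strategies can be illustrated on toy data. This is a minimal sketch, not the `src.ml` implementation: the records are made up, and `split_wrong_candidates` is a hypothetical helper introduced only for this example.

```python
# Toy ground-truth records as (paper_id, gt_id); values are made up for illustration.
ground_truth = [
    ("P1", "a"), ("P1", "b"), ("P1", "c"),
    ("P2", "d"), ("P2", "e"),
]

def split_wrong_candidates(records, paper_id, true_gt_id):
    """Return (hard, easy) wrong candidates for one query (hypothetical helper)."""
    # Hard: wrong references cited by the same paper as the query
    hard = [g for p, g in records if p == paper_id and g != true_gt_id]
    # Easy: wrong references cited by any other paper
    easy = [g for p, g in records if p != paper_id and g != true_gt_id]
    return hard, easy

hard, easy = split_wrong_candidates(ground_truth, paper_id="P1", true_gt_id="a")
print("hard:", hard)  # hard: ['b', 'c']
print("easy:", easy)  # easy: ['d', 'e']
```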



In [109]:
import pandas as pd
import numpy as np
import random
import os
import sys

# Set up the path so the src module can be imported
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..')))

# Import from the src.ml module
from src.ml import (
    load_pickle,
    save_pickle,
    augment_dataset,
    build_candidate_pools,
    split_by_partition
)

# --- CONFIGURATION ---
INPUT_FILE = '../../dataset_final/clean_data/cleaned_data.pkl'
OUTPUT_FILE = '../../dataset_final/clean_data/train_augmented.pkl'

# Negative/Positive ratio
NUM_NEGATIVES = 4  # negative pairs per positive pair
NUM_HARD_NEGATIVES = 2  # hard (same-paper) negatives requested per query

RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

print("✅ Modules imported successfully from src.ml!")

✅ Modules imported successfully from src.ml!



## 1. Load the cleaned data
We load the `.pkl` file produced by Notebook 01 to reuse the `clean_title`, `clean_authors`, and `clean_id` fields.


In [110]:
# Load the cleaned data using the module's helper
if not os.path.exists(INPUT_FILE):
    raise FileNotFoundError(f"❌ File not found: {INPUT_FILE}. Run Notebook 01 first.")

df_all = load_pickle(INPUT_FILE)

# Use both train and validation rows for augmentation
df_train_src = df_all[df_all['partition'].isin(['train', 'validation'])].copy()

print(f"Total original samples (Train + Validation): {len(df_train_src)}")
print("Available columns:", df_train_src.columns.tolist())

Total original samples (Train + Validation): 13455
Available columns: ['partition', 'source_file', 'key', 'paper_id', 'bib_content', 'clean_title', 'clean_authors', 'clean_id', 'clean_year', 'parse_method', 'gt_id', 'gt_title', 'gt_authors', 'gt_year']



## 2. Build the Candidate Pools
We need a pool of candidates (Ground Truth records) to sample from.
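
The internals of `build_candidate_pools` are not shown in this notebook. Since it returns a flat global list plus a per-paper mapping, one plausible shape is sketched below. `build_pools_sketch` and the record layout are assumptions made for illustration only.

```python
from collections import defaultdict

def build_pools_sketch(rows):
    """Build a global candidate list and a per-paper (local) pool.

    Hypothetical stand-in for build_candidate_pools. `rows` is an iterable of
    dicts that each carry at least `paper_id` and `gt_id`.
    """
    global_candidates = []
    local_pool = defaultdict(list)
    for row in rows:
        global_candidates.append(row)             # every record is a global candidate
        local_pool[row["paper_id"]].append(row)   # grouped by the citing paper
    return global_candidates, dict(local_pool)

# Made-up records: paper P1 cites two references, paper P2 cites one.
rows = [
    {"paper_id": "P1", "gt_id": "a", "gt_title": "Title A"},
    {"paper_id": "P1", "gt_id": "b", "gt_title": "Title B"},
    {"paper_id": "P2", "gt_id": "c", "gt_title": "Title C"},
]
global_candidates, local_pool = build_pools_sketch(rows)
print(len(global_candidates))  # 3
print(len(local_pool))         # 2
```

Under this layout the global pool has one entry per source row and the local pool has one key per paper, which is consistent with the sizes the notebook reports (13455 and 610).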


In [111]:
# Use build_candidate_pools from the module
global_candidates, local_pool = build_candidate_pools(df_train_src)

print(f"Global Pool Size: {len(global_candidates)}")
print(f"Number of Papers (Local Pools): {len(local_pool)}")

Global Pool Size: 13455
Number of Papers (Local Pools): 610


## 3. Run Negative Sampling
Procedure for each BibTeX row (Query):
1.  Create 1 **Positive** pair (the row with its own Ground Truth).
2.  Create k **Hard Negative** pairs (drawn from the `local_pool` of the same `paper_id`); here k is `NUM_HARD_NEGATIVES`.
3.  Create m **Easy Negative** pairs (drawn at random from `global_candidates` with a different `paper_id`), so that k + m equals `NUM_NEGATIVES`.
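
The three steps above can be sketched in plain Python. This is a hedged illustration, not the code inside `augment_dataset`: `augment_sketch`, the `(key, gt_id, label)` tuple layout, and the rule of filling leftover slots with easy negatives are all assumptions.

```python
import random

def augment_sketch(rows, num_negatives=4, num_hard_negatives=2, seed=42):
    """Return (query_key, cand_gt_id, label) tuples (hypothetical helper)."""
    rng = random.Random(seed)
    pairs = []
    for query in rows:
        # 1. Positive pair: the query with its own ground truth
        pairs.append((query["key"], query["gt_id"], 1))
        # 2. Hard negatives: other references cited by the same paper
        local = [r for r in rows
                 if r["paper_id"] == query["paper_id"] and r["gt_id"] != query["gt_id"]]
        hard = rng.sample(local, min(num_hard_negatives, len(local)))
        # 3. Easy negatives: fill the remaining slots from other papers
        #    (assumed rule; the true ground truth is excluded in case it is cited elsewhere)
        others = [r for r in rows
                  if r["paper_id"] != query["paper_id"] and r["gt_id"] != query["gt_id"]]
        easy = rng.sample(others, min(num_negatives - len(hard), len(others)))
        pairs.extend((query["key"], r["gt_id"], 0) for r in hard + easy)
    return pairs

# Made-up data: 3 papers with 3 references each
rows = [{"key": f"ref{p}{i}", "paper_id": f"P{p}", "gt_id": f"gt{p}{i}"}
        for p in range(3) for i in range(3)]
pairs = augment_sketch(rows)
positives = sum(label for _, _, label in pairs)
print(len(pairs), positives, len(pairs) - positives)  # 45 9 36
```

Each query yields 1 + 4 pairs here. With the notebook's settings the same arithmetic gives 13455 × (1 + 4) = 67275 rows, which matches the counts reported in the output.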


In [112]:
# Use augment_dataset from the module
# It already contains all the logic for generating Positive and Negative samples

df_augmented = augment_dataset(
    df_source=df_train_src,
    num_negatives=NUM_NEGATIVES,
    num_hard_negatives=NUM_HARD_NEGATIVES,
    random_seed=RANDOM_SEED,
    verbose=True
)

📊 Global Pool Size: 13455
📊 Number of Papers: 610

🚀 Generating samples: 100%|██████████| 13455/13455 [00:00<00:00, 21628.23it/s]

✅ Augmentation completed!
   - Original samples: 13455
   - Augmented samples: 67275
   - Label distribution:
label
0    53820
1    13455
Name: count, dtype: int64


## 4. Save the results
Save the DataFrame for use in the Feature Engineering step.


In [113]:
# Summarize the result, then save it
print("\n--- DATA AUGMENTATION RESULTS ---")
print(f"Original sample count: {len(df_train_src)}")
print(f"Sample count after augmentation: {len(df_augmented)}")
print(f"Label counts (0 = negative, 1 = positive):\n{df_augmented['label'].value_counts()}")

# Save as a pickle file
df_augmented.to_pickle(OUTPUT_FILE)

print(f"\n✅ Saved the training dataset to: {os.path.abspath(OUTPUT_FILE)}")
print("👉 NEXT STEP: Run '03_feature_engineering.ipynb'")


--- DATA AUGMENTATION RESULTS ---
Original sample count: 13455
Sample count after augmentation: 67275
Label counts (0 = negative, 1 = positive):
label
0    53820
1    13455
Name: count, dtype: int64

✅ Saved the training dataset to: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\dataset_final\clean_data\train_augmented.pkl
👉 NEXT STEP: Run '03_feature_engineering.ipynb'


## 5. Inspect sample data (Sanity Check)
Inspect one positive pair and one negative pair (the first label-0 row, which is not necessarily a Hard Negative).
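
Beyond eyeballing single rows, a few invariants can be asserted programmatically. The sketch below runs on made-up `(query_key, cand_id, label)` tuples. `check_pairs` is a hypothetical helper, and the tuple layout is an assumption rather than the actual column layout of `df_augmented`.

```python
from collections import Counter

def check_pairs(pairs, num_negatives):
    """Raise AssertionError if the pair list violates basic invariants."""
    labels = Counter(label for _, _, label in pairs)
    # Negatives should outnumber positives by exactly num_negatives to one
    assert labels[0] == num_negatives * labels[1], "unexpected negative/positive ratio"
    # Every query that has a positive should have exactly one
    positives = Counter(q for q, _, label in pairs if label == 1)
    assert all(n == 1 for n in positives.values()), "query with multiple positives"
    # No (query, candidate) pair may repeat, so a candidate is never
    # both the positive and a negative for the same query
    assert len({(q, c) for q, c, _ in pairs}) == len(pairs), "duplicate pair"
    return labels

# Made-up pairs: two queries, each with one positive and two negatives
pairs = [
    ("q1", "a", 1), ("q1", "b", 0), ("q1", "c", 0),
    ("q2", "b", 1), ("q2", "a", 0), ("q2", "c", 0),
]
labels = check_pairs(pairs, num_negatives=2)
print(labels[1], labels[0])  # 2 4
```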


In [114]:

# Take one positive sample
print("--- POSITIVE SAMPLE ---")
pos_sample = df_augmented[df_augmented['label'] == 1].iloc[0]
print(f"Bib Title:  {pos_sample['bib_title']}")
print(f"Cand Title: {pos_sample['cand_title']}")
print(f"Match ID:   {pos_sample['bib_id']} == {pos_sample['cand_id']}")

print("\n--- NEGATIVE SAMPLE (check that the titles differ) ---")
neg_sample = df_augmented[df_augmented['label'] == 0].iloc[0]
print(f"Bib Title:  {neg_sample['bib_title']}")
print(f"Cand Title: {neg_sample['cand_title']}")
print(f"Label:      {neg_sample['label']}")

--- POSITIVE SAMPLE ---
Bib Title:  Ibrnet: Learning multi-view image-based rendering
Cand Title: IBRNet: Learning Multi-View Image-Based Rendering
Match ID:    == 2102-13090

--- NEGATIVE SAMPLE (check that the titles differ) ---
Bib Title:  Graph States as a Resource for Quantum Metrology
Cand Title: Emergent Bartering Behaviour in Multi-Agent Reinforcement Learning
Label:      0


In [115]:
# Save as JSON for manual inspection

import json
with open('sample_positive.json', 'w', encoding='utf-8') as f:
    json.dump(pos_sample.to_dict(), f, ensure_ascii=False, indent=4)