# Notebook 02: Data Augmentation (Negative Sampling)

**Goal:** Generate training data for the machine learning model (pairwise ranking).

To train the model, we need to build `(Query, Candidate, Label)` triples:
* **Query:** Information extracted from BibTeX (cleaned).
* **Candidate:** Information from the ground truth (metadata).
* **Label:** 1 (match) or 0 (non-match).

**Negative sampling strategy:**
1.  **Local Negatives (Hard):** Pick incorrect metadata from *the same paper* (same `paper_id`). For example, BibTeX ref #1 is paired with GT ref #2 of that same paper. These samples are hard because the references share a topic.
2.  **Global Negatives (Easy):** Pick incorrect metadata from *any other paper*. These help the model learn clear-cut differences (different year, different authors, and so on).
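
As a minimal illustration of the two negative types, the sketch below labels every candidate in a toy pool relative to one query. The records and the helper name `classify_candidate` are hypothetical; only the field names `gt_id` and `paper_id` come from the dataset.

```python
# Toy ground-truth pool; the values are made up for illustration only.
toy_pool = [
    {"gt_id": "A1", "paper_id": "P1"},
    {"gt_id": "A2", "paper_id": "P1"},
    {"gt_id": "B1", "paper_id": "P2"},
]

def classify_candidate(query_gt_id, query_paper_id, cand):
    """Return 'positive', 'hard_negative', or 'easy_negative' for one candidate."""
    if cand["gt_id"] == query_gt_id:
        return "positive"
    if cand["paper_id"] == query_paper_id:
        return "hard_negative"  # wrong reference, but cited by the same paper
    return "easy_negative"      # wrong reference from a different paper

# The query is the BibTeX entry whose true match is A1, cited in paper P1.
kinds = {c["gt_id"]: classify_candidate("A1", "P1", c) for c in toy_pool}
print(kinds)
```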



In [25]:

import pandas as pd
import numpy as np
import random
import os
from tqdm import tqdm

# --- CONFIGURATION ---
INPUT_FILE = '../../dataset_final/clean_data/cleaned_data.pkl' # PKL file produced by Notebook 01
OUTPUT_FILE = '../../dataset_final/clean_data/train_augmented.pkl' # Saved as PKL for fast loading

# Negative-to-positive ratio:
# how many negatives accompany each positive?
NUM_NEGATIVES = 4
# Of those, how many are hard negatives (same paper)?
NUM_HARD_NEGATIVES = 2
# The remainder are easy negatives (random references from other papers)

RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)



## 1. Load the Cleaned Data
We load the `.pkl` file produced by Notebook 01 to reuse the `clean_title`, `clean_authors`, and `clean_id` fields.


In [26]:

if not os.path.exists(INPUT_FILE):
    raise FileNotFoundError(f"❌ File not found: {INPUT_FILE}. Run Notebook 01 first.")

df_all = pd.read_pickle(INPUT_FILE)

# Augment only the train partition
df_train_src = df_all[df_all['partition'] == 'train'].copy()

print(f"Total source samples (partition = train): {len(df_train_src)}")
print("Available columns:", df_train_src.columns.tolist())


Total source samples (partition = train): 6720
Available columns: ['partition', 'source_type', 'key', 'paper_id', 'bib_content', 'gt_id', 'gt_title', 'gt_authors', 'clean_title', 'clean_authors', 'clean_id', 'clean_year', 'parse_method', 'gt_year']



## 2. Build Candidate Pools
We need pools of candidates (ground-truth records) to sample from.
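
The local pool is a mapping from `paper_id` to the list of ground-truth records cited by that paper. A minimal sketch of that grouping on toy records, using `collections.defaultdict` as an alternative to an explicit membership check (the toy values and the `toy_` names are hypothetical):

```python
from collections import defaultdict

# Hypothetical records with the same keys the pools rely on.
toy_candidates = [
    {"gt_id": "A1", "paper_id": "P1"},
    {"gt_id": "A2", "paper_id": "P1"},
    {"gt_id": "B1", "paper_id": "P2"},
]

# Group records by paper_id; a missing key starts out as an empty list.
toy_local_pool = defaultdict(list)
for item in toy_candidates:
    toy_local_pool[item["paper_id"]].append(item)

print({pid: [c["gt_id"] for c in cands] for pid, cands in toy_local_pool.items()})
```

Building the mapping once keeps each per-query lookup to a single dictionary access instead of a scan over the whole pool.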


In [27]:
# 1. Global pool: every ground-truth record in the train partition
global_candidates = df_train_src[['gt_id', 'gt_title', 'gt_authors', 'gt_year', 'paper_id']].to_dict('records')

# 2. Local pool: group the same records by paper ID
local_pool = {}
for item in global_candidates:
    pid = item['paper_id']
    if pid not in local_pool:
        local_pool[pid] = []
    local_pool[pid].append(item)

print(f"Global Pool Size: {len(global_candidates)}")

Global Pool Size: 6720


## 3. Run Negative Sampling
Procedure for each BibTeX row (query):
1.  Create 1 **Positive** pair (the query with its own ground truth).
2.  Create up to k = `NUM_HARD_NEGATIVES` **Hard Negative** pairs (drawn from the `local_pool` of the same `paper_id`).
3.  Create m **Easy Negative** pairs to reach `NUM_NEGATIVES` in total (drawn at random from `global_candidates` with a different `paper_id`).
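
The three steps can be sketched as a single function for one query. This is an illustration under stated assumptions, not the notebook's exact loop: the function name `sample_pairs`, its parameters, and the toy records are hypothetical, and easy negatives are drawn without replacement via `random.sample`, whereas a retry loop built on `random.choice` can pick the same candidate twice.

```python
import random

def sample_pairs(query, pool, k_hard=2, n_total=4, rng=None):
    """Return (cand_gt_id, label) pairs for one query: 1 positive plus negatives."""
    rng = rng or random.Random(0)
    pairs = [(query["gt_id"], 1)]  # step 1: the positive pair

    # Step 2: hard negatives come from the same paper, excluding the true match.
    local = [c for c in pool
             if c["paper_id"] == query["paper_id"] and c["gt_id"] != query["gt_id"]]
    hard = rng.sample(local, min(k_hard, len(local)))

    # Step 3: easy negatives come from other papers and fill the remaining slots.
    other = [c for c in pool
             if c["paper_id"] != query["paper_id"] and c["gt_id"] != query["gt_id"]]
    easy = rng.sample(other, min(n_total - len(hard), len(other)))

    pairs += [(c["gt_id"], 0) for c in hard + easy]
    return pairs

# Toy pool: three papers with three references each (made-up IDs).
toy_pool = [{"gt_id": f"{p}{i}", "paper_id": p} for p in ("P1", "P2", "P3") for i in range(3)]
pairs = sample_pairs(toy_pool[0], toy_pool)
print(pairs)
```

If a paper has fewer than `k_hard` other references, the shortfall is made up with easy negatives, so the total stays at `n_total` whenever enough candidates exist.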


In [28]:
augmented_rows = []

print("🚀 Generating training data (including Year)...")

for idx, row in tqdm(df_train_src.iterrows(), total=len(df_train_src)):
    # --- A. COLLECT THE QUERY INFO (BIBTEX) ---
    query_info = {
        'bib_title': row['clean_title'],
        'bib_authors': row['clean_authors'],
        'bib_id': row['clean_id'],
        'bib_year': row['clean_year'] # Year extracted from BibTeX
    }

    true_gt_id = row['gt_id']
    current_paper_id = row['paper_id']

    # --- B. CREATE THE POSITIVE SAMPLE (LABEL = 1) ---
    pos_row = query_info.copy()
    pos_row.update({
        'cand_id': row['gt_id'],
        'cand_title': row['gt_title'],
        'cand_authors': row['gt_authors'],
        'cand_year': row['gt_year'],  # <--- IMPORTANT: year of the ground truth
        'label': 1
    })
    augmented_rows.append(pos_row)

    # --- C. CREATE THE NEGATIVE SAMPLES (LABEL = 0) ---
    negatives_collected = 0

    # C.1: Hard negatives (local)
    local_candidates = local_pool.get(current_paper_id, [])
    valid_local_cands = [c for c in local_candidates if c['gt_id'] != true_gt_id]

    if valid_local_cands:
        k_hard = min(NUM_HARD_NEGATIVES, len(valid_local_cands))
        chosen_hard = random.sample(valid_local_cands, k_hard)

        for cand in chosen_hard:
            neg_row = query_info.copy()
            neg_row.update({
                'cand_id': cand['gt_id'],
                'cand_title': cand['gt_title'],
                'cand_authors': cand['gt_authors'],
                'cand_year': cand['gt_year'], # <--- Year of the incorrect candidate
                'label': 0
            })
            augmented_rows.append(neg_row)
            negatives_collected += 1

    # C.2: Easy negatives (global); retry at most 50 times per query
    needed = NUM_NEGATIVES - negatives_collected
    attempts = 0
    while needed > 0 and attempts < 50:
        attempts += 1
        cand = random.choice(global_candidates)

        if cand['gt_id'] != true_gt_id and cand['paper_id'] != current_paper_id:
            neg_row = query_info.copy()
            neg_row.update({
                'cand_id': cand['gt_id'],
                'cand_title': cand['gt_title'],
                'cand_authors': cand['gt_authors'],
                'cand_year': cand['gt_year'], # <--- Year of the incorrect candidate
                'label': 0
            })
            augmented_rows.append(neg_row)
            needed -= 1

🚀 Generating training data (including Year)...


100%|██████████| 6720/6720 [00:00<00:00, 26678.07it/s]


## 4. Save the Results
Store the pairs in a DataFrame for the feature engineering step.


In [29]:

df_augmented = pd.DataFrame(augmented_rows)

print("\n--- DATA AUGMENTATION RESULTS ---")
print(f"Samples before augmentation: {len(df_train_src)}")
print(f"Samples after augmentation: {len(df_augmented)}")
print(f"Positive/Negative counts:\n{df_augmented['label'].value_counts()}")

# Shuffle the data
df_augmented = df_augmented.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

# Save as PKL (keeps the list type of the authors fields)
df_augmented.to_pickle(OUTPUT_FILE)

print(f"\n✅ Training dataset saved to: {os.path.abspath(OUTPUT_FILE)}")
print("👉 NEXT STEP: Run '03_feature_engineering.ipynb' to build feature vectors from this file.")



--- DATA AUGMENTATION RESULTS ---
Samples before augmentation: 6720
Samples after augmentation: 33600
Positive/Negative counts:
label
0    26880
1     6720
Name: count, dtype: int64

✅ Training dataset saved to: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\dataset_final\clean_data\train_augmented.pkl
👉 NEXT STEP: Run '03_feature_engineering.ipynb' to build feature vectors from this file.
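
These counts follow directly from the configuration: each of the 6720 queries contributes one positive and `NUM_NEGATIVES = 4` negatives. A quick arithmetic check, with the values copied from the output above (the variable names are illustrative):

```python
n_queries = 6720       # size of the train partition
num_negatives = 4      # NUM_NEGATIVES from the configuration cell

expected_pos = n_queries
expected_neg = n_queries * num_negatives
expected_total = expected_pos + expected_neg

print(expected_pos, expected_neg, expected_total)              # 6720 26880 33600
print(f"positive share: {expected_pos / expected_total:.0%}")  # positive share: 20%
```

Because the observed total matches exactly and no query can receive more than four negatives, every query received its full quota; the 50-attempt cap on easy negatives was never exhausted.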


## 5. Inspect Sample Rows (Sanity Check)
Inspect one positive and one negative pair. The negative shown is simply the first one after shuffling, so it may be either a hard or an easy negative.


In [30]:

# Take one positive sample
print("--- POSITIVE SAMPLE ---")
pos_sample = df_augmented[df_augmented['label'] == 1].iloc[0]
print(f"Bib Title:  {pos_sample['bib_title']}")
print(f"Cand Title: {pos_sample['cand_title']}")
print(f"Match ID:   {pos_sample['bib_id']} == {pos_sample['cand_id']}")

print("\n--- NEGATIVE SAMPLE (check that the titles differ) ---")
neg_sample = df_augmented[df_augmented['label'] == 0].iloc[0]
print(f"Bib Title:  {neg_sample['bib_title']}")
print(f"Cand Title: {neg_sample['cand_title']}")
print(f"Label:      {neg_sample['label']}")

--- POSITIVE SAMPLE ---
Bib Title:  Adding conditional control to text-to-image diffusion models
Cand Title: Adding Conditional Control to Text-to-Image Diffusion Models
Match ID:    == 2302-05543

--- NEGATIVE SAMPLE (check that the titles differ) ---
Bib Title:  Personalized federated learning via variational bayesian inference.
Cand Title: Narrow-Line Cooling and Imaging of Ytterbium Atoms in an Optical Tweezer Array.
Label:      0


In [None]:
# Save as JSON for manual inspection

import json
with open('sample_positive.json', 'w', encoding='utf-8') as f:
    json.dump(pos_sample.to_dict(), f, ensure_ascii=False, indent=4)
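
If a field holds a value that the standard `json` encoder cannot serialize (for example a NumPy array), `json.dump` raises `TypeError`. A hedged sketch of a fallback through the encoder's `default` hook follows. The helper name `to_jsonable` and the toy record are hypothetical, and the helper is duck-typed on `.tolist()`, which NumPy scalars and arrays both provide. The demo uses the standard-library `array.array` as a stand-in so it runs without third-party imports.

```python
import json
from array import array

def to_jsonable(obj):
    """Fallback for json.dumps: called only for values json cannot encode."""
    if hasattr(obj, "tolist"):   # NumPy scalars/arrays and array.array expose tolist()
        return obj.tolist()
    return str(obj)              # last resort: string representation

# Hypothetical record; "example_array" stands in for an array-valued field.
toy_record = {"cand_title": "Some title", "example_array": array("i", [2023, 2024]), "label": 1}
text = json.dumps(toy_record, default=to_jsonable, ensure_ascii=False, indent=4)
print(text)
```

The same `default=to_jsonable` argument can be passed to `json.dump` when writing to a file.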