# Data Preparation Pipeline

This notebook prepares the dataset for training the code clone detection system by:
1. Filtering BigCloneBench (BCB) data
2. Extracting code fragments
3. Normalizing fragments
4. Creating a unified dataset

**Pipeline Steps:**
- `01_filter_bcb.py`: Filters BCB dataset for clone types 1, 2, and 3
- `02_extract_fragments.py`: Extracts method fragments using Tree-sitter
- `03_normalize.py`: Normalizes code with blind renaming
- `04_create_unified.py`: Creates unified training dataset

## Step 1: Import Required Libraries

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd
import json

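# Add the project root to the import path.
# Assumes the notebook is launched from the project root.
BASE_DIR = Path.cwd()
if str(BASE_DIR) not in sys.path:  # avoid duplicate entries when the cell is re-run
    sys.path.append(str(BASE_DIR))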

print(f"Working directory: {BASE_DIR}")
print("✓ Libraries imported successfully")

## Step 2: Filter BCB Dataset

Filter the BigCloneBench dataset to keep clone types 1, 2, and 3, drop pairs whose code is missing, and generate method IDs.

In [None]:
# Import the script as a module
from datasets.scripts import filter_bcb as step1

# Run the filtering
print("="*60)
print("STEP 1: FILTERING BCB DATASET")
print("="*60)
step1.main()

# Verify output
csv_path = BASE_DIR / "datasets" / "processing" / "bcb_filtered.csv"
if csv_path.exists():
    df = pd.read_csv(csv_path)
    print(f"\n✓ Filtered dataset created: {len(df)} pairs")
    print("  Clone type distribution:")
    print(df['clone_type'].value_counts())
else:
    print("✗ Error: Filtered dataset not created")
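
The internals of `filter_bcb` are not shown in this notebook. As a rough sketch of the three operations described above, the following assumes hypothetical column names (`clone_type`, `code1`, `code2`) and a hash-based ID scheme, either of which may differ from what the script actually does:

```python
import hashlib

import pandas as pd


def method_id(code: str) -> str:
    """Derive a stable ID from the method text (hypothetical scheme)."""
    return hashlib.sha1(code.encode("utf-8")).hexdigest()[:12]


def filter_pairs(pairs: pd.DataFrame) -> pd.DataFrame:
    """Keep clone types 1-3, drop rows with missing code, add method IDs."""
    kept = pairs[pairs["clone_type"].isin([1, 2, 3])]
    kept = kept.dropna(subset=["code1", "code2"]).copy()
    kept["id1"] = kept["code1"].map(method_id)
    kept["id2"] = kept["code2"].map(method_id)
    return kept.reset_index(drop=True)


# Toy input; the column names are assumptions, not the real BCB schema.
toy = pd.DataFrame(
    {
        "clone_type": [1, 2, 4, 3],
        "code1": ["int a() { return 1; }", "void b() {}", "void c() {}", None],
        "code2": ["int a() { return 1; }", "void d() {}", "void e() {}", "void f() {}"],
    }
)
filtered = filter_pairs(toy)
print(filtered[["clone_type", "id1", "id2"]])
```

Hashing the method text gives identical methods the same ID, which makes duplicates easy to spot later; an ID based on file path and line range would be an equally plausible choice.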

## Step 3: Extract Code Fragments

Parse Java files and extract method-level code fragments using Tree-sitter.

In [None]:
from datasets.scripts import extract_fragments as step2

print("="*60)
print("STEP 2: EXTRACTING CODE FRAGMENTS")
print("="*60)
step2.main()

# Verify output
jsonl_path = BASE_DIR / "datasets" / "processing" / "fragments.jsonl"
if jsonl_path.exists():
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        # Skip blank lines, which json.loads would reject
        fragments = [json.loads(line) for line in f if line.strip()]
    print(f"\n✓ Extracted {len(fragments)} fragments")
    print(f"  Sample fragment keys: {list(fragments[0].keys()) if fragments else 'None'}")
else:
    print("✗ Error: Fragments file not created")
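
The check above loads every fragment into memory, which may be slow for a large `fragments.jsonl`. A minimal sketch of a streaming alternative that counts records and collects the keys it sees; the helper name `summarize_jsonl` and the sample field names are hypothetical:

```python
import json
import tempfile
from pathlib import Path


def summarize_jsonl(path):
    """Count records and collect key names without holding all rows in memory."""
    count = 0
    keys = set()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():  # tolerate blank lines
                continue
            record = json.loads(line)
            count += 1
            keys.update(record.keys())
    return count, sorted(keys)


# Demonstration on a throwaway file; the field names are made up.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"id": "m1", "code": "int a() { return 1; }"}\n'
        "\n"
        '{"id": "m2", "code": "void b() {}"}\n',
        encoding="utf-8",
    )
    n_records, seen_keys = summarize_jsonl(sample)

print(n_records, seen_keys)
```

In the notebook, the same helper could be pointed at `jsonl_path` or `norm_path` instead of the temporary file.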

## Step 4: Normalize Code Fragments

Apply blind renaming to normalize identifiers and literals.

In [None]:
from datasets.scripts import normalize as step3

print("="*60)
print("STEP 3: NORMALIZING FRAGMENTS")
print("="*60)
step3.main()

# Verify output
norm_path = BASE_DIR / "datasets" / "processing" / "fragments_normalized.jsonl"
if norm_path.exists():
    with open(norm_path, 'r', encoding='utf-8') as f:
        # Skip blank lines, which json.loads would reject
        normalized = [json.loads(line) for line in f if line.strip()]
    print(f"\n✓ Normalized {len(normalized)} fragments")
    if normalized:
        print(f"  Sample normalized code: {normalized[0].get('normalized_code', '')[:100]}...")
else:
    print("✗ Error: Normalized fragments file not created")
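
Blind renaming, as the term is commonly used in clone detection, replaces every identifier with one placeholder and every literal with another, so that fragments differing only in names and constants normalize to the same text. The following is a regex-based sketch of that idea, not the logic in `normalize`, which is not shown here. The placeholders `ID` and `LIT`, the partial keyword list, and the function name are all assumptions, and comments or unusual literal forms are not handled:

```python
import re

# A partial list of Java keywords, for illustration only.
JAVA_KEYWORDS = {
    "abstract", "boolean", "break", "byte", "case", "catch", "char", "class",
    "continue", "default", "do", "double", "else", "extends", "false", "final",
    "finally", "float", "for", "if", "implements", "import", "int", "interface",
    "long", "new", "null", "package", "private", "protected", "public",
    "return", "short", "static", "super", "switch", "this", "throw", "throws",
    "true", "try", "void", "while",
}

TOKEN_RE = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*")              # string literal
  | (?P<char>'(?:\\.|[^'\\])')                 # character literal
  | (?P<number>\b\d+(?:\.\d+)?[fFdDlL]?\b)     # simple numeric literal
  | (?P<word>[A-Za-z_$][A-Za-z0-9_$]*)         # identifier or keyword
  | (?P<other>\S)                              # any other non-space character
    """,
    re.VERBOSE,
)


def blind_rename(code: str) -> str:
    """Replace identifiers with ID and literals with LIT, keeping keywords."""
    out = []
    for match in TOKEN_RE.finditer(code):
        kind = match.lastgroup
        text = match.group()
        if kind in ("string", "char", "number"):
            out.append("LIT")
        elif kind == "word" and text not in JAVA_KEYWORDS:
            out.append("ID")
        else:
            out.append(text)
    return " ".join(out)


a = blind_rename("int sum(int x) { return x + 10; }")
b = blind_rename("int total(int value) { return value + 42; }")
print(a)
print(a == b)
```

Both toy methods normalize to the same token sequence even though their names and constants differ, which is the property a Type-2 clone detector relies on.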

## Step 5: Create Unified Training Dataset

Create a unified dataset combining BCB and CodeNet data with train/validation/test splits.

In [None]:
from datasets.scripts import create_unified as step4

print("="*60)
print("STEP 4: CREATING UNIFIED DATASET")
print("="*60)
step4.main()

# Verify output
unified_dir = BASE_DIR / "datasets" / "processing" / "unified"
if unified_dir.exists():
    files = list(unified_dir.glob("*.parquet"))
    print(f"\n✓ Created unified dataset with {len(files)} files:")
    for f in sorted(files):
        df_temp = pd.read_parquet(f)
        print(f"  - {f.name}: {len(df_temp)} rows")
else:
    print("✗ Error: Unified dataset directory not created")
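
How `create_unified` assigns rows to splits is not shown in this notebook. One common approach, sketched here under the assumption of an 80/10/10 ratio and a hypothetical string key per row, is to hash the key so that the assignment is deterministic across runs:

```python
import hashlib


def assign_split(key: str, train_pct: int = 80, val_pct: int = 10) -> str:
    """Map a key to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # integer in [0, 100)
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + val_pct:
        return "val"
    return "test"


keys = [f"pair_{i}" for i in range(1000)]  # made-up identifiers
counts = {"train": 0, "val": 0, "test": 0}
for k in keys:
    counts[assign_split(k)] += 1
print(counts)
```

For clone pairs, hashing a per-pair key can still place the same method in two different splits. Hashing a method-level or project-level key instead is one way to reduce that kind of leakage, if it matters for the evaluation.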

## Summary

If every step above reported success, the data preparation pipeline is complete and the following files have been created:

1. **bcb_filtered.csv**: Filtered clone pairs
2. **fragments.jsonl**: Extracted code fragments
3. **fragments_normalized.jsonl**: Normalized fragments
4. **unified/\*.parquet**: Training datasets for all tiers
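
As a quick sanity check before moving on, the expected outputs can be verified in one pass. This sketch reuses the file names and directory layout from the cells above; the helper name `missing_outputs` is hypothetical:

```python
from pathlib import Path


def missing_outputs(processing_dir: Path) -> list:
    """Return the names of expected pipeline outputs that are absent."""
    missing = []
    for name in ("bcb_filtered.csv", "fragments.jsonl", "fragments_normalized.jsonl"):
        if not (processing_dir / name).is_file():
            missing.append(name)
    unified = processing_dir / "unified"
    if not unified.is_dir() or not any(unified.glob("*.parquet")):
        missing.append("unified/*.parquet")
    return missing


# An empty list means every expected output was found.
print(missing_outputs(Path.cwd() / "datasets" / "processing"))
```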

You can now proceed to training notebooks.