# Tier 02: Semantic Analysis (UniXcoder) - Training & Evaluation

This notebook fine-tunes **UniXcoder** for semantic clone detection using AST-flattened representations.

**Tier 2 Overview:**
- **Approach**: AST-based semantic embeddings with fine-tuned UniXcoder
- **Model**: UniXcoder (RoBERTa-based) with InfoNCE loss
- **Input**: AST-flattened code sequences
- **Output**: Similarity scores for semantic clone detection
- **Purpose**: Re-scores the pairs that Tier 1 left ambiguous, i.e. those with a Tier 1 clone probability P in the range 0.4 < P < 0.8
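
The hand-off rule in the last bullet can be sketched as a small routing function. This is a minimal sketch, not code from the project: the name `route_pair` and the returned labels are hypothetical, and only the 0.4 and 0.8 bounds come from the overview. What happens outside the band is an assumption (Tier 1's own decision is kept).

```python
def route_pair(tier1_probability: float, low: float = 0.4, high: float = 0.8) -> str:
    """Decide which tier handles a code pair (hypothetical helper)."""
    if not 0.0 <= tier1_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if low < tier1_probability < high:
        # Ambiguous band from the overview: forward the pair to Tier 2
        return "tier2"
    # Assumption: outside the band, Tier 1's decision is final
    return "tier1_clone" if tier1_probability >= high else "tier1_not_clone"


for p in (0.1, 0.5, 0.9):
    print(p, route_pair(p))
```

The strict inequalities mirror the notation 0.4 < P < 0.8, so a pair scoring exactly 0.4 or 0.8 stays in Tier 1 under this sketch.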

## Step 1: Import Libraries and Setup

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd
import torch
import matplotlib.pyplot as plt

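# Add project root to path.
# Assumes the notebook is launched from the project root, so that the
# `semantic` package used in later cells is importable.
BASE_DIR = Path.cwd()
sys.path.append(str(BASE_DIR))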

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Working directory: {BASE_DIR}")
print(f"Device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
print("✓ Libraries imported successfully")

## Step 2: Load Training Data

In [None]:
DATA_PATH = BASE_DIR / "datasets" / "processing" / "unixcoder_training_data.parquet"

if not DATA_PATH.exists():
    # Fail fast so that the training cell is not run without data
    raise FileNotFoundError(
        f"Training data not found at {DATA_PATH}. "
        "Please run the data preparation notebook first."
    )

df = pd.read_parquet(DATA_PATH)
print(f"✓ Loaded {len(df)} training samples")
print("\nDataset Info:")
print(f"  Columns: {list(df.columns)}")
if "label" in df.columns:
    print("  Label distribution:")
    print(df["label"].value_counts())

## Step 3: Fine-tune UniXcoder

Train UniXcoder with InfoNCE loss on AST-flattened code representations.
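
For readers unfamiliar with InfoNCE, the loss can be sketched in a few lines. This is a framework-free illustration, not the code in `train_unixcoder.py`, which presumably operates on PyTorch tensors. It assumes in-batch negatives (row i of the anchors is paired with row i of the positives, and every other row acts as a negative) and a temperature of 0.07. Both choices are assumptions, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss with in-batch negatives.

    anchors[i] and positives[i] embed a clone pair; every positives[j]
    with j != i is treated as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine_similarity(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # -log( exp(logit_i) / sum_j exp(logit_j) )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(batch, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce_loss(batch, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs: {aligned:.4f}, mismatched pairs: {mismatched:.4f}")
```

The loss is small when each anchor is closest to its own positive and large when it is closer to another row's positive. This is the signal that pulls clone pairs together in embedding space.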

**Training Configuration:**
- Model: microsoft/unixcoder-base
- Loss: InfoNCE (contrastive learning)
- Batch size: 16 (configurable)
- Learning rate: 2e-5
- Epochs: 3-5
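
The settings listed above could be collected in one place, for example as a small dataclass. This is only a sketch: `TrainingConfig` and its field names are hypothetical and are not taken from `train_unixcoder.py`. The listed epoch range is 3-5, and the default of 3 is an arbitrary pick within it.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container mirroring the configuration list above."""

    model_name: str = "microsoft/unixcoder-base"
    batch_size: int = 16
    learning_rate: float = 2e-5
    num_epochs: int = 3  # assumption: lower end of the listed 3-5 range

    def __post_init__(self):
        # Basic sanity checks on the numeric settings
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.num_epochs <= 0:
            raise ValueError("num_epochs must be positive")


default_config = TrainingConfig()
# Derive a variant for a smaller GPU without mutating the default
small_gpu_config = replace(default_config, batch_size=8, num_epochs=5)
print(default_config)
print(small_gpu_config)
```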

**Note:** Training can take several hours, depending on dataset size and hardware.

In [None]:
from semantic.scripts import train_unixcoder

print("="*60)
print("FINE-TUNING UNIXCODER")
print("="*60)
print("This may take several hours depending on your hardware...")

# Run training
# You can modify parameters in the train_unixcoder.py script if needed
train_unixcoder.main()

# Check that the fine-tuned model was written to disk
MODEL_DIR = BASE_DIR / "semantic" / "models" / "unixcoder_finetuned"
if MODEL_DIR.exists():
    print(f"\n✓ Model saved to: {MODEL_DIR}")
    print(f"  Model files: {sorted(p.name for p in MODEL_DIR.iterdir())}")
else:
    print(f"\n✗ Error: Model not found at {MODEL_DIR}")

## Step 4: Evaluate UniXcoder

Evaluate the fine-tuned model on the test set.
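
This notebook does not state which metrics `evaluate_unixcoder.main()` reports. As an illustration of how similarity scores are commonly turned into clone-detection metrics, the sketch below thresholds each score and computes precision, recall, and F1. The function name and the 0.5 threshold are assumptions, and the scores and labels are made-up toy values.

```python
def precision_recall_f1(scores, labels, threshold=0.5):
    """Binary precision, recall and F1 for scores thresholded at `threshold`.

    labels use 1 for clone pairs and 0 for non-clone pairs.
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy values for illustration only
toy_scores = [0.9, 0.8, 0.3, 0.6]
toy_labels = [1, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(toy_scores, toy_labels)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```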

In [None]:
from semantic.scripts import evaluate_unixcoder

print("="*60)
print("EVALUATING TIER 2 (SEMANTIC)")
print("="*60)

# Run evaluation
evaluate_unixcoder.main()

print("\n✓ Tier 2 evaluation completed")

## Summary

Tier 2 (Semantic - UniXcoder) training and evaluation are complete.

**What was done:**
- ✓ Fine-tuned UniXcoder on AST-flattened code representations
- ✓ Trained with InfoNCE contrastive loss
- ✓ Evaluated model performance on semantic similarity
- ✓ Saved model to `semantic/models/unixcoder_finetuned/`

**How it works:**
1. Each code snippet is flattened into an AST sequence
2. UniXcoder generates an embedding for each sequence
3. Cosine similarity between the two embeddings of a pair is used as the clone score
4. Only pairs that Tier 1 left ambiguous (0.4 < P < 0.8) are scored this way
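
Step 1 can be illustrated with Python's built-in `ast` module. This is a sketch under stated assumptions, not the project's preprocessing: it assumes Python source and a plain pre-order traversal that emits node type names, and the helper `flatten_ast` is hypothetical. The actual flattening scheme and target language may differ.

```python
import ast


def flatten_ast(source: str) -> list:
    """Return the AST node type names of `source` in pre-order."""
    tokens = []

    def visit(node):
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


first = "def add(a, b):\n    return a + b\n"
second = "def total(x, y):\n    return x + y\n"
third = "def diff(x, y):\n    return x - y\n"

print(flatten_ast(first))
print(flatten_ast(first) == flatten_ast(second))
print(flatten_ast(first) == flatten_ast(third))
```

Because identifier names are not AST nodes, the first two functions produce the same sequence even though every name differs, while the third differs at the operator node. This kind of renaming invariance is one motivation for structure-based inputs in clone detection.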

Cases still ambiguous after Tier 2 can be sent to Tier 3 (Provenance Analysis).