# Tier 01: Syntactic Analysis (TOMA) - Training & Evaluation

This notebook trains and evaluates the **TOMA (Token-based Metric Aggregation)** classifier for detecting Type-3 clones.

**Tier 01 Overview:**
- **Approach**: Token-based similarity metrics (Jaccard, Dice, Cosine, Levenshtein, etc.)
- **Model**: Random Forest Classifier
- **Features**: 6-dimensional feature vector (syntactic similarity metrics)
- **Output**: Classifies code pairs as TYPE_3, AMBIGUOUS, or NON_CLONE
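
To make the feature idea concrete, here is a minimal sketch of three of the metrics named above (Jaccard, Dice, and a Levenshtein-based similarity) computed over token sequences. The tokenizer, the function names, and the normalization of the edit distance are assumptions for illustration only; the project's actual feature extraction is not shown in this notebook and may differ.

```python
import re


def tokenize(code: str) -> list:
    """Hypothetical tokenizer: identifiers, integer literals, single punctuation chars."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def jaccard(a: set, b: set) -> float:
    """|A & B| / |A | B|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a: set, b: set) -> float:
    """2 * |A & B| / (|A| + |B|); two empty sets are treated as identical."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def levenshtein(s: list, t: list) -> int:
    """Edit distance between two token sequences (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, start=1):
        curr = [i]
        for j, y in enumerate(t, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity_features(code_a: str, code_b: str) -> dict:
    """Return a small dict of token-level similarity scores in [0, 1]."""
    ta, tb = tokenize(code_a), tokenize(code_b)
    sa, sb = set(ta), set(tb)
    longest = max(len(ta), len(tb))
    # Assumed normalization: 1 - distance / length of the longer sequence.
    lev_sim = 1 - levenshtein(ta, tb) / longest if longest else 1.0
    return {"jaccard": jaccard(sa, sb), "dice": dice(sa, sb), "levenshtein_sim": lev_sim}


features = similarity_features("int sum = a + b;", "int total = a + b;")
```

The two snippets differ only in one renamed identifier, so all three scores come out high but below 1.0, which is the kind of signal a Type-3 clone classifier can learn from.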

## Step 1: Import Libraries and Setup

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Add project root to path (assumes the notebook is launched from the project root)
BASE_DIR = Path.cwd()
sys.path.append(str(BASE_DIR))

print(f"Working directory: {BASE_DIR}")
print("✓ Libraries imported successfully")

## Step 2: Load Training Data

In [None]:
DATA_PATH = BASE_DIR / "datasets" / "processing" / "unixcoder_training_data.parquet"

if DATA_PATH.exists():
    df = pd.read_parquet(DATA_PATH)
    print(f"✓ Loaded {len(df)} training samples")
    print("\nDataset Info:")
    print(f"  Columns: {list(df.columns)}")
    print("  Label distribution:")
    print(df['label'].value_counts())
else:
    print(f"✗ Error: Training data not found at {DATA_PATH}")
    print("  Please run the data preparation notebook first.")

## Step 3: Train TOMA Classifier

Run the training script which computes syntactic features and trains a Random Forest classifier.

In [None]:
from syntactic.scripts import train_evaluate_toma

print("="*60)
print("TRAINING TOMA CLASSIFIER")
print("="*60)

# Run training with default parameters
train_evaluate_toma.main()

# Load the trained model to confirm it was written to disk
MODEL_PATH = BASE_DIR / "syntactic" / "models" / "classifier.joblib"
if MODEL_PATH.exists():
    model = joblib.load(MODEL_PATH)
    print(f"\n✓ Model loaded from: {MODEL_PATH}")
    print(f"  Model type: {type(model).__name__}")
else:
    print(f"\n✗ Error: Model not found at {MODEL_PATH}")
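
The internals of `train_evaluate_toma.main()` are not shown in this notebook. As a rough illustration of what training a Random Forest on 6-dimensional similarity features can look like, the sketch below uses scikit-learn on synthetic data. The data, the hyperparameters (`n_estimators=100`, the 75/25 split), and all variable names are assumptions, not the script's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for real feature vectors (NOT project data):
# clone pairs get high similarity scores, non-clone pairs get low ones.
n = 200
clones = rng.uniform(0.6, 1.0, size=(n, 6))
non_clones = rng.uniform(0.0, 0.5, size=(n, 6))
X = np.vstack([clones, non_clones])
y = np.array([1] * n + [0] * n)  # 1 = clone, 0 = non-clone

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Probability of the "clone" class for each test pair.
clone_col = list(clf.classes_).index(1)
clone_proba = clf.predict_proba(X_test)[:, clone_col]
test_accuracy = clf.score(X_test, y_test)
```

Working with `predict_proba` rather than `predict` keeps a continuous score per pair, which is what a threshold-based routing scheme needs.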

## Step 4: Evaluate on Test Set

Run the evaluation script on the Tier 1 test set to get detailed metrics.

In [None]:
from syntactic.scripts import evaluate_tier1

print("="*60)
print("EVALUATING TIER 1 (SYNTACTIC)")
print("="*60)

# Run evaluation
evaluate_tier1.main()

print("\n✓ Tier 1 evaluation completed")

## Summary

Tier 1 (Syntactic - TOMA) training and evaluation completed!

**What was done:**
- ✓ Trained Random Forest classifier on syntactic features
- ✓ Evaluated model performance on test set
- ✓ Saved model to `syntactic/models/classifier.joblib`

**Classification Thresholds:**
- P ≥ 0.8: TYPE_3 Clone
- 0.4 < P < 0.8: AMBIGUOUS (sent to Tier 2)
- P ≤ 0.4: NON_CLONE

Ambiguous cases will be processed by Tier 2 (Semantic Analysis).
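
The thresholds above can be sketched as a small routing function. This assumes `P` is the classifier's predicted probability that a pair is a clone; the function and constant names are hypothetical and are not taken from the project code.

```python
CLONE_THRESHOLD = 0.8      # P >= 0.8 -> TYPE_3
NON_CLONE_THRESHOLD = 0.4  # P <= 0.4 -> NON_CLONE


def route_pair(p: float) -> str:
    """Map a clone probability to a Tier 1 decision using the thresholds above."""
    if p >= CLONE_THRESHOLD:
        return "TYPE_3"
    if p > NON_CLONE_THRESHOLD:
        return "AMBIGUOUS"  # handed to Tier 2 (Semantic Analysis)
    return "NON_CLONE"


decisions = [route_pair(p) for p in (0.95, 0.8, 0.6, 0.4, 0.1)]
```

Note that both boundary values are decisive: a score of exactly 0.8 is a TYPE_3 clone and a score of exactly 0.4 is a NON_CLONE, so only scores strictly between them reach Tier 2.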