# Tier 03: Provenance Analysis - Training & Evaluation

This notebook covers **Provenance Analysis** for detecting clones based on code origin and history.

**Tier 03 Overview:**
- **Approach**: Code provenance and history analysis
- **Purpose**: Final tier for highly ambiguous cases
- **Methods**: Author metadata, commit history, file lineage
- **Output**: Clone classification based on provenance evidence

**Note:** This tier typically relies on external version control analysis tools and may require integration with Git repositories or another version control system (VCS).

## Step 1: Import Libraries and Setup

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd
import json

# Add project root to path
BASE_DIR = Path.cwd()
sys.path.append(str(BASE_DIR))

print(f"Working directory: {BASE_DIR}")
print("✓ Libraries imported successfully")

## Step 2: Provenance Analysis Overview

Provenance analysis examines:
- **Author information**: Who wrote the code
- **Commit history**: When and how code evolved
- **File lineage**: Code movement and refactoring
- **Repository metadata**: Project relationships

This tier helps distinguish between:
- **Independent development**: Different authors creating similar solutions
- **Code reuse**: Legitimate copying with attribution
- **Plagiarism**: Unauthorized copying without attribution
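
The three categories above can be expressed as a small decision rule. The following is a minimal sketch, not the project's actual logic: `ProvenanceEvidence`, `ProvenanceVerdict`, and `classify_provenance` are hypothetical names introduced for illustration, and the rule assumes that shared lineage and attribution have already been established from version control data.

```python
from dataclasses import dataclass
from enum import Enum


class ProvenanceVerdict(Enum):
    INDEPENDENT = "independent development"
    REUSE = "code reuse"
    PLAGIARISM = "plagiarism"


@dataclass
class ProvenanceEvidence:
    # Hypothetical fields; a real pipeline would derive these from VCS data.
    shared_lineage: bool  # does one file trace back to the other?
    attributed: bool      # does the copy credit its source?


def classify_provenance(evidence: ProvenanceEvidence) -> ProvenanceVerdict:
    """Map provenance evidence onto the three categories listed above."""
    if not evidence.shared_lineage:
        # No evidence of copying: similar code written separately.
        return ProvenanceVerdict.INDEPENDENT
    if evidence.attributed:
        # Copied, but the source is credited.
        return ProvenanceVerdict.REUSE
    # Copied with no credit to the source.
    return ProvenanceVerdict.PLAGIARISM
```

Real cases are rarely this binary, so a production rule would likely work with confidence scores rather than booleans.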

## Step 3: Check Provenance Router

The provenance module provides API endpoints for provenance-based clone detection.

In [None]:
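# Import the router from the project's provenance package
from provenance import router as provenance_router

print("Provenance module loaded successfully")
print("\nPublic names in the provenance module:")
# Hide private and dunder attributes so the listing stays readable
print([name for name in dir(provenance_router) if not name.startswith("_")])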

## Step 4: Provenance Analysis Example

Example workflow for provenance-based analysis:

1. Extract Git metadata from repositories
2. Analyze commit history and authorship
3. Compare file lineage between code pairs
4. Generate provenance similarity scores
5. Make final clone determination
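
Steps 1 and 2 of the workflow above could be sketched with the `git log` command line. This is an illustrative sketch, not the provenance module's implementation: `CommitRecord`, `parse_git_log`, and `file_history` are hypothetical names, and the code assumes `git` is installed and that author names and hashes contain no tab characters.

```python
import subprocess
from typing import List, NamedTuple


class CommitRecord(NamedTuple):
    commit: str
    author: str
    date: str
    subject: str


# Tab-separated fields: hash, author name, ISO 8601 author date, subject line.
GIT_LOG_FORMAT = "%H%x09%an%x09%aI%x09%s"


def parse_git_log(output: str) -> List[CommitRecord]:
    """Parse `git log` output produced with GIT_LOG_FORMAT."""
    records = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first three tabs only, so the subject may contain tabs.
        commit, author, date, subject = line.split("\t", 3)
        records.append(CommitRecord(commit, author, date, subject))
    return records


def file_history(repo_dir: str, file_path: str) -> List[CommitRecord]:
    """Return the commits that touched file_path, following renames."""
    result = subprocess.run(
        ["git", "log", "--follow", f"--format={GIT_LOG_FORMAT}", "--", file_path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git fails
    )
    return parse_git_log(result.stdout)
```

Keeping the parser separate from the `subprocess` call means the parsing logic can be exercised on captured text without a repository on disk.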

In [None]:
# Example: Provenance-based analysis
# This would typically integrate with Git repositories

example_provenance_data = {
    "code_pair_1": {
        "file_a": {
            "author": "Alice",
            "created": "2024-01-15",
            "repository": "project-a",
            "commit_history": ["init", "refactor", "optimize"]
        },
        "file_b": {
            "author": "Bob",
            "created": "2024-03-20",
            "repository": "project-b",
            "commit_history": ["init", "copy from project-a"]
        },
        "analysis": "Likely code reuse - commit message indicates copying"
    },
    "code_pair_2": {
        "file_a": {
            "author": "Charlie",
            "created": "2024-02-10",
            "repository": "project-c"
        },
        "file_b": {
            "author": "Diana",
            "created": "2024-02-12",
            "repository": "project-d"
        },
        "analysis": "Independent development - different authors, close timing"
    }
}

print("Provenance Analysis Examples:")
print(json.dumps(example_provenance_data, indent=2))
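
The `analysis` strings in the example above are written by hand. As a rough sketch of how such a judgment might be derived from the metadata, the code below scans commit messages for phrases that suggest copying. The `COPY_HINT` pattern, `copy_hints`, and `assess_pair` are assumptions made for illustration; real commit messages vary widely, so a keyword match is weak evidence on its own.

```python
import re
from datetime import date

# Hypothetical keyword pattern for messages such as "copy from project-a".
COPY_HINT = re.compile(
    r"\b(cop(y|ied)|ported|borrowed|taken) from\b", re.IGNORECASE
)


def copy_hints(commit_history):
    """Return the commit messages that mention copying from elsewhere."""
    return [msg for msg in commit_history if COPY_HINT.search(msg)]


def assess_pair(file_a, file_b):
    """Summarize simple provenance signals for one pair of file records."""
    hints = copy_hints(file_a.get("commit_history", [])) + copy_hints(
        file_b.get("commit_history", [])
    )
    created_a = date.fromisoformat(file_a["created"])
    created_b = date.fromisoformat(file_b["created"])
    return {
        "same_author": file_a["author"] == file_b["author"],
        "days_apart": abs((created_b - created_a).days),
        "copy_hints": hints,
        "likely_reuse": bool(hints),
    }


# Records shaped like the entries in the example data above.
pair_1 = assess_pair(
    {"author": "Alice", "created": "2024-01-15",
     "commit_history": ["init", "refactor", "optimize"]},
    {"author": "Bob", "created": "2024-03-20",
     "commit_history": ["init", "copy from project-a"]},
)
print(pair_1)
```

A record without a `commit_history` key, as in the second example pair, simply yields no hints.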

## Summary

This completes the Tier 03 (Provenance Analysis) overview.

**Key Points:**
- Provenance analysis examines code history and origin
- Useful for distinguishing independent development from copying
- Requires integration with version control systems (Git, etc.)
- Provides final decision for highly ambiguous cases

**Implementation Notes:**
- Provenance module is available at `provenance/router.py`
- Typically used through API endpoints
- Requires access to repository metadata
- Can be extended with custom provenance extraction logic

**Next Steps:**
- Integrate with Git repositories for real provenance data
- Implement author attribution analysis
- Add commit history comparison
- Build provenance similarity scoring
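
As one possible starting point for the commit history comparison and similarity scoring steps, the sketch below computes a Jaccard index over two commit histories. `jaccard_similarity` is a hypothetical helper, not part of the provenance module, and the choice of the Jaccard index is an assumption. In practice commit hashes would make a stronger signal than commit messages, since generic messages such as "init" recur across unrelated projects.

```python
def jaccard_similarity(history_a, history_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of commit identifiers."""
    set_a, set_b = set(history_a), set(history_b)
    if not set_a and not set_b:
        return 0.0  # no history on either side: treat as no evidence
    return len(set_a & set_b) / len(set_a | set_b)


score = jaccard_similarity(
    ["init", "refactor", "optimize"],
    ["init", "copy from project-a"],
)
print(f"History overlap: {score:.2f}")
```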