# 02 - Interactive Plagiarism Detection

This notebook provides four callable functions for plagiarism detection:
1. `detect_embedding()` - Pure embedding-based search
2. `detect_llm()` - Direct LLM analysis (zero-shot, no retrieval)
3. `detect_rag()` - Standard RAG (retrieve + generate)
4. `detect_hybrid_rag()` - Hybrid RAG (BM25 + Dense fusion)

The retrieval-based functions use the pre-built indexes produced by `01_indexing.ipynb`.

In [1]:
import sys
import os
import numpy as np

# Add the project root to the path so the `src` package is importable
sys.path.append(os.path.abspath('.'))

from src.retrieval import DenseRetriever, BM25Retriever, HybridRetriever
from src.llm import GeminiLLM
from src.embeddings import EmbeddingGenerator
from src.config import DEFAULT_TOP_K, DEFAULT_SIMILARITY_THRESHOLD, DEFAULT_ALPHA

print("✓ Imports successful")

✓ Imports successful




## Load Pre-built Indexes

In [2]:
print("Loading pre-built indexes...\n")

# Load dense retriever
dense_retriever = DenseRetriever.load("indexes/dense_retriever.pkl")
print(f"✓ Dense retriever loaded: {len(dense_retriever.chunks)} functions")

# Load BM25 retriever
bm25_retriever = BM25Retriever.load("indexes/bm25_retriever.pkl")
print(f"✓ BM25 retriever loaded: {len(bm25_retriever.chunks)} functions")

# Create hybrid retriever
hybrid_retriever = HybridRetriever(dense_retriever, bm25_retriever)
print("✓ Hybrid retriever initialized")

# Initialize LLM
llm = GeminiLLM()
print("✓ Gemini LLM initialized")

print("\n" + "="*50)
print("All systems ready!")
print("="*50)

Loading pre-built indexes...

✓ Dense retriever loaded: 3520 functions
✓ BM25 retriever loaded: 3520 functions
✓ Hybrid retriever initialized
✓ Gemini LLM initialized

All systems ready!


## System 1: Pure Embedding Search

In [3]:
def detect_embedding(query_code, threshold=DEFAULT_SIMILARITY_THRESHOLD, top_k=5):
    """
    Detect plagiarism using pure embedding similarity.
    
    Args:
        query_code: Code snippet to check
        threshold: Similarity threshold (default: 0.85)
        top_k: Number of similar functions to retrieve
        
    Returns:
        Dictionary with plagiarism detection results
    """
    # Retrieve most similar functions
    results = dense_retriever.retrieve(query_code, top_k=top_k)
    
    if not results:
        # Return the same keys as the normal path so callers can rely on them
        return {
            'is_plagiarism': False,
            'confidence': 0.0,
            'max_similarity': 0.0,
            'matched_function': None,
            'matched_file': None,
            'matches': []
        }
    
    # Get top match
    top_chunk, max_similarity = results[0]
    
    # Determine plagiarism based on threshold
    is_plagiarism = max_similarity >= threshold
    
    # Format matches
    matches = [
        {
            'function_name': chunk.function_name,
            'file_path': chunk.file_path,
            'similarity': float(score)
        }
        for chunk, score in results
    ]
    
    return {
        'is_plagiarism': is_plagiarism,
        'confidence': float(max_similarity * 100),
        'max_similarity': float(max_similarity),
        'matched_function': top_chunk.function_name if is_plagiarism else None,
        'matched_file': top_chunk.file_path if is_plagiarism else None,
        'matches': matches
    }

print("✓ detect_embedding() function defined")

✓ detect_embedding() function defined
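
The threshold decision in `detect_embedding()` can be illustrated in isolation. The sketch below assumes the score returned by the retriever is a cosine similarity between embedding vectors, which this notebook does not confirm. `cosine_similarity`, `is_above_threshold`, and the toy vectors are hypothetical names and values used only for illustration; they are not part of `src.retrieval`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumed similarity metric)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # Treat zero vectors as having no similarity
    return float(np.dot(a, b) / denom)

def is_above_threshold(query_vec, corpus_vecs, threshold):
    """Return (flag, best_score); flag is True when the best match meets the threshold."""
    scores = [cosine_similarity(query_vec, v) for v in corpus_vecs]
    best = max(scores) if scores else 0.0
    return best >= threshold, best

# Toy 2-D "embeddings" (illustrative values only)
query = [1.0, 0.0]
corpus = [[1.0, 0.0], [0.0, 1.0]]
flag, best = is_above_threshold(query, corpus, threshold=0.85)
```

Because only the single best score is compared against the threshold, one near-duplicate in the corpus is enough to flag a query, and anything below the threshold is reported as not plagiarized regardless of how many moderate matches exist.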


## System 2: Direct LLM Analysis

In [4]:
def detect_llm(query_code):
    """
    Detect plagiarism using direct LLM analysis (zero-shot, no retrieval).
    """
    result = llm.analyze_plagiarism_direct(query_code)
    return result

print("✓ detect_llm() function defined")

✓ detect_llm() function defined


## System 3: Standard RAG

In [5]:
def detect_rag(query_code, top_k=DEFAULT_TOP_K):
    """
    Detect plagiarism using standard RAG (retrieve + generate).
    
    Args:
        query_code: Code snippet to check
        top_k: Number of documents to retrieve
        
    Returns:
        Dictionary with plagiarism detection results
    """
    # Retrieve top-k most similar functions
    retrieved = dense_retriever.retrieve(query_code, top_k=top_k)
    
    # Extract chunks
    candidate_chunks = [chunk for chunk, score in retrieved]
    
    # Analyze with LLM
    result = llm.analyze_plagiarism_with_context(query_code, candidate_chunks)
    
    # Add retrieval information
    result['retrieved_functions'] = [
        {
            'function_name': chunk.function_name,
            'similarity': float(score)
        }
        for chunk, score in retrieved[:5]  # Top 5 for summary
    ]
    
    return result

print("✓ detect_rag() function defined")

✓ detect_rag() function defined


## System 4: Hybrid RAG

In [6]:
def detect_hybrid_rag(query_code, top_k=DEFAULT_TOP_K, alpha=DEFAULT_ALPHA):
    """
    Detect plagiarism using hybrid RAG (BM25 + Dense embeddings fusion).
    
    Args:
        query_code: Code snippet to check
        top_k: Number of documents to retrieve
        alpha: Fusion weight (0=pure BM25, 1=pure dense)
        
    Returns:
        Dictionary with plagiarism detection results
    """
    # Hybrid retrieval
    retrieved = hybrid_retriever.retrieve(
        query_code, 
        top_k=top_k, 
        alpha=alpha,
        fusion_method='rrf'  # Reciprocal Rank Fusion
    )
    
    # Extract chunks
    candidate_chunks = [chunk for chunk, score in retrieved]
    
    # Analyze with LLM
    result = llm.analyze_plagiarism_with_context(query_code, candidate_chunks)
    
    # Add retrieval information
    result['retrieved_functions'] = [
        {
            'function_name': chunk.function_name,
            'fusion_score': float(score)
        }
        for chunk, score in retrieved[:5]  # Top 5 for summary
    ]
    result['fusion_alpha'] = alpha
    
    return result

print("✓ detect_hybrid_rag() function defined")

✓ detect_hybrid_rag() function defined
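
To make the fusion step more concrete, here is a minimal sketch of Reciprocal Rank Fusion with a weight between the two rankings. How `HybridRetriever` actually combines `alpha` with `fusion_method='rrf'` is not shown in this notebook, so the weighting below is an assumption. `weighted_rrf`, the constant `k=60` (a commonly used RRF default), and the function IDs are hypothetical.

```python
def weighted_rrf(dense_ranking, bm25_ranking, alpha=0.5, k=60, top_k=5):
    """Fuse two ranked lists of IDs with weighted Reciprocal Rank Fusion.

    Each list contributes weight / (k + rank) per item, with rank starting at 1.
    alpha weights the dense list and (1 - alpha) the BM25 list (assumed scheme).
    """
    scores = {}
    for rank, doc_id in enumerate(dense_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    for rank, doc_id in enumerate(bm25_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1.0 - alpha) / (k + rank)
    # Highest fused score first; sorted() is stable, so ties keep insertion order
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Illustrative rankings: best match first
dense = ["f1", "f2", "f3"]
bm25 = ["f2", "f4", "f1"]
fused = weighted_rrf(dense, bm25, alpha=0.5)
fused_ids = [doc_id for doc_id, _ in fused]
```

RRF works on ranks rather than raw scores, so it avoids having to normalize BM25 scores and cosine similarities onto a common scale. An item that appears near the top of both lists (here `f2`) rises above an item that leads only one of them.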


## Interactive Testing

Test the functions with sample code:

In [7]:
# Sample test: plagiarized code (renamed reverse_string)
test_code_plagiarism = """
def invert_string(s):
    return s[::-1]
"""

print("Testing with PLAGIARIZED code:")
print(test_code_plagiarism)
print("\n" + "="*50)

# Test with embedding search
result_emb = detect_embedding(test_code_plagiarism)
print("\n1. EMBEDDING SEARCH:")
print(f"   Plagiarism: {result_emb['is_plagiarism']}")
print(f"   Confidence: {result_emb['confidence']:.2f}%")
print(f"   Top match: {result_emb['matched_function']}")

# Test with RAG
result_rag = detect_rag(test_code_plagiarism)
print("\n2. STANDARD RAG:")
print(f"   Plagiarism: {result_rag['is_plagiarism']}")
print(f"   Confidence: {result_rag['confidence']}")
print(f"   Matched: {result_rag.get('matched_function', 'N/A')}")
print(f"   Reasoning: {result_rag.get('reasoning', 'N/A')[:100]}...")

Testing with PLAGIARIZED code:

def invert_string(s):
    return s[::-1]



1. EMBEDDING SEARCH:
   Plagiarism: False
   Confidence: 67.77%
   Top match: None

2. STANDARD RAG:
   Plagiarism: False
   Confidence: 0
   Matched: None
   Reasoning: The query code implements a simple string reversal using slicing. This is a common and basic program...


In [8]:
# Sample test: original code (not in corpus)
test_code_original = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""

print("Testing with ORIGINAL code:")
print(test_code_original)
print("\n" + "="*50)

# Test with embedding search
result_emb = detect_embedding(test_code_original)
print("\n1. EMBEDDING SEARCH:")
print(f"   Plagiarism: {result_emb['is_plagiarism']}")
print(f"   Confidence: {result_emb['confidence']:.2f}%")
print(f"   Max similarity: {result_emb['max_similarity']:.4f}")

# Test with Hybrid RAG
result_hybrid = detect_hybrid_rag(test_code_original)
print("\n2. HYBRID RAG:")
print(f"   Plagiarism: {result_hybrid['is_plagiarism']}")
print(f"   Confidence: {result_hybrid['confidence']}")
print(f"   Reasoning: {result_hybrid.get('reasoning', 'N/A')[:100]}...")

Testing with ORIGINAL code:

def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)



1. EMBEDDING SEARCH:
   Plagiarism: False
   Confidence: 64.38%
   Max similarity: 0.6438

2. HYBRID RAG:
   Plagiarism: False
   Confidence: 0
   Reasoning: The query code implements the fibonacci sequence recursively. None of the reference functions implem...


## Custom Testing

Test with your own code:

In [9]:
# Enter your code here
custom_code = """
def your_function_here():
    pass
"""

# Choose detection method
result = detect_embedding(custom_code)  # or detect_rag(), detect_hybrid_rag(), detect_llm()

print("Detection Result:")
print(result)

Detection Result:
{'is_plagiarism': False, 'confidence': 70.96100538881203, 'max_similarity': 0.7096100538881203, 'matched_function': None, 'matched_file': None, 'matches': [{'function_name': 'pass_obj', 'file_path': 'data/reference_corpus/click/src/click/decorators.py', 'similarity': 0.7096100538881203}, {'function_name': 'test_formatdef', 'file_path': 'data/reference_corpus/pluggy/testing/test_helpers.py', 'similarity': 0.7077624246798683}, {'function_name': 'make_pass_decorator', 'file_path': 'data/reference_corpus/click/src/click/decorators.py', 'similarity': 0.705070677213678}, {'function_name': 'm', 'file_path': 'data/reference_corpus/pluggy/testing/test_multicall.py', 'similarity': 0.6982292575019763}, {'function_name': 'm', 'file_path': 'data/reference_corpus/pluggy/testing/test_multicall.py', 'similarity': 0.6982292575019763}]}
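
To compare methods on the same snippet, a small helper along the following lines may be convenient. It is a sketch, not part of the project: `compare_detectors` and the stand-in detectors are hypothetical, and it assumes every detector returns a dictionary with `is_plagiarism` and `confidence` keys, as the outputs above suggest for the embedding and RAG functions.

```python
def compare_detectors(code, detectors):
    """Run each detector on the same snippet and collect one summary row per method."""
    rows = []
    for name, detect in detectors.items():
        try:
            result = detect(code)
            rows.append({
                "method": name,
                "is_plagiarism": result.get("is_plagiarism"),
                "confidence": result.get("confidence"),
                "error": None,
            })
        except Exception as exc:  # e.g. a failed LLM call; keep the other methods running
            rows.append({
                "method": name,
                "is_plagiarism": None,
                "confidence": None,
                "error": str(exc),
            })
    return rows

# Stand-in detectors so the sketch runs without the indexes or an API key
def fake_detector(code):
    return {"is_plagiarism": True, "confidence": 90.0}

def failing_detector(code):
    raise RuntimeError("service unavailable")

rows = compare_detectors(
    "def f(): pass",
    {"fake": fake_detector, "failing": failing_detector},
)
```

In this notebook, the dictionary could instead map names to the real functions, for example `{"embedding": detect_embedding, "rag": detect_rag, "hybrid_rag": detect_hybrid_rag, "llm": detect_llm}`. Note that the confidence values are not necessarily on the same scale across methods: the embedding method reports its similarity as a percentage, while the LLM-based methods report whatever the model returns.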


## All Systems Ready!

The following functions are available:
- `detect_embedding(query_code, threshold, top_k)`
- `detect_llm(query_code)`
- `detect_rag(query_code, top_k)`
- `detect_hybrid_rag(query_code, top_k, alpha)`

These will be used in `03_evaluation.ipynb` for full evaluation.