# Source code plagiarism detector (.java files)
This notebook develops a model for detecting plagiarism in Java source files. It uses the "conplag" dataset, located at `dataset/versions/bplag_version_2`, which contains folders of original files and plagiarized versions.

## Objective
Build an automated hybrid tool that detects source code plagiarism by combining lexical analysis (tokenization) with machine learning techniques. The target is an accuracy above 80% when classifying code as plagiarized or not plagiarized.

## Importing libraries
The cell below imports the libraries needed for file handling, Java code tokenization, data manipulation, and building the machine learning model.

In [None]:
# Standard library
import os
import random
from collections import Counter
from pathlib import Path

# Data handling and plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine

# Lexical analysis of Java code
from pygments import lex
from pygments.lexers import JavaLexer
from pygments.token import Token

# Resampling for class imbalance
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Feature extraction, models, and evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

## Defining paths
The base paths for the dataset and the labels file are defined here.

In [2]:
ROOT = Path().cwd().parent
BASE_PATH = ROOT / "dataset" / "versions" / "bplag_version_2"
LABELS_PATH = ROOT / "dataset" / "versions" / "labels.csv"

## Reading the .java files
The `read_java_files()` function walks the dataset's folder structure (pair folder, then submission folder) and reads every .java file it finds. It returns a list of tuples holding each submission's identifier and its source code.

In [3]:
def read_java_files(base_path):
    """
    Reads all .java files found two levels below the given base path
    (base_path/<submission_pair>/<submission_id>/*.java).
    
    Args:
        base_path (str or Path): Base directory containing submission pairs.
    
    Returns:
        data (list): List of tuples (submission_id, code_content).
    """
    data = []
    
    for submission_pair in os.listdir(base_path):
        pair_path = os.path.join(base_path, submission_pair)
        
        if os.path.isdir(pair_path):
            for submission_id in os.listdir(pair_path):
                submission_path = os.path.join(pair_path, submission_id)
                
                if os.path.isdir(submission_path):
                    for file in os.listdir(submission_path):
                        if file.endswith('.java'):
                            file_path = os.path.join(submission_path, file)
                            with open(file_path, 'r', encoding='utf-8') as f:
                                code = f.read()
                                data.append((submission_id, code))
    
    return data
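
`os.listdir()` returns entries in arbitrary order, and later cells pair files by their position in the list. The following is a minimal sketch, not the notebook's own implementation, of a `pathlib` variant that visits folders in sorted order so the pairing is reproducible. The name `read_java_files_sorted` and the folder layout in the comment are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def read_java_files_sorted(base_path):
    """Hypothetical variant of read_java_files() that visits folders in sorted order."""
    data = []
    base = Path(base_path)
    # Assumed layout: base/<pair_folder>/<submission_id>/<file>.java
    for pair_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        for sub_dir in sorted(p for p in pair_dir.iterdir() if p.is_dir()):
            for java_file in sorted(sub_dir.glob("*.java")):
                # errors="replace" keeps going if a file is not valid UTF-8
                code = java_file.read_text(encoding="utf-8", errors="replace")
                data.append((sub_dir.name, code))
    return data


# Tiny self-check on a throwaway directory tree (made-up names)
with tempfile.TemporaryDirectory() as tmp:
    for sub_id in ("bbb", "aaa"):
        folder = Path(tmp) / "aaa_bbb" / sub_id
        folder.mkdir(parents=True)
        (folder / "Main.java").write_text(f"class {sub_id} {{}}", encoding="utf-8")
    demo = read_java_files_sorted(tmp)

print([sub_id for sub_id, _ in demo])
```

Sorting only makes the order deterministic. Pairing by position still assumes that every submission folder contains exactly one .java file.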

In [4]:
java_files_data = read_java_files(BASE_PATH)

print(f"Total submissions loaded: {len(java_files_data)}")
print("\nFirst 2 submissions loaded:")
for submission_id, code in java_files_data[:2]:
    print(f"Submission ID: {submission_id}\nCode snippet:\n{code[:300]}...\n")

Total submissions loaded: 1822

First 2 submissions loaded:
Submission ID: 0017d438
Code snippet:
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.*;
public class Main {
    static int modulo=998244353;
    public static void main(String[] args) {
       
        FastScanner in = new FastScanner();
     ...

Submission ID: 9852706b
Code snippet:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.util.*;

public class A {
    static List<Integer> [] adj;
    static ArrayList<Integer> temp;
    static int mod = (int) 1e9+7;
    static boolean[] vis = new boolean...



## Token extraction

This section uses the Pygments library as a lexical analyzer. Pygments splits source code into tokens and classifies them into categories; here it is used to extract tokens from Java source code.

In [5]:
def extract_tokens(code):
    """
    Extracts tokens from the given Java code using Pygments.
    
    Args:
        code (str): Java code as a string.
        
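    Returns:
        str: Space-separated tokens (names, keywords, and operators only).
    """
    lexer = JavaLexer()
    tokens = []
    for ttype, value in lex(code, lexer):
        if ttype in Token.Name or ttype in Token.Keyword or ttype in Token.Operator:
            val = value.strip()
            if val:
                # ttype.__class__.__name__ is always "_TokenType", so every token
                # carries the same prefix (e.g. "_TokenType:int"). Later cells
                # rely on that exact prefix.
                tokens.append(f"{ttype.__class__.__name__}:{val}")
    return " ".join(tokens)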

## Preparing the data and labels
Labels are loaded from labels.csv into a dictionary that gives quick access to the verdict (plagiarism or not) for each pair of submissions.

The Java code entries are then traversed in consecutive pairs: tokens are extracted from each file and concatenated into a single text representation of the pair.

In parallel, each pair receives its label (plagiarized or not) from the dictionary built earlier.

This process produces two lists:

- token_pairs: Text representation of each code pair.

- labels: Binary labels indicating whether the pair is plagiarized.

In [6]:
labels_df = pd.read_csv(LABELS_PATH)

# Map (sub1, sub2) -> verdict for fast lookup
labels_dict = {}
for _, row in labels_df.iterrows():
    key = (row['sub1'], row['sub2'])
    labels_dict[key] = row['verdict']

token_pairs = []
labels = []

# Consecutive entries in java_files_data are assumed to belong to the same pair
for i in range(0, len(java_files_data), 2):
    try:
        id1, code1 = java_files_data[i]
        id2, code2 = java_files_data[i+1]
    except IndexError:
        break  # odd number of files: the last one has no partner
        
    t1 = extract_tokens(code1)
    t2 = extract_tokens(code2)
    token_pairs.append(f"{t1} {t2}")
    
    # labels.csv may list the pair in either order
    if (id1, id2) in labels_dict:
        labels.append(labels_dict[(id1, id2)])
    elif (id2, id1) in labels_dict:
        labels.append(labels_dict[(id2, id1)])
    else:
        # Unlabeled pairs are treated as non-plagiarized (label 0)
        print(f"Warning: No label found for pair ({id1}, {id2})")
        labels.append(0)
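
The label lookup has to work regardless of the order in which a pair's IDs appear. One way to express that, shown here only as a sketch with hypothetical helper names (`build_label_lookup`, `lookup_label`) and made-up rows, is to key the dictionary by a `frozenset` of the two IDs:

```python
def build_label_lookup(rows):
    """Map an unordered pair of submission IDs to its verdict."""
    return {frozenset((sub1, sub2)): verdict for sub1, sub2, verdict in rows}


def lookup_label(lookup, id1, id2):
    """Return the verdict for the pair, or None when the pair is unlabeled."""
    return lookup.get(frozenset((id1, id2)))


# Toy rows standing in for labels.csv (the IDs and verdicts are made up)
rows = [("a1", "b2", 1), ("c3", "d4", 0)]
lookup = build_label_lookup(rows)

print(lookup_label(lookup, "b2", "a1"))  # order of the IDs does not matter
print(lookup_label(lookup, "a1", "zz"))  # unlabeled pair
```

Returning `None` for an unlabeled pair leaves the decision to the caller, who can skip the pair instead of recording it as label 0.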
        

## Data Augmentation
Because the original dataset contains fewer than 1000 code pairs, the model saw only a limited variety of examples, which hurt its ability to generalize and its predictive performance. To mitigate this, we implemented a data augmentation strategy that triples the size of the original set.

The `augment_token_sequence()` function generates new synthetic versions of the token sequences by applying small, controlled modifications, which introduces variability without significantly altering the logic of the original code. The techniques are:

- Shuffle: Randomly reorders groups of tokens within a window, simulating the minor changes in code order that can appear in plagiarized work.

- Drop: Randomly removes non-critical tokens, simulating the intentional omission of fragments to disguise plagiarism.

- Duplicate: Randomly duplicates tokens to emulate redundancy deliberately introduced by the plagiarist.

- Synonym: Replaces variable names with simulated synonyms (e.g., i with index), a common change in superficial plagiarism.

These transformations enrich training by exposing the model to more realistic variations in the data, with the aim of improving its ability to identify plagiarism.

The `generate_augmented_pairs()` function collects the original pairs together with the augmented ones and assigns each its corresponding label.

In [7]:
def augment_token_sequence(tokens, augmentation_method='shuffle', ratio=0.1):
    """
    Augments a token sequence using various methods.
    
    Args:
        tokens (str): The token sequence string to be augmented.
        augmentation_method (str): The method to use for augmentation.
            'shuffle': Randomly shuffles some tokens within a window.
            'drop': Randomly drops some tokens.
            'duplicate': Duplicates some tokens.
            'synonym': Replaces some tokens with synonyms (simulated).
        ratio (float): The percentage of tokens to modify (0.0 to 1.0).
    
    Returns:
        str: The augmented token sequence as a string.
    """
    token_list = tokens.split()
    total_tokens = len(token_list)
    num_to_modify = max(1, int(total_tokens * ratio))
    
    augmented_tokens = token_list.copy()
    
    if augmentation_method == 'shuffle':
        for _ in range(num_to_modify // 2): 
            window_size = random.randint(2, 4) 
            if total_tokens <= window_size:
                continue
                
            start_idx = random.randint(0, total_tokens - window_size)
            window = augmented_tokens[start_idx:start_idx + window_size]
            random.shuffle(window)
            augmented_tokens[start_idx:start_idx + window_size] = window
    
    elif augmentation_method == 'drop':
        critical_tokens = {'_TokenType:public', '_TokenType:class', '_TokenType:import', '_TokenType:static', '_TokenType:void', '_TokenType:main'}
        # Candidate positions: any token that is not structurally critical
        valid_indices = [i for i, token in enumerate(augmented_tokens)
                         if token not in critical_tokens]
        # Pick up to num_to_modify distinct positions to remove
        indices_to_drop = set(random.sample(valid_indices, min(num_to_modify, len(valid_indices))))
        
        augmented_tokens = [token for i, token in enumerate(augmented_tokens) if i not in indices_to_drop]
    
    elif augmentation_method == 'duplicate':
        for _ in range(num_to_modify):
            if not augmented_tokens:
                break
            idx = random.randint(0, len(augmented_tokens) - 1)
            augmented_tokens.insert(idx, augmented_tokens[idx])
    
    elif augmentation_method == 'synonym':
        synonym_mapping = {
            '_TokenType:i': ['_TokenType:idx', '_TokenType:index', '_TokenType:i'],
            '_TokenType:j': ['_TokenType:jdx', '_TokenType:j'],
            '_TokenType:k': ['_TokenType:kdx', '_TokenType:key', '_TokenType:k'],
            '_TokenType:n': ['_TokenType:num', '_TokenType:size', '_TokenType:n'],
            '_TokenType:m': ['_TokenType:max', '_TokenType:m'],
            '_TokenType:x': ['_TokenType:xVal', '_TokenType:x'],
            '_TokenType:y': ['_TokenType:yVal', '_TokenType:y']
        }
        
        for _ in range(num_to_modify):
            # Match whole tokens only: a prefix test would also hit keywords such
            # as '_TokenType:int' or '_TokenType:new'
            variable_indices = [i for i, token in enumerate(augmented_tokens)
                               if token in synonym_mapping]
            if not variable_indices:
                break
                
            idx = random.choice(variable_indices)
            augmented_tokens[idx] = random.choice(synonym_mapping[augmented_tokens[idx]])
    
    return ' '.join(augmented_tokens)
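
For the synonym step it matters whether a token is compared exactly or by prefix, because many Java keywords start with the same letter as a short variable name. The snippet below is a small illustration using token strings in the notebook's `_TokenType:` format; the sample list is made up.

```python
prefix = "_TokenType:i"
tokens = ["_TokenType:i", "_TokenType:int", "_TokenType:import", "_TokenType:x"]

# Prefix test: also selects the keywords 'int' and 'import'
prefix_hits = [t for t in tokens if t.startswith(prefix)]

# Exact test: selects only the variable named 'i'
exact_hits = [t for t in tokens if t == prefix]

print(prefix_hits)
print(exact_hits)
```

With a prefix test, a keyword such as `int` could be rewritten to `idx`, which changes the token stream far more than a variable rename would.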

In [8]:
def generate_augmented_pairs(token_pairs, labels, augmentation_factor=2):
    """
    Generates augmented token pairs and corresponding labels.
    
    Args:
        token_pairs (list): Original token pairs.
        labels (list): Original labels.
        augmentation_factor (int): Number of augmentations per original sample.
        
    Returns:
        tuple: (augmented_pairs, augmented_labels)
    """
    augmentation_methods = ['shuffle', 'drop', 'duplicate', 'synonym']
    augmented_pairs = []
    augmented_labels = []
    
    for pair, label in zip(token_pairs, labels):
        augmented_pairs.append(pair)
        augmented_labels.append(label)
        
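        # NOTE: split(' ', 1) cuts after the first space, so token_seq1 is only
        # the first token and token_seq2 is everything else. The boundary between
        # the two submissions is not stored in the pair string.
        parts = pair.split(' ', 1)
        if len(parts) < 2:
            continue  
        
        token_seq1, token_seq2 = parts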
        
        for _ in range(augmentation_factor):
            method1 = random.choice(augmentation_methods)
            method2 = random.choice(augmentation_methods)
            
            aug_seq1 = augment_token_sequence(token_seq1, method1, ratio=0.1)
            aug_seq2 = augment_token_sequence(token_seq2, method2, ratio=0.1)
            
            augmented_pair = f"{aug_seq1} {aug_seq2}"
            
            augmented_pairs.append(augmented_pair)
            augmented_labels.append(label)
    
    return augmented_pairs, augmented_labels
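
Augmenting each submission separately requires knowing where the first token sequence ends and the second begins. A possible approach, sketched here under the assumption that a dedicated marker is inserted when the pair is built, is to join and split on an explicit separator. `PAIR_SEP`, `join_pair`, and `split_pair` are hypothetical names and are not part of the notebook.

```python
PAIR_SEP = "<PAIR_SEP>"  # hypothetical marker; must never occur as a real token


def join_pair(tokens1, tokens2):
    """Join two token strings with an explicit boundary marker."""
    return f"{tokens1} {PAIR_SEP} {tokens2}"


def split_pair(pair):
    """Recover the two token strings from a joined pair."""
    left, sep, right = pair.partition(f" {PAIR_SEP} ")
    if not sep:
        raise ValueError("pair string has no separator")
    return left, right


pair = join_pair("_TokenType:int _TokenType:a", "_TokenType:int _TokenType:b")

print(split_pair(pair))          # both sequences recovered intact
print(tuple(pair.split(" ", 1)))  # splitting on the first space isolates one token
```

If such a marker were used, it would also reach the vectorizer as part of the text, so it would have to be removed or handled deliberately before feature extraction.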

In [9]:
print(f"Original dataset size: {len(token_pairs)}")
augmented_token_pairs, augmented_labels = generate_augmented_pairs(token_pairs, labels, augmentation_factor=2)
print(f"Augmented dataset size: {len(augmented_token_pairs)}")

original_unique, original_counts = np.unique(labels, return_counts=True)
augmented_unique, augmented_counts = np.unique(augmented_labels, return_counts=True)

print(f"\nOriginal class distribution:")
for label, count in zip(original_unique, original_counts):
    print(f"Label {label}: {count} samples ({count/len(labels)*100:.2f}%)")

print(f"\nAugmented class distribution:")
for label, count in zip(augmented_unique, augmented_counts):
    print(f"Label {label}: {count} samples ({count/len(augmented_labels)*100:.2f}%)")

Original dataset size: 911
Augmented dataset size: 2733

Original class distribution:
Label 0: 660 samples (72.45%)
Label 1: 251 samples (27.55%)

Augmented class distribution:
Label 0: 1980 samples (72.45%)
Label 1: 753 samples (27.55%)
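
The class distribution above gives a useful reference point for the 80% accuracy target. This short calculation, using only the counts printed in the output, shows the accuracy of a trivial classifier that always predicts the majority class:

```python
counts = {0: 660, 1: 251}  # original class distribution printed above

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Majority label: {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Because augmentation preserves the class proportions, the same baseline applies to the augmented set. A model's accuracy is informative only to the extent that it exceeds this value.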


## Cosine similarity

Cosine similarity measures how similar two non-zero vectors are by taking the cosine of the angle between them. In plagiarism detection it is used to compare the vector representations of two documents (here, source code files).

The following function computes the cosine similarity between two vectors, inspired by the implementation in act43.py:

In [10]:
def calculate_cosine_similarity(vec1, vec2):
    """
    Calculates cosine similarity between two vectors.
    
    Args:
        vec1, vec2 (numpy.array): The vectors to compare
        
    Returns:
        float: Cosine similarity score (between 0 and 1 for non-negative vectors such as TF-IDF)
    """
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    
    if norm_vec1 == 0 or norm_vec2 == 0:
        return 0.0
    
    similarity = np.dot(vec1, vec2) / (norm_vec1 * norm_vec2)
    return similarity

vec1 = np.array([1, 0, 1, 1])
vec2 = np.array([1, 1, 0, 1])
print(f"Cosine similarity between test vectors: {calculate_cosine_similarity(vec1, vec2):.4f}")

Cosine similarity between test vectors: 0.6667
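
The printed value can be checked by hand. The following plain-Python version of the same calculation, written without NumPy purely for illustration, spells out each intermediate quantity for the two test vectors:

```python
import math

vec1 = [1, 0, 1, 1]
vec2 = [1, 1, 0, 1]

dot = sum(a * b for a, b in zip(vec1, vec2))   # 1 + 0 + 0 + 1 = 2
norm1 = math.sqrt(sum(a * a for a in vec1))    # sqrt(3)
norm2 = math.sqrt(sum(b * b for b in vec2))    # sqrt(3)

similarity = dot / (norm1 * norm2)             # 2 / 3
print(f"{similarity:.4f}")
```

The two vectors share two of their three non-zero positions, which gives 2/3.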


## Manual TF-IDF implementation

Next, we implement a function that computes TF-IDF manually, similar to the implementation in act43.py. This lets us compare the results with the automated scikit-learn implementation used in our main model.

In [11]:
def compute_manual_tfidf(code1, code2):
    """
    Computes TF-IDF vectors for two code samples manually, similar to act43.py approach.
    
    Args:
        code1, code2 (str): Java source code strings
        
    Returns:
        dict: Dictionary with TF-IDF vectors for both code samples
    """
    tokens1 = extract_tokens(code1).split()
    tokens2 = extract_tokens(code2).split()
    
    unique_tokens = list(set(tokens1 + tokens2))
    
    counts1 = Counter(tokens1)
    counts2 = Counter(tokens2)
    
    tfidf_vec1 = np.zeros(len(unique_tokens))
    tfidf_vec2 = np.zeros(len(unique_tokens))
    
    for i, token in enumerate(unique_tokens):
        tf1 = counts1[token] / max(len(tokens1), 1)
        tf2 = counts2[token] / max(len(tokens2), 1)
        
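        # Number of the two documents that contain the token (1 or 2)
        doc_count = (counts1[token] > 0) + (counts2[token] > 0)
        
        # IDF over a two-document corpus with a +1 offset: 1 for shared tokens,
        # ln(2) + 1 for tokens that appear in only one document
        idf = np.log(2 / doc_count) + 1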
        
        tfidf_vec1[i] = tf1 * idf
        tfidf_vec2[i] = tf2 * idf
    
    result = {
        'q1_vec': tfidf_vec1,
        'q2_vec': tfidf_vec2,
        'tokens': unique_tokens
    }
    
    return result
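
With only two documents, the IDF term in `compute_manual_tfidf()` can take just two values. The sketch below reproduces that formula in plain Python on made-up token lists to show how shared and unshared tokens are weighted; the helper name `idf` is introduced here for illustration.

```python
import math
from collections import Counter

# Toy token lists (not taken from the dataset)
tokens1 = ["int", "i", "=", "for"]
tokens2 = ["int", "j", "="]

counts1, counts2 = Counter(tokens1), Counter(tokens2)


def idf(token):
    # Number of the two documents containing the token (1 or 2)
    doc_count = (counts1[token] > 0) + (counts2[token] > 0)
    return math.log(2 / doc_count) + 1


tf_for = counts1["for"] / len(tokens1)  # term frequency of 'for' in document 1

print(f"idf of shared token 'int': {idf('int'):.4f}")
print(f"idf of 'for' (document 1 only): {idf('for'):.4f}")
print(f"tf-idf of 'for' in document 1: {tf_for * idf('for'):.4f}")
```

Tokens unique to one document receive the higher weight, but they contribute nothing to the dot product, so the cosine similarity is driven entirely by shared tokens.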

In [12]:
def compare_code_similarity_tfidf(id1, id2, code1, code2):
    """
    Compares two code samples using manual TF-IDF implementation and cosine similarity.
    
    Args:
        id1, id2 (str): Identifiers for the code samples
        code1, code2 (str): Java source code strings
        
    Returns:
        float: Cosine similarity score between the TF-IDF vectors
    """
    tfidf_dict = compute_manual_tfidf(code1, code2)
    
    similarity = calculate_cosine_similarity(tfidf_dict['q1_vec'], tfidf_dict['q2_vec'])
    
    print(f"Similarity between {id1} and {id2}: {similarity:.4f}")
    
    if len(tfidf_dict['tokens']) > 0 and similarity > 0.5:
        contribution = tfidf_dict['q1_vec'] * tfidf_dict['q2_vec']
        top_indices = np.argsort(contribution)[-10:]  # slicing also handles fewer than 10 tokens
        top_tokens = [(tfidf_dict['tokens'][i], contribution[i]) for i in top_indices]
        print("Top contributing tokens:")
        for token, score in reversed(top_tokens):
            if score > 0:
                print(f"{token}: {score:.4f}")
    
    return similarity

In [13]:
sample_size = min(5, len(java_files_data) // 2)
print(f"Testing with {sample_size} sample pairs...\n")

for i in range(0, sample_size*2, 2):
    try:
        id1, code1 = java_files_data[i]
        id2, code2 = java_files_data[i+1]
        
        print(f"\nPair {i//2 + 1}:")
        similarity = compare_code_similarity_tfidf(id1, id2, code1, code2)
        
        if (id1, id2) in labels_dict:
            verdict = labels_dict[(id1, id2)]
            print(f"Actual verdict: {'Plagiarism' if verdict == 1 else 'No plagiarism'}")
        elif (id2, id1) in labels_dict:
            verdict = labels_dict[(id2, id1)]
            print(f"Actual verdict: {'Plagiarism' if verdict == 1 else 'No plagiarism'}")
        else:
            print("No verdict available for this pair")
    except IndexError:
        break

Testing with 5 sample pairs...


Pair 1:
Similarity between 0017d438 and 9852706b: 0.6414
Top contributing tokens:
_TokenType:=: 0.0095
_TokenType:int: 0.0059
_TokenType:i: 0.0017
_TokenType:<: 0.0016
_TokenType:+: 0.0015
_TokenType:>: 0.0011
_TokenType:nextInt: 0.0009
_TokenType:import: 0.0009
_TokenType:new: 0.0007
_TokenType:static: 0.0006
Actual verdict: Plagiarism

Pair 2:
Similarity between 0017d438 and ac180326: 0.6045
Top contributing tokens:
_TokenType:=: 0.0073
_TokenType:int: 0.0035
_TokenType:+: 0.0026
_TokenType:i: 0.0021
_TokenType:<: 0.0015
_TokenType:]: 0.0014
_TokenType:[: 0.0014
_TokenType:import: 0.0013
_TokenType:new: 0.0005
_TokenType:Pair: 0.0004
Actual verdict: No plagiarism

Pair 3:
Similarity between 0048a372 and 0adb1ee5: 0.7518
Top contributing tokens:
_TokenType:i: 0.0089
_TokenType:=: 0.0083
_TokenType:]: 0.0043
_TokenType:[: 0.0043
_TokenType:+: 0.0022
_TokenType:int: 0.0017
_TokenType:long: 0.0009
_TokenType:n: 0.0007
_TokenType:<: 0.0006
_TokenType:k: 0.

## TF-IDF implementation with scikit-learn

Next, we implement cosine similarity using TF-IDF from scikit-learn, to compare it with our earlier manual implementation based on act43.py and with the machine learning model.

In [None]:
def get_sklearn_tfidf_similarity(code1, code2):
    """
    Calculate TF-IDF based similarity between two code samples using scikit-learn.
    
    Args:
        code1, code2 (str): Source code strings
        
    Returns:
        float: Similarity score
        dict: Dictionary with TF-IDF vectors and feature names
    """
    # Preprocess code
    processed_code1 = extract_tokens(code1)
    processed_code2 = extract_tokens(code2)
    
    # Create corpus
    corpus = [processed_code1, processed_code2]
    
    # Initialize and fit TF-IDF vectorizer
    vectorizer = TfidfVectorizer(lowercase=True)
    tfidf_matrix = vectorizer.fit_transform(corpus)
    
    # Get feature names
    feature_names = vectorizer.get_feature_names_out()
    
    # Convert sparse matrix to dense arrays
    vec1 = tfidf_matrix[0].toarray().flatten()
    vec2 = tfidf_matrix[1].toarray().flatten()
    
    # Calculate similarity (1 - cosine distance)
    if np.all(vec1 == 0) or np.all(vec2 == 0):
        similarity = 0.0
    else:
        similarity = 1 - cosine(vec1, vec2)
    
    # Prepare return dictionary
    result = {
        'q1_vec': vec1,
        'q2_vec': vec2,
        'tokens': feature_names,
        'similarity': similarity
    }
    
    return similarity, result
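
One detail worth checking is how `TfidfVectorizer` splits the token string. Its documented default `token_pattern` keeps only runs of two or more word characters, applied after lowercasing. The snippet below applies that regular expression directly with `re` to a sample string in the notebook's token format, alongside a whitespace-based alternative. It illustrates the pattern only and is not a call into scikit-learn.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"   # scikit-learn's documented default token_pattern
WHITESPACE_PATTERN = r"\S+"          # alternative: every whitespace-separated token

# Lowercased, as the vectorizer does with lowercase=True
sample = "_TokenType:int _TokenType:i _TokenType:= _TokenType:nextInt".lower()

default_tokens = re.findall(DEFAULT_PATTERN, sample)
whitespace_tokens = re.findall(WHITESPACE_PATTERN, sample)

print(default_tokens)
print(whitespace_tokens)
```

Under the default pattern the shared prefix becomes a feature of its own, while operators and one-letter names disappear. Passing `token_pattern=r"\S+"` (and possibly `lowercase=False`) to `TfidfVectorizer` would keep each token intact, which may bring the scores closer to the manual implementation. That remains an assumption to verify against the data.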

In [None]:
def get_top_contributing_tokens(tfidf_result, top_n=10):
    """
    Find the tokens that contribute most to the similarity score
    
    Args:
        tfidf_result (dict): Dictionary with TF-IDF vectors and feature names
        top_n (int): Number of top tokens to return
        
    Returns:
        list: List of (token, contribution_score) tuples
    """
    tokens = tfidf_result['tokens']
    vec1 = tfidf_result['q1_vec']
    vec2 = tfidf_result['q2_vec']
    
    # Calculate contribution as product of corresponding values in both vectors
    contribution = vec1 * vec2
    
    # Get indices of top contributors
    top_indices = contribution.argsort()[-top_n:][::-1]
    
    # Return token and contribution pairs
    top_tokens = [(tokens[i], contribution[i]) for i in top_indices if contribution[i] > 0]
    
    return top_tokens

## Comparing the manual and scikit-learn implementations

We now compare the manual TF-IDF implementation (based on act43.py) with the scikit-learn implementation on the same code pairs.

In [None]:
def compare_implementations(id1, id2, code1, code2):
    """
    Compare manual and scikit-learn TF-IDF implementations for code similarity.
    
    Args:
        id1, id2 (str): Identifiers for the code samples
        code1, code2 (str): Java source code strings
        
    Returns:
        dict: Manual similarity, scikit-learn similarity, and verdict (None if unlabeled)
    """
    print(f"Comparing implementations for {id1} and {id2}")
    print("=" * 60)
    
    # Manual implementation (act43.py style)
    manual_tfidf_dict = compute_manual_tfidf(code1, code2)
    manual_similarity = calculate_cosine_similarity(manual_tfidf_dict['q1_vec'], manual_tfidf_dict['q2_vec'])
    print(f"Manual TF-IDF similarity: {manual_similarity:.4f}")
    
    # scikit-learn implementation
    sklearn_similarity, sklearn_result = get_sklearn_tfidf_similarity(code1, code2)
    print(f"scikit-learn TF-IDF similarity: {sklearn_similarity:.4f}")
    
    # Get top contributing tokens from scikit-learn implementation
    top_tokens = get_top_contributing_tokens(sklearn_result)
    
    # Check if we have labels for this pair
    if (id1, id2) in labels_dict:
        verdict = labels_dict[(id1, id2)]
        print(f"Actual verdict: {'Plagiarism' if verdict == 1 else 'No plagiarism'}")
    elif (id2, id1) in labels_dict:
        verdict = labels_dict[(id2, id1)]
        print(f"Actual verdict: {'Plagiarism' if verdict == 1 else 'No plagiarism'}")
    else:
        print("No verdict available for this pair")
        verdict = None
    
    # Print top contributing tokens
    if len(top_tokens) > 0:
        print("\nTop tokens contributing to similarity (scikit-learn):")
        for token, score in top_tokens:
            print(f"  {token}: {score:.6f}")
    
    return {
        'manual_similarity': manual_similarity,
        'sklearn_similarity': sklearn_similarity,
        'verdict': verdict
    }

In [None]:
sample_size = min(5, len(java_files_data) // 2)
print(f"Comparing implementations on {sample_size} sample pairs...\n")

comparison_results = []

for i in range(0, sample_size*2, 2):
    try:
        id1, code1 = java_files_data[i]
        id2, code2 = java_files_data[i+1]
        
        print(f"\nPair {i//2 + 1}:")
        result = compare_implementations(id1, id2, code1, code2)
        comparison_results.append(result)
        
    except IndexError:
        break
        
# Convert results to DataFrame for analysis
comparison_df = pd.DataFrame(comparison_results)
print("\nSummary of comparison results:")
print(comparison_df)

## Comparing scikit-learn TF-IDF with the ML model

We now compare the TF-IDF cosine similarity results from scikit-learn with the predictions of our machine learning model to see which approach performs better.

In [None]:
def compare_tfidf_with_model(model, test_size=10):
    """
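    Compare TF-IDF similarity with ML model predictions.
    
    Relies on a global `vectorizer`: the TfidfVectorizer fitted on the
    training data, which is not defined in the cells shown here.
    
    Args:
        model: Trained ML model
        test_size (int): Number of pairs to test (at most the number of available pairs)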
    """
    # Use a subset of data for testing
    test_indices = np.random.choice(len(java_files_data)//2, test_size, replace=False)
    
    results = []
    
    for idx in test_indices:
        try:
            id1, code1 = java_files_data[idx*2]
            id2, code2 = java_files_data[idx*2 + 1]
            
            # Calculate TF-IDF similarity
            tfidf_similarity, _ = get_sklearn_tfidf_similarity(code1, code2)
            
            # Process for model input
            t1 = extract_tokens(code1)
            t2 = extract_tokens(code2)
            tokens_pair = f"{t1} {t2}"
            
            # Transform using the same vectorizer
            pair_vector = vectorizer.transform([tokens_pair]).toarray()
            
            # Get model prediction
            model_prediction = model.predict(pair_vector)[0]
            model_prob = model.predict_proba(pair_vector)[0][1] if hasattr(model, 'predict_proba') else None
            
            # Get actual label if available
            if (id1, id2) in labels_dict:
                actual = labels_dict[(id1, id2)]
            elif (id2, id1) in labels_dict:
                actual = labels_dict[(id2, id1)]
            else:
                actual = None
                
            results.append({
                'pair_id': f"{id1}_{id2}",
                'tfidf_similarity': tfidf_similarity,
                'model_prediction': model_prediction,
                'model_probability': model_prob,
                'actual': actual
            })
            
        except Exception as e:
            print(f"Error processing pair at index {idx}: {e}")
    
    return pd.DataFrame(results)

In [None]:
# Choose the best model for comparison (Random Forest in this example).
# rf_model is assumed to have been trained in an earlier cell that is not shown here.
best_model = rf_model

# Compare with TF-IDF
comparison_results = compare_tfidf_with_model(best_model, test_size=15)
print(comparison_results)

# Calculate accuracy of TF-IDF with threshold 0.7
valid_results = comparison_results.dropna(subset=['actual'])
if not valid_results.empty:
    tfidf_predictions = (valid_results['tfidf_similarity'] > 0.7).astype(int)
    tfidf_accuracy = (tfidf_predictions == valid_results['actual']).mean()
    model_accuracy = (valid_results['model_prediction'] == valid_results['actual']).mean()
    
    print(f"\nTF-IDF accuracy with threshold 0.7: {tfidf_accuracy:.4f}")
    print(f"Model accuracy: {model_accuracy:.4f}")
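
The 0.7 cutoff used above is one possible choice. As a rough sketch of how the accuracy of the thresholding rule depends on that choice, the helper below (a hypothetical `threshold_accuracy`, evaluated on toy values that are not results from the dataset) scores the rule at several thresholds:

```python
def threshold_accuracy(similarities, labels, threshold):
    """Accuracy of the rule 'similarity > threshold means plagiarism'."""
    predictions = [int(s > threshold) for s in similarities]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy values for illustration only; they are not results from the dataset
similarities = [0.91, 0.64, 0.60, 0.75, 0.82, 0.55]
labels = [1, 1, 0, 0, 1, 0]

for threshold in (0.5, 0.6, 0.7, 0.8):
    accuracy = threshold_accuracy(similarities, labels, threshold)
    print(f"threshold={threshold:.1f} accuracy={accuracy:.4f}")
```

In practice the threshold would be selected on training or validation pairs and then evaluated on held-out pairs, so that the comparison with the model stays fair.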

## Visualizing the results

We plot the TF-IDF similarity against the model's predictions to better understand how the two approaches relate.

In [None]:
plt.figure(figsize=(10, 6))

# Filter for rows with actual labels
plot_df = comparison_results.dropna(subset=['actual'])

if not plot_df.empty:
    # Use probabilities when the model provides them; otherwise fall back to hard predictions.
    # The column always exists, so check its values rather than its presence.
    has_probability = plot_df['model_probability'].notna().all()
    y_values = plot_df['model_probability'] if has_probability else plot_df['model_prediction']
    
    # Plot points colored by actual label
    scatter = plt.scatter(plot_df['tfidf_similarity'], 
              y_values.astype(float),
              c=plot_df['actual'], 
              cmap='coolwarm', 
              s=100,
              alpha=0.7)
    
    plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
    plt.axvline(x=0.7, color='gray', linestyle='--', alpha=0.5)
    
    plt.xlabel('TF-IDF Cosine Similarity')
    plt.ylabel('Model Probability of Plagiarism' if has_probability else 'Model Prediction')
    plt.title('Comparison of TF-IDF Similarity vs Model Predictions')
    
    # Add a legend
    legend1 = plt.legend(*scatter.legend_elements(),
                        title="Actual Label")
    plt.gca().add_artist(legend1)
    
    # Annotate the quadrants (agreement between the two methods, not correctness)
    plt.text(0.35, 0.75, 'Model: Yes\nTF-IDF: No', ha='center', va='center', alpha=0.7)
    plt.text(0.85, 0.75, 'Both: Yes', ha='center', va='center', alpha=0.7)
    plt.text(0.35, 0.25, 'Both: No', ha='center', va='center', alpha=0.7)
    plt.text(0.85, 0.25, 'Model: No\nTF-IDF: Yes', ha='center', va='center', alpha=0.7)
    
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("No labeled data available for visualization")

In [None]:
# voting_model_final_refined.fit(X_train_resampled, y_train_resampled)
# final_result = evaluate_model('Ensemble Final', voting_model_final_refined, X_test, y_test)

# print(f"Accuracy: {final_result['accuracy']:.4f}")
# print(f"Precision: {final_result['precision']:.4f}")
# print(f"Recall: {final_result['recall']:.4f}")
# print(f"F1-Score: {final_result['f1']:.4f}")
# print(f"MCC: {final_result['mcc']:.4f}")