# Decision Tree Classifier for a plagiarism detector

## Building the dataframe

In [1]:
# Load modules and packages
import pandas as pd
import numpy as np

df = pd.read_csv("../data/subset.csv")
df.columns = ["src", "sus", "jaccard", "containment", "dep", "plagiarized"]

# Purge the dataframe of errors introduced when writing the CSV
df = df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

print(f"Plagiarized = {len(df[df['plagiarized'] == 1])}")
print(f"Non-plagiarized = {len(df[df['plagiarized'] == 0])}")

Plagiarized = 683
Non-plagiarized = 330567


# Training a tree classifier

In [27]:
from bokeh.io import output_notebook, show
from bokeh.layouts import gridplot
from bokeh.plotting import figure

# Tell bokeh where to output
output_notebook()

There is a problem with the data: by the very nature of the task, there are far more original documents than plagiarized ones (330567 versus 683 here). If we do not account for this, the classifier, which seeks the highest possible accuracy, will take the easy route and always predict that a document is not plagiarized. Since this is a proof of concept, we simply trim the sample so that the training and validation data contain as many original documents as plagiarized ones.

Another solution worth exploring is assigning weights to the data.
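To make the weighting alternative concrete: scikit-learn's `class_weight="balanced"` option weights each class by `n_samples / (n_classes * class_count)`. The sketch below computes those weights by hand for the class counts printed above. The helper name `balanced_class_weights` is illustrative and not part of the notebook.

```python
def balanced_class_weights(class_counts):
    """Return the 'balanced' weight for each class label.

    Uses the heuristic n_samples / (n_classes * class_count), so that
    rare classes receive proportionally larger weights.
    """
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Class counts taken from the output of the first cell
counts = {0: 330567, 1: 683}
weights = balanced_class_weights(counts)

# With these weights both classes carry the same total weight (count * weight)
total_weight = {label: counts[label] * weights[label] for label in counts}
print(weights)
print(total_weight)
```

A dictionary like this could be passed as `DecisionTreeClassifier(class_weight=weights)`, or `class_weight="balanced"` could be used directly. Neither option was tried in this notebook.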

In [3]:
# Balance the dataframe by undersampling; we naively assume both classes should be equally represented
n = len(df[df['plagiarized'] == 1])
data = pd.concat([df[df['plagiarized'] == 1], df[df['plagiarized'] == 0].sample(n = n)])
data.reset_index(inplace = True)

# Select data to model
features = ["jaccard", "containment", "dep"]
X = data[features]
y = data["plagiarized"]

Next we check how well a baseline classifier performs on our data. To measure this we use the F1 score, since it balances precision and recall and gives a good idea of how well each type of document is classified. We evaluate it with K-fold cross-validation, because we do not have enough data to split it directly into separate test and validation sets.
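As a reminder of what the F1 score measures, here is a minimal hand-written sketch of precision, recall and F1 for one class. The labels are made up for illustration, and the helper `precision_recall_f1` is not part of the notebook, which relies on scikit-learn's `precision_recall_fscore_support` instead.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy, made-up labels: 1 = plagiarized, 0 = non-plagiarized
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

scores_plagiarized = precision_recall_f1(y_true, y_pred, positive=1)
scores_original = precision_recall_f1(y_true, y_pred, positive=0)

# A classifier that always answers "not plagiarized" scores F1 = 0 on class 1
always_zero = precision_recall_f1(y_true, [0] * len(y_true), positive=1)

print(scores_plagiarized, scores_original, always_zero)
```

The last case shows why F1 per class is more informative than accuracy here: always predicting "not plagiarized" is right for 5 of these 8 toy documents, yet it never detects a single plagiarized one.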

In [221]:
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import KFold

def eval_model_table(X, y, clf):
    # Make folds of data for cross validation
    k_fold = KFold(n_splits = 10, shuffle = True)

    # Per-fold arrays of shape (4, 2): rows are precision, recall, F1 and support;
    # columns follow `labels` (0 = non-plagiarized, 1 = plagiarized)
    results = []
    for train, test in k_fold.split(X):
        clf.fit(X.iloc[train], y.iloc[train])
        sc = score(y.iloc[test], clf.predict(X.iloc[test]), average = None, labels = [0, 1])
        results.append(np.stack(sc, axis = 0))

    avg_results = sum(results) / len(results)
    # Spread across folds. Note: this divides by n * sqrt(n) rather than n, so the
    # "Std." columns are the population standard deviation scaled by n ** -0.25
    std_results = [np.abs(x - avg_results)**2 for x in results]
    std_results = np.sqrt(sum(std_results) / (len(std_results) * np.sqrt(len(std_results))))
    
    # Assemble averages and spreads into a single table
    scores = pd.DataFrame(np.concatenate((avg_results, std_results), axis = 1))
    scores.columns = ["Avg. non-plagiarized", "Avg. plagiarized", "Std. non-plagiarized", "Std. plagiarized"]
    scores.insert(0, "Score type", ["precision", "recall", "F1", "support"])
    scores.set_index("Score type", inplace = True)
    
    return scores

# Init classifier, maybe SV could be useful?
clf = DecisionTreeClassifier()
eval_model_table(X, y, clf)

| Score type | Avg. non-plagiarized | Avg. plagiarized | Std. non-plagiarized | Std. plagiarized |
|---|---|---|---|---|
| precision | 0.666925 | 0.668296 | 0.024214 | 0.017017 |
| recall | 0.664029 | 0.670102 | 0.031301 | 0.01735 |
| F1 | 0.66489 | 0.668527 | 0.025375 | 0.012329 |
| support | 68.3 | 68.3 | 2.565287 | 2.477495 |


As the table shows, the scores sit at around 66%. Not bad for a first test, but let us see whether we can tune the model.
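The table reports an average and a spread across the ten folds. As a point of comparison, here is a minimal sketch of both quantities using only the standard library. The per-fold F1 values are invented for illustration. The sketch also reproduces the normalisation used in `eval_model_table`, which divides the summed squared deviations by `n * sqrt(n)` and therefore yields the population standard deviation scaled by `n ** -0.25`.

```python
import math
import statistics

# Hypothetical per-fold F1 scores (made-up numbers, not notebook output)
fold_f1 = [0.64, 0.70, 0.66, 0.69, 0.63, 0.68, 0.65, 0.71, 0.62, 0.67]
n = len(fold_f1)

mean_f1 = statistics.mean(fold_f1)
# Population standard deviation: sqrt(sum((x - mean)^2) / n)
std_f1 = statistics.pstdev(fold_f1)

# Normalisation used in eval_model_table: sqrt(sum((x - mean)^2) / (n * sqrt(n)))
squared_dev = sum((x - mean_f1) ** 2 for x in fold_f1)
table_spread = math.sqrt(squared_dev / (n * math.sqrt(n)))

print(f"mean = {mean_f1:.4f}, std = {std_f1:.4f}, table spread = {table_spread:.4f}")
```

With ten folds the scaling factor is about 0.56, so the "Std." columns should be read as a relative indication of spread rather than as a textbook standard deviation.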

## Optimizing the maximum tree depth

One of the parameters we can change is the maximum depth of the tree. It is important not to choose too small a value, because the tree will not be able to fit the data properly, nor too large a value, because the tree would overfit and the model would lose generality. To see how this parameter affects F1, we plot the score against the depth and look for maxima or trends.

In [223]:
# Let's try to optimize max_depth
from sklearn.metrics import f1_score
from statsmodels.nonparametric.smoothers_lowess import lowess


def calc_F1_depth(depth):
    clf = DecisionTreeClassifier(max_depth = depth)

    # Make folds of data for cross validation
    k_fold = KFold(n_splits = 10, shuffle = True)

    results = []
    for train, test in k_fold.split(X):
        clf.fit(X.iloc[train], y.iloc[train])
        results.append(f1_score(y.iloc[test], clf.predict(X.iloc[test]), average = None))
    
    return sum(results) / len(results)
        
depth = np.arange(1, 100, 1)
F1 = np.stack(list(map(calc_F1_depth, depth.tolist())), axis = 0)

depth_plot = figure(title = "F1 score as a function of the maximum depth of the tree",
                   x_axis_label = "Depth",
                   y_axis_label = "F1",
                   sizing_mode = "stretch_width")

depth_plot.line(depth, F1[:, 0], legend_label = "Non-plagiarized", line_color = "steelblue")
depth_plot.line(depth, F1[:, 1], legend_label = "Plagiarized", line_color = "orange")
show(depth_plot)


# Past some point, changing max_depth no longer matters: the scores just fluctuate around the mean
# It doesn't seem worth specifying it

As we can see, although the result varies considerably locally, past a certain depth it oscillates around the same value. We will let the classifier use as many levels as it needs.

## Optimizing the class proportion

As noted earlier, our data is not balanced. Next we study the impact of having more or fewer non-plagiarized cases relative to plagiarized ones.
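The sampling scheme behind this experiment can be sketched without pandas: keep every plagiarized row and draw `round(ratio * n)` non-plagiarized rows at random. The list-of-dicts representation, the helper names and the fixed seed are assumptions made so that the sketch is self-contained and reproducible. The notebook itself samples with `DataFrame.sample` and no seed.

```python
import random


def undersample(rows, ratio, seed=0):
    """Keep every plagiarized row plus round(ratio * n) random non-plagiarized rows.

    `rows` is a list of dicts with a "plagiarized" key (1 or 0); n is the
    number of plagiarized rows.
    """
    positives = [r for r in rows if r["plagiarized"] == 1]
    negatives = [r for r in rows if r["plagiarized"] == 0]
    m = round(ratio * len(positives))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return positives + rng.sample(negatives, m)


def count(sample, label):
    """Number of rows in `sample` carrying the given label."""
    return sum(1 for r in sample if r["plagiarized"] == label)


# Made-up dataset: 10 plagiarized rows and 200 non-plagiarized rows
rows = [{"id": i, "plagiarized": 1} for i in range(10)]
rows += [{"id": 10 + i, "plagiarized": 0} for i in range(200)]

balanced = undersample(rows, ratio=1.0)
skewed = undersample(rows, ratio=3.0)

print(count(balanced, 1), count(balanced, 0))
print(count(skewed, 1), count(skewed, 0))
```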

In [107]:
# Let's try to optimize by changing the proportion of the data
def calc_F1_prop(p):
    n = len(df[df['plagiarized'] == 1])
    m = round(p * n)
    data = pd.concat([df[df['plagiarized'] == 1], df[df['plagiarized'] == 0].sample(n = m)])
    data.reset_index(inplace = True)
    
    features = ["jaccard", "containment", "dep"]
    X = data[features]
    y = data["plagiarized"]
    
    clf = DecisionTreeClassifier()

    # Make folds of data for cross validation
    k_fold = KFold(n_splits = 10, shuffle = True)

    results = []
    for train, test in k_fold.split(data):
        clf.fit(X.iloc[train], y.iloc[train])
        results.append(f1_score(y.iloc[test], clf.predict(X.iloc[test]), average = None))
    
    return sum(results) / len(results)

p = np.linspace(0.8, 3, num = 50)
F1 = np.stack(list(map(calc_F1_prop, p.tolist())), axis = 0)
# Filter F1
filtered_F1 = np.column_stack((lowess(F1[:, 0], p, is_sorted=True, frac=0.2, it=0)[:, 1],
                             lowess(F1[:, 1], p, is_sorted=True, frac=0.2, it=0)[:, 1]))

# Normal plot
prop_plot = figure(title = "F1 score as a function of the proportion of texts in the data",
                   x_axis_label = "Relative proportion of non-plagiarized to plagiarized",
                   y_axis_label = "F1")
prop_plot.line(p, F1[:, 0], legend_label = "Non-plagiarized", line_color = "steelblue")
prop_plot.line(p, F1[:, 1], legend_label = "Plagiarized", line_color = "orange")


# Filtered plot to remove some noise and better appreciate tendencies
prop_filtered_plot = figure(title = "F1 score as a function of the proportion of texts in the data",
                           x_axis_label = "Relative proportion of non-plagiarized to plagiarized",
                           y_axis_label = "F1")

prop_filtered_plot.line(p, filtered_F1[:, 0], legend_label = "Non-plagiarized", line_color = "steelblue")
prop_filtered_plot.line(p, filtered_F1[:, 1], legend_label = "Plagiarized", line_color = "orange")


show(gridplot([[prop_plot, prop_filtered_plot]], sizing_mode='scale_both'))


# The optimum point for the unbalanced data is obtained when both sets have the same proportion

As we can see, the best trade-off between both classes in terms of F1 score is found at p = 1, that is, with as many plagiarized cases as non-plagiarized ones.

## Optimizing the minimum number of samples per leaf

In this section we study how the score varies with the minimum number of samples required in each new leaf for a new node to be formed.
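To illustrate the constraint, the sketch below lists which split positions of a node remain allowed for a given `min_samples_leaf`: a split is only considered if it leaves at least that many samples on each side. The helper `allowed_split_positions` is hypothetical and only mimics the rule. It is not how scikit-learn implements it.

```python
def allowed_split_positions(n_samples, min_samples_leaf):
    """Return the split positions that respect min_samples_leaf.

    A position k means the k smallest samples go left and the rest go
    right. The split is allowed only if both sides keep at least
    min_samples_leaf samples.
    """
    return [k for k in range(1, n_samples)
            if k >= min_samples_leaf and n_samples - k >= min_samples_leaf]


# A node holding 10 samples
print(allowed_split_positions(10, 1))   # every position from 1 to 9
print(allowed_split_positions(10, 4))   # only 4, 5 and 6
print(allowed_split_positions(10, 6))   # empty: the node cannot be split
```

One consequence is that a node needs at least `2 * min_samples_leaf` samples before it can be split at all, so large values quickly produce very shallow trees.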

In [226]:
def calc_F1_leaf(leafs):
    clf = DecisionTreeClassifier(min_samples_leaf = leafs)

    # Make folds of data for cross validation
    k_fold = KFold(n_splits = 10, shuffle = True)

    results = []
    for train, test in k_fold.split(X):
        clf.fit(X.iloc[train], y.iloc[train])
        results.append(f1_score(y.iloc[test], clf.predict(X.iloc[test]), average = None))
    
    return sum(results) / len(results)
        
leafs = np.arange(1, 150, 1)
F1 = np.stack(list(map(calc_F1_leaf, leafs)), axis = 0)
filtered_F1 = np.column_stack((lowess(F1[:, 0], leafs, is_sorted=True, frac=0.06, it=0)[:, 1],
                             lowess(F1[:, 1], leafs, is_sorted=True, frac=0.06, it=0)[:, 1]))
# Normal plot
min_samples_plot = figure(title = "F1 score as a function of the minimum samples per leaf",
                   x_axis_label = "Minimum samples per leaf",
                   y_axis_label = "F1")

min_samples_plot.line(leafs, F1[:, 0], legend_label = "Non-plagiarized", line_color = "steelblue")
min_samples_plot.line(leafs, F1[:, 1], legend_label = "Plagiarized", line_color = "orange")


# Filtered plot to remove some noise and better appreciate tendencies
min_samples_plot_filtered = figure(title = "F1 score as a function of the minimum samples per leaf",
                           x_axis_label = "Minimum samples per leaf",
                           y_axis_label = "F1")

min_samples_plot_filtered.line(leafs, filtered_F1[:, 0], legend_label = "Non-plagiarized", line_color = "steelblue")
min_samples_plot_filtered.line(leafs, filtered_F1[:, 1], legend_label = "Plagiarized", line_color = "orange")


show(gridplot([[min_samples_plot, min_samples_plot_filtered]], sizing_mode='scale_both'))

# The region around 60 samples seems to be a good choice

Since the raw plot is very noisy, we focus on the one smoothed with LOWESS. It shows a region where the score rises as the parameter grows from zero, while the score for non-plagiarized documents falls off quickly. Since we do not want to improve the recognition of one type of document much at the expense of the other, we take as the optimum the value at around 60 samples.
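For readers unfamiliar with this kind of smoothing, here is a simplified pure-Python sketch of the idea: for every point, fit a straight line to its nearest neighbours, weighted so that closer points count more, and evaluate the line at that point. This is not the statsmodels implementation. The neighbour count, the bandwidth choice and the name `local_linear_smooth` are assumptions, and the robustness iterations are left out (the notebook also disables them with `it=0`).

```python
def local_linear_smooth(x, y, frac):
    """Smooth y over x with tricube-weighted local linear fits."""
    n = len(x)
    k = max(3, int(frac * n))  # number of neighbours used for each local fit
    smoothed = []
    for xi in x:
        # The k points closest to xi, stored as (distance, x, y)
        nearest = sorted((abs(xj - xi), xj, yj) for xj, yj in zip(x, y))[:k]
        bandwidth = nearest[-1][0] or 1.0
        # Tricube weights: 1 at xi, falling to 0 at the edge of the window
        w = [(1 - (d / bandwidth) ** 3) ** 3 for d, _, _ in nearest]
        sw = sum(w)
        # Weighted least-squares line through the neighbours
        mx = sum(wi * xj for wi, (_, xj, _) in zip(w, nearest)) / sw
        my = sum(wi * yj for wi, (_, _, yj) in zip(w, nearest)) / sw
        sxx = sum(wi * (xj - mx) ** 2 for wi, (_, xj, _) in zip(w, nearest))
        sxy = sum(wi * (xj - mx) * (yj - my) for wi, (_, xj, yj) in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (xi - mx))
    return smoothed


# Made-up data: a straight line, and the same line with alternating +1/-1 noise
x = list(range(20))
line = [2 * v + 1 for v in x]
noisy = [l + (1 if v % 2 == 0 else -1) for v, l in zip(x, line)]

smooth_line = local_linear_smooth(x, line, frac=0.3)
smooth_noisy = local_linear_smooth(x, noisy, frac=0.3)

# Average distance to the underlying line, before and after smoothing
err_before = sum(abs(a - b) for a, b in zip(noisy, line)) / len(x)
err_after = sum(abs(a - b) for a, b in zip(smooth_noisy, line)) / len(x)
print(err_before, err_after)
```

A larger `frac` averages over more neighbours and gives a smoother curve, at the cost of flattening narrow features such as a sharp maximum.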

## Optimizing the minimum number of samples per split

Now we study how the score is affected by the minimum number of samples required to split a node into its children. In this case we express it as a fraction of the total number of samples passed to the classifier.
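According to the scikit-learn documentation, a float `min_samples_split` is treated as a fraction and converted to `ceil(min_samples_split * n_samples)`. The sketch below applies that conversion to the size of one training fold. The floor of 2 is an assumption reflecting that a node needs at least two samples to be split, and the helper name is illustrative.

```python
import math


def effective_min_samples_split(fraction, n_samples):
    """Translate a fractional min_samples_split into a sample count."""
    return max(2, math.ceil(fraction * n_samples))


# 2 * 683 balanced samples, of which roughly 9/10 are used for training in each fold
n_total = 2 * 683
n_train = n_total - math.ceil(n_total / 10)

for fraction in (0.0001, 0.12, 0.4):
    print(fraction, effective_min_samples_split(fraction, n_train))
```

So a fraction around 0.12 means that nodes with fewer than roughly 150 training samples are no longer split.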

In [103]:
def calc_F1_split(split):
    clf = DecisionTreeClassifier(min_samples_split = split)

    # Make folds of data for cross validation
    k_fold = KFold(n_splits = 10, shuffle = True)

    results = []
    for train, test in k_fold.split(X):
        clf.fit(X.iloc[train], y.iloc[train])
        results.append(f1_score(y.iloc[test], clf.predict(X.iloc[test]), average = None))
    
    return sum(results) / len(results)
        
splits = np.linspace(0.0001, 0.4, num = 100)
F1 = np.stack(list(map(calc_F1_split, splits)), axis = 0)
filtered_F1 = np.column_stack((lowess(F1[:, 0], splits, is_sorted=True, frac=0.06, it=0)[:, 1],
                               lowess(F1[:, 1], splits, is_sorted=True, frac=0.06, it=0)[:, 1]))
# Normal plot
split_plot = figure(title = "F1 score as a function of the fraction of the minimum samples required to split",
                   x_axis_label = "fraction",
                   y_axis_label = "F1")

split_plot.line(splits, F1[:, 0], legend_label = "Non-plagiarized", line_color = "steelblue")
split_plot.line(splits, F1[:, 1], legend_label = "Plagiarized", line_color = "orange")


# Filtered plot to remove some noise and better appreciate tendencies
split_plot_filtered = figure(title = "F1 score as a function of the fraction of minimum samples required to split",
                            x_axis_label = "fraction",
                            y_axis_label = "F1")

split_plot_filtered.line(splits, filtered_F1[:, 0], legend_label = "Non-plagiarized", line_color = "steelblue")
split_plot_filtered.line(splits, filtered_F1[:, 1], legend_label = "Plagiarized", line_color = "orange")


show(gridplot([[split_plot, split_plot_filtered]], sizing_mode='scale_both'))

# A small maximum can be seen before the scores split, sometimes pushing our score over 0.7

We see a slight increase for values below approximately 0.15, followed by an abrupt separation where the recognition of plagiarized documents drops.

## Optimizing the minimum impurity decrease

The impurity of a node roughly measures how homogeneous the labels in that node are. The parameter studied here controls whether a new split is created: a node is split only if the split reduces the impurity by at least the given value.
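A small worked example may help here. Assuming the default Gini criterion, the sketch below computes the impurity of a node and the weighted impurity decrease of a split, following the formula given in the scikit-learn documentation for `min_impurity_decrease`. The toy labels and the helper names are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease of splitting `parent` into `left` and `right`.

    n_total is the number of samples in the whole training set.
    """
    n_t = len(parent)
    children = (len(left) / n_t) * gini(left) + (len(right) / n_t) * gini(right)
    return (n_t / n_total) * (gini(parent) - children)


# Root node with perfectly balanced labels (1 = plagiarized, 0 = non-plagiarized)
root = [0, 0, 0, 0, 1, 1, 1, 1]

perfect = impurity_decrease(root, [0, 0, 0, 0], [1, 1, 1, 1], n_total=len(root))
mediocre = impurity_decrease(root, [0, 0, 0, 1], [0, 1, 1, 1], n_total=len(root))

print(gini(root), perfect, mediocre)
```

With two classes the Gini impurity never exceeds 0.5, so no split can achieve a decrease above 0.5. Any threshold in the upper half of a 0 to 1 scan therefore rules out every split and leaves the tree as a single leaf.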

In [106]:
def calc_F1_impurity(imp):
    clf = DecisionTreeClassifier(min_impurity_decrease = imp)

    # Make folds of data for cross validation
    k_fold = KFold(n_splits = 10, shuffle = True)

    results = []
    for train, test in k_fold.split(X):
        clf.fit(X.iloc[train], y.iloc[train])
        results.append(f1_score(y.iloc[test], clf.predict(X.iloc[test]), average = None))
    
    return sum(results) / len(results)
        
imp = np.linspace(0, 1, num = 100)
F1 = np.stack(list(map(calc_F1_impurity, imp)), axis = 0)
filtered_F1 = np.column_stack((lowess(F1[:, 0], imp, is_sorted=True, frac=0.1, it=0)[:, 1],
                               lowess(F1[:, 1], imp, is_sorted=True, frac=0.1, it=0)[:, 1]))
# Normal plot
imp_plot = figure(title = "F1 score as a function of the minimum impurity decrease required to split",
                   x_axis_label = "Minimum impurity decrease",
                   y_axis_label = "F1")

imp_plot.line(imp, F1[:, 0], legend_label = "Non-plagiarized", line_color = "steelblue")
imp_plot.line(imp, F1[:, 1], legend_label = "Plagiarized", line_color = "orange")


# Filtered plot to remove some noise and better appreciate tendencies
imp_plot_filtered = figure(title = "F1 score as a function of the minimum impurity decrease required to split",
                            x_axis_label = "Minimum impurity decrease",
                            y_axis_label = "F1")

imp_plot_filtered.line(imp, filtered_F1[:, 0], legend_label = "Non-plagiarized", line_color = "steelblue")
imp_plot_filtered.line(imp, filtered_F1[:, 1], legend_label = "Plagiarized", line_color = "orange")


show(gridplot([[imp_plot, imp_plot_filtered]], sizing_mode='scale_both'))

# Any value that isn't zero will make our score drop quickly, not too interesting

As we can see, the score drops drastically as soon as the value moves away from zero, and oscillates for both categories. We will therefore leave it at zero.

## Evaluation of the tuned model

In [219]:
# Reevaluate the model with the slight optimizations found earlier
clf = DecisionTreeClassifier(min_samples_split = 0.12, min_samples_leaf = 60)
eval_model_table(X, y, clf)

| Score type | Avg. non-plagiarized | Avg. plagiarized | Std. non-plagiarized | Std. plagiarized |
|---|---|---|---|---|
| precision | 0.701384 | 0.703107 | 0.023799 | 0.034566 |
| recall | 0.696028 | 0.70377 | 0.046674 | 0.031212 |
| F1 | 0.695705 | 0.699708 | 0.025206 | 0.016688 |
| support | 68.3 | 68.3 | 3.503256 | 3.653488 |


As we can see, we have improved the model (very slightly), reaching a precision close to 70%. Finally, we save the model so that it can be reused later.

In [220]:
import pickle
# We improved a bit, let's save it and work with this one
with open("../models/tree_classifier.model", "wb") as f:
    pickle.dump(clf, f)
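Reloading the model later is the mirror image of the call above. The sketch below shows the round trip with a stand-in object and a temporary file so that it runs on its own. The dictionary is only a placeholder for the trained classifier, and the temporary path replaces `../models/tree_classifier.model`.

```python
import os
import pickle
import tempfile

# Stand-in for the trained classifier so that the sketch runs on its own
model = {"min_samples_split": 0.12, "min_samples_leaf": 60}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tree_classifier.model")

    # Save, as in the cell above
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a later session would
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

With the real classifier, `restored.predict(...)` can then be called on new feature rows. Unpickling requires scikit-learn to be installed, ideally in the same version used for saving, and pickle files should only be loaded from trusted sources.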