# Combine N-Gram Paraphrasing Results and Score Using Idiolect

This notebook combines the n-gram paraphrasing results obtained with several scoring models. It uses the Idiolect R library, specifically its **performance** function, to calculate the evaluation metrics.

**NOTE**: The scores are assumed to be already calculated and aggregated for each model, so this notebook only needs to import and evaluate them. Aggregating in R caused joining issues because n-grams can begin with a space.
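One way such a joining issue can arise is sketched below. It assumes, without confirmation from this notebook, that the keys of one table had their whitespace trimmed somewhere along the way (for example by an import option such as `trim_ws`). The data frames and column names are hypothetical and only base R is used.

```r
# Hypothetical n-gram scores; note that " the cat" starts with a space.
scores <- data.frame(
  ngram = c(" the cat", "sat on"),
  score = c(-1.2, -0.7),
  stringsAsFactors = FALSE
)

# Simulate a second table whose keys lost their leading whitespace.
counts <- data.frame(
  ngram = trimws(c(" the cat", "sat on")),
  n = c(3L, 5L),
  stringsAsFactors = FALSE
)

# An exact-match join silently drops the n-gram whose key was altered.
joined <- merge(scores, counts, by = "ngram")
n_joined <- nrow(joined)
print(joined)
```

Comparing `nrow(joined)` with `nrow(scores)` after a join is a cheap way to notice rows lost like this.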

### Load Libraries

In [1]:
source("./utils.R")

In [2]:
# Load while suppressing all warnings
suppressWarnings(
  suppressPackageStartupMessages(
    {
      library(dplyr)
      library(idiolect)
      library(readr)
      library(readxl)
      library(writexl)
      library(purrr)
    }
  )
)

### Load Data

Here we load the data. Note that the paths point to files of already-aggregated results rather than to directories of raw output.

In [3]:
base_location = '/Volumes/BCross/paraphrase examples slurm/Wiki-Test'

# Token Size Problems
# This table contains the problems for each different min_token_size value in the dataset
token_size_problems = read_excel(paste0(base_location, '/token_size_problems.xlsx'))

# Raw Score Data
# This data contains the LLR scores aggregated across problems, with averaging across phrase occurrences done first
# raw_score_data = read_excel(paste0(base_location, '/score_by_token_size_avg.xlsx'))
raw_score_data = read_excel(paste0(base_location, '/combined_token_level_results_agg_v3.xlsx'))

# LambdaG Results
# Load the LambdaG results for the Wiki test dataset
lambdag_raw <- read.csv(paste0(base_location, '/lambdaG_results.csv'))

In [4]:
raw_score_data

# A tibble: 5,811 × 18
   paraphrasing_model scoring_model problem      corpus known_author unknown_author
   <chr>              <chr>         <chr>        <chr>  <chr>        <chr>         
 1 ModernBERT-base    gpt2          HOOTmag vs … Wiki   HOOTmag      HOOTmag       
 2 ModernBERT-base    gpt2          HOOTmag vs … Wiki   HOOTmag      Iain99        
 3 ModernBERT-base    gpt2          Hodja_Nasre… Wiki   Hodja_Nasre… Hodja_Nasredd…
 4 ModernBERT-base    gpt2          Hodja_Nasre… Wiki   Hodja_Nasre… HonestopL     
 5 ModernBERT-base    gpt2          HonestopL v… Wiki   HonestopL    HOOTmag       
 6 ModernBERT-base    gpt2          HonestopL v… Wiki   HonestopL    HonestopL     
 7 ModernBERT-base    gpt2 

### Create Final Dataset

Here we join the raw data with the problem dataset to filter out rows with incorrect token sizes.
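The filtering-by-join idea can be illustrated on toy data first. This is a minimal sketch: the tibbles `scores` and `valid` are hypothetical stand-ins, not the notebook's real tables, and `semi_join()` is shown only as a possible alternative when no extra columns are needed.

```r
library(dplyr)

# Hypothetical scores covering two token sizes for the same problem.
scores <- tibble(
  problem = c("A vs A", "A vs A", "A vs B"),
  min_token_size = c(2, 3, 2),
  unknown_log_prob = c(0.4, 0.1, -0.3)
)

# Hypothetical list of valid problem / token-size combinations.
valid <- tibble(
  problem = c("A vs A", "A vs B"),
  min_token_size = c(2, 2)
)

# inner_join() keeps only rows of `scores` whose keys also appear in `valid`.
kept <- scores %>%
  inner_join(valid, by = c("problem", "min_token_size")) %>%
  rename(score = unknown_log_prob)

# semi_join() filters the same way but never adds columns or duplicates rows.
kept_semi <- scores %>%
  semi_join(valid, by = c("problem", "min_token_size"))
```

The row with `min_token_size` 3 has no match in `valid`, so it is dropped by both joins.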

In [5]:
# Rename the unknown score column to `score` so that it works with performance()
score_data = raw_score_data %>%
  inner_join(token_size_problems, by = c('problem', 'min_token_size', 'corpus', 'target')) %>%
  rename('score'='unknown_log_prob')

score_data %>% head()

# A tibble: 6 × 18
  paraphrasing_model scoring_model problem       corpus known_author unknown_author
  <chr>              <chr>         <chr>         <chr>  <chr>        <chr>         
1 ModernBERT-base    gpt2          HOOTmag vs H… Wiki   HOOTmag      HOOTmag       
2 ModernBERT-base    gpt2          HOOTmag vs I… Wiki   HOOTmag      Iain99        
3 ModernBERT-base    gpt2          Hodja_Nasred… Wiki   Hodja_Nasre… Hodja_Nasredd…
4 ModernBERT-base    gpt2          Hodja_Nasred… Wiki   Hodja_Nasre… HonestopL     
5 ModernBERT-base    gpt2          HonestopL vs… Wiki   HonestopL    HOOTmag       
6 ModernBERT-base    gpt2          HonestopL vs… Wiki   HonestopL    HonestopL     
# ℹ 12 more variables: target <lgl>,

### Calculate Performance

In [6]:
distinct_model_sizes <- score_data %>%
  select(paraphrasing_model, scoring_model, corpus, min_token_size) %>%
  distinct() %>%
  arrange(paraphrasing_model, scoring_model, corpus, min_token_size)

distinct_model_sizes %>% head()

# A tibble: 6 × 4
  paraphrasing_model scoring_model corpus min_token_size
  <chr>              <chr>         <chr>           <dbl>
1 ModernBERT-base    gpt2          Wiki                2
2 ModernBERT-base    gpt2          Wiki                3
3 ModernBERT-base    gpt2          Wiki                4
4 ModernBERT-base    gpt2          Wiki                5
5 ModernBERT-large   gpt2          Wiki                2
6 ModernBERT-large   gpt2          Wiki                3

In [7]:
process_group <- function(paraphrasing_model, scoring_model, corpus, min_token_size) {

  distinct_problems <- token_size_problems %>%
    filter(min_token_size == !!min_token_size)

  # Filter score_data by the combination
  filtered <- score_data %>%
    filter(paraphrasing_model == !!paraphrasing_model,
           scoring_model == !!scoring_model,
           corpus == !!corpus,
           min_token_size == !!min_token_size) %>%
    inner_join(distinct_problems, by=c('corpus', 'problem', 'min_token_size', 'target'))
  
  # Run performance function
  perf <- performance(filtered)
  perf <- perf$evaluation

  # Add the identifying columns
  cbind(
    data.frame(paraphrasing_model = paraphrasing_model,
               scoring_model=scoring_model,
               corpus = corpus,
               min_token_size = min_token_size),
    perf
  )
}
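The function above gives its arguments the same names as the columns it filters on, which is why each comparison uses `!!`. The sketch below, on a hypothetical `sizes` tibble, shows the difference: without `!!` both sides of `==` resolve to the column, so nothing is filtered out.

```r
library(dplyr)

# Hypothetical table with a column named like the function argument.
sizes <- tibble(min_token_size = c(2, 2, 3, 4))

keep_size <- function(min_token_size) {
  # Without `!!`, both sides refer to the column, so every row is kept.
  unfiltered <- sizes %>% filter(min_token_size == min_token_size)

  # With `!!`, the right-hand side is replaced by the argument's value.
  filtered <- sizes %>% filter(min_token_size == !!min_token_size)

  list(unfiltered = nrow(unfiltered), filtered = nrow(filtered))
}

counts <- keep_size(2)
print(counts)
```

Writing `.env$min_token_size` on the right-hand side is another way to refer to the argument rather than the column.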

In [8]:
results <- distinct_model_sizes %>%
  pmap_dfr(process_group)

results %>% arrange(corpus, min_token_size, paraphrasing_model, scoring_model) %>% head()

Setting levels: control = FALSE, case = TRUE
Setting direction: controls < cases

  paraphrasing_model scoring_model corpus min_token_size      Cllr  Cllr_min
1    ModernBERT-base          gpt2   Wiki              2 0.8596104 0.7608662
2   ModernBERT-large          gpt2   Wiki              2 0.8777187 0.7732053
3               gpt5         gemma   Wiki              2 0.8856810 0.8033393
4               gpt5          gpt2   Wiki              2 0.8456774 0.7535165
5               gpt5         llama   Wiki              2 0.8878195 0.8066612
6               gpt5          qwen   Wiki              2 0.8499285 0.7611920
       EER Mean TRUE LLR Mean FALSE LLR TRUE trials FALSE trials       AUC
1 27.97534     0.3504266     -0.2351931         114          114 0.7696110
2 29.38808     0.2973148     -0.2055614         106          107 0.7494505
3 31.23065     0.2539625     -0.1932376         114          114 0.7472098
4 28.20513     0.4098410     -0.2536523         114          114 0.7797353
5 31.05023     0.2469576     -0.1902074         114          114 0.7445791
6 28.57143 
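The row-wise pattern used to build `results` can be sketched on toy data. Here `grid` and `summarise_combo()` are hypothetical stand-ins for `distinct_model_sizes` and `process_group()`. Recent purrr releases mark `pmap_dfr()` as superseded, so this sketch uses `pmap()` followed by `bind_rows()`, which should give an equivalent result.

```r
library(dplyr)
library(purrr)

# Hypothetical grid: one row per combination to evaluate.
grid <- tibble(
  scoring_model = c("gpt2", "gpt2", "qwen"),
  min_token_size = c(2, 3, 2)
)

# Stand-in for process_group(): argument names must match the grid's column names.
summarise_combo <- function(scoring_model, min_token_size) {
  data.frame(
    scoring_model = scoring_model,
    min_token_size = min_token_size,
    label = paste0(scoring_model, "_", min_token_size)
  )
}

# pmap() calls the function once per row; bind_rows() stacks the results.
out <- grid %>%
  pmap(summarise_combo) %>%
  bind_rows()
```

Because `pmap()` matches columns to arguments by name, a grid with an extra column would need either a matching argument or a `...` in the function signature.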

### LambdaG Results

In [9]:
# Get the distinct corpus / min_token_size combinations and add LambdaG as the model in the leading columns
distinct_corpus_sizes <- distinct_model_sizes %>%
  select(corpus, min_token_size) %>%
  distinct() %>%
  arrange(corpus, min_token_size) %>%
  mutate(paraphrasing_model = "LambdaG", scoring_model = "LambdaG") %>%
  relocate(c(paraphrasing_model, scoring_model), .before = everything())

In [10]:
distinct_corpus_sizes

# A tibble: 4 × 4
  paraphrasing_model scoring_model corpus min_token_size
  <chr>              <chr>         <chr>           <dbl>
1 LambdaG            LambdaG       Wiki                2
2 LambdaG            LambdaG       Wiki                3
3 LambdaG            LambdaG       Wiki                4
4 LambdaG            LambdaG       Wiki                5

In [11]:
lambdag_raw

                                           problem          known_author
1                               HOOTmag vs HOOTmag               HOOTmag
2                               Icarus3 vs Icarus3               Icarus3
3                               Rjecina vs Rjecina               Rjecina
4                               Lear_21 vs Lear_21               Lear_21
5                     Richard_Daft vs Richard_Daft          Richard_Daft
6                               Rjecina vs RolandR               Rjecina
7                             Pro-Lick vs Pro-Lick              Pro-Lick
8                         Jim_Hardie vs Jim_Hardie            Jim_Hardie
9                               Snowded vs Snowded               Snowded
10                         WIKI-GUY-16 vs WilliamH           WIKI-GUY-16
11                                Yoenit vs Yoenit                Yoenit
12                                KBlott vs KBlott                KBlott
13                               SKS2K6 vs Snowded 

In [12]:
# Process the LambdaG results for one corpus / min_token_size combination
process_group_lambdag <- function(paraphrasing_model, scoring_model, corpus, min_token_size) {

  # Select the problems that apply to this corpus and token size
  problems <- token_size_problems %>%
    filter(corpus == !!corpus,
           min_token_size == !!min_token_size)
  
  # Keep only the LambdaG scores for those problems
  filtered_lambdag <- lambdag_raw %>%
    inner_join(problems, by=c('problem', 'target'))

  # Run the performance function and keep its evaluation table
  perf <- performance(filtered_lambdag)
  perf <- perf$evaluation

  # Add the identifying columns
  cbind(
    data.frame(paraphrasing_model = paraphrasing_model,
               scoring_model = scoring_model,
               corpus = corpus,
               min_token_size = min_token_size),
    perf
  )
}

In [13]:
results_lambdag <- distinct_corpus_sizes %>%
  pmap_dfr(process_group_lambdag)

Setting levels: control = FALSE, case = TRUE
Setting direction: controls < cases


In [14]:
results_lambdag %>% head()

  paraphrasing_model scoring_model corpus min_token_size      Cllr  Cllr_min
1            LambdaG       LambdaG   Wiki              2 0.6476871 0.5243020
2            LambdaG       LambdaG   Wiki              3 0.6374915 0.5171254
3            LambdaG       LambdaG   Wiki              4 0.6061458 0.4756306
4            LambdaG       LambdaG   Wiki              5 0.6872042 0.4197365
       EER Mean TRUE LLR Mean FALSE LLR TRUE trials FALSE trials       AUC
1 16.66667      1.092086     -0.7585601         114          114 0.9011480
2 16.44444      1.139849     -0.7887780         113          112 0.9049140
3 14.59695      1.331042     -0.8316408          92           55 0.9171908
4 15.33742      1.709290     -0.6700987          50           12 0.8895833
  Balanced Accuracy Precision    Recall        F1 TP FN FP TN
1         0.8303571 0.8303571 0.8303571 0.8303571 93 19 19 93
2         0.8325962 0.8363636 0.8288288 0.8325792 92 19 18 92
3         0.8545073 0.9250000 0.8222222 0.8705882 74 1
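For readers unfamiliar with the `Cllr` column, the sketch below computes the log-likelihood-ratio cost from its usual definition, Cllr = 0.5 * (mean over same-author trials of log2(1 + 1/LR) + mean over different-author trials of log2(1 + LR)). It assumes the inputs are already calibrated LLRs on the natural-log scale. That assumption and the helper name `cllr()` are mine, and the sketch is not necessarily how **performance** computes the value internally.

```r
# Hypothetical helper: Cllr from natural-log likelihood ratios.
cllr <- function(llr_true, llr_false) {
  lr_true <- exp(llr_true)    # LRs for same-author (TRUE) trials
  lr_false <- exp(llr_false)  # LRs for different-author (FALSE) trials
  0.5 * (mean(log2(1 + 1 / lr_true)) + mean(log2(1 + lr_false)))
}

# An uninformative system (every LR equal to 1) has a Cllr of exactly 1.
baseline <- cllr(c(0, 0), c(0, 0))

# Made-up LLRs that point the right way give a Cllr below 1.
informative <- cllr(c(2, 1.5), c(-2, -1))
```

Lower values are better, and values close to 1 indicate that the LLRs carry little usable information.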

### Combine Results with LambdaG Results

In [15]:
results_combined <- rbind(results, results_lambdag) %>%
  arrange(corpus, min_token_size, paraphrasing_model, scoring_model)
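Stacking the two result tables with `rbind()` relies on both having the same set of columns. The data frame method matches columns by name rather than by position and raises an error when the names differ. A small base R sketch with made-up values:

```r
# Two hypothetical result tables with the same columns in a different order.
a <- data.frame(model = "LambdaG", Cllr = 0.1, stringsAsFactors = FALSE)
b <- data.frame(Cllr = 0.2, model = "gpt2", stringsAsFactors = FALSE)

# Columns are matched by name, so the order in `b` does not matter.
stacked <- rbind(a, b)

# A table with a different column set cannot be stacked this way.
other <- data.frame(model = "qwen", EER = 0.3, stringsAsFactors = FALSE)
mismatch <- tryCatch(
  {
    rbind(a, other)
    FALSE
  },
  error = function(e) TRUE
)
```

If the column sets could ever diverge, `dplyr::bind_rows()` is a possible alternative that fills missing columns with `NA` instead of failing.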

In [16]:
results_combined %>% head(12)

   paraphrasing_model scoring_model corpus min_token_size      Cllr  Cllr_min
1             LambdaG       LambdaG   Wiki              2 0.6476871 0.5243020
2     ModernBERT-base          gpt2   Wiki              2 0.8596104 0.7608662
3    ModernBERT-large          gpt2   Wiki              2 0.8777187 0.7732053
4                gpt5         gemma   Wiki              2 0.8856810 0.8033393
5                gpt5          gpt2   Wiki              2 0.8456774 0.7535165
6                gpt5         llama   Wiki              2 0.8878195 0.8066612
7                gpt5          qwen   Wiki              2 0.8499285 0.7611920
8             LambdaG       LambdaG   Wiki              3 0.6374915 0.5171254
9     ModernBERT-base          gpt2   Wiki              3 0.7933321 0.6895292
10   ModernBERT-large          gpt2   Wiki              3 0.8119625 0.7151030
11               gpt5         gemma   Wiki              3 0.7703894 0.6627526
12               gpt5          gpt2   Wiki              3 0.7630

In [17]:
results_combined %>%
  write_xlsx(paste0(base_location, "/idiolect_token_results_summary_avg_logprobs.xlsx"))