# Idiolect LambdaG Method

This notebook runs the LambdaG method from the idiolect package. The results serve as a baseline for my methods.

In [84]:
invisible(remove.packages("idiolect"))
invisible(install.packages("idiolect"))

Removing package from ‘/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library’
(as ‘lib’ is unspecified)
installing the source package ‘idiolect’



trying URL 'https://cran.rstudio.com/src/contrib/idiolect_1.1.1.tar.gz'
Content type 'application/x-gzip' length 898863 bytes (877 KB)
downloaded 877 KB

* installing *source* package ‘idiolect’ ...
** package ‘idiolect’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading


1: package ‘quanteda’ was built under R version 4.2.3 
2: In .recacheSubclasses(def@className, def, env) :
  undefined subclass "ndiMatrix" of class "replValueSp"; definition not updated
3: In .recacheSubclasses(def@className, def, env) :
  undefined subclass "pcorMatrix" of class "replValueSp"; definition not updated


** help
*** installing help indices
*** copying figures
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location


  undefined subclass "ndiMatrix" of class "replValueSp"; definition not updated
  undefined subclass "pcorMatrix" of class "replValueSp"; definition not updated


** testing if installed package can be loaded from final location


  undefined subclass "ndiMatrix" of class "replValueSp"; definition not updated
  undefined subclass "pcorMatrix" of class "replValueSp"; definition not updated


** testing if installed package keeps a record of temporary installation path
* DONE (idiolect)

The downloaded source packages are in
	‘/private/var/folders/xx/hy496x3x5sn4hy9gy1fk19lw0000gp/T/RtmpY4C8wk/downloaded_packages’


In [85]:
suppressWarnings(
  suppressPackageStartupMessages(
    {
      library(writexl)
      library(idiolect)
      library(dplyr)
    }
  )
)

In [86]:
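# Run LambdaG on every verification problem (row) in `metadata`.
# `known` and `unknown` are quanteda corpora with `corpus` and `author` docvars.
# Returns one row per problem with the score averaged over five runs.
run_lambdag <- function(metadata, known, unknown){

  # Flag same-author problems (TRUE when known and unknown authors match)
  metadata <- metadata |>
    dplyr::mutate(target = known_author == unknown_author)

  # Pre-allocate one slot per problem per repetition
  out <- vector("list", nrow(metadata) * 5L)
  idx <- 1L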

  for(i in seq_len(nrow(metadata))){

    message("Row ", i, " of ", nrow(metadata))
    
    selected_problem <- metadata[i, , drop = FALSE]

    k  <- as.character(selected_problem$known_author)
    u  <- as.character(selected_problem$unknown_author)
    c_ <- as.character(selected_problem$corpus)

    # Known-author texts: content-masked with POSnoise, then split into sentences
    known_subset <- quanteda::corpus_subset(known, corpus == c_ & author == k) |>
      contentmask(algorithm = "POSnoise") |>
      quanteda::tokens("sentence")

    # Questioned (unknown) texts, preprocessed the same way
    unknown_subset <- quanteda::corpus_subset(unknown, corpus == c_ & author == u) |>
      contentmask(algorithm = "POSnoise") |>
      quanteda::tokens("sentence")

    # Reference set: same corpus, excluding both authors involved in this problem
    ref_subset <- quanteda::corpus_subset(known, corpus == c_ & author != k & author != u) |>
      contentmask(algorithm = "POSnoise") |>
      quanteda::tokens("sentence")

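    # Repeat the LambdaG run five times; the scores are averaged per problem below
    for(j in seq_len(5L)){
      test_results <- lambdaG(unknown_subset, known_subset, ref_subset)
      test_score <- test_results$score

      # Store the problem metadata alongside the repetition index and its score
      out[[idx]] <- cbind(selected_problem, rep = j, score = test_score)
      idx <- idx + 1L
    }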
  }

  final_results <- dplyr::bind_rows(out)

  # Average the repeated scores so that each problem keeps a single row
  final_results |>
    group_by(across(all_of(names(metadata)))) |>
    summarise(score = mean(score, na.rm = TRUE), .groups = "drop")

}
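
The last step of `run_lambdag` collapses the five repeated scores for each problem into a single mean. The sketch below illustrates only that aggregation step on a toy data frame, so it can be run without idiolect or spaCy. The data frame `toy_runs`, its author labels, and its scores are made up for illustration and are not results from this notebook.

```r
library(dplyr)

# Hypothetical repeated scores: two problems, three repetitions each.
# The score values are invented; no call to lambdaG() is made here.
toy_runs <- data.frame(
  corpus         = "Wiki",
  known_author   = rep(c("a1", "a2"), each = 3),
  unknown_author = rep(c("a1", "a3"), each = 3),
  rep            = rep(1:3, times = 2),
  score          = c(1, 2, 3, -1, -2, -3)
)
toy_runs$target <- toy_runs$known_author == toy_runs$unknown_author

# Columns that identify a problem (the role played by names(metadata) above)
id_cols <- c("corpus", "known_author", "unknown_author", "target")

# Group by the identifying columns and average over the repetitions
toy_summary <- toy_runs |>
  group_by(across(all_of(id_cols))) |>
  summarise(score = mean(score, na.rm = TRUE), .groups = "drop")

print(toy_summary)
```

Grouping by every metadata column keeps those columns in the output, while the `rep` column is dropped because it is neither a grouping variable nor summarised.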

In [87]:
base_read_loc <- "/Volumes/BCross/datasets/author_verification"
base_save_loc <- "/Volumes/BCross/av_datasets_experiments/Baseline Results"

data_types <- c("test", "training")

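# Full set of corpora in the dataset (overridden by the subset below)
corpora <- c("Wiki", "Enron", "Perverted Justice", "StackExchange",
            "ACL", "TripAdvisor", "The Apricity", "Koppel's Blogs",
            "The Telegraph", "Reddit")

# Subset of corpora actually processed in this run
corpora <- c("Wiki", "Enron", "Perverted Justice", "ACL", "StackExchange")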

for(d_type in data_types){

  known <- readRDS(file.path(base_read_loc, d_type, "known_raw.rds"))
  unknown <- readRDS(file.path(base_read_loc, d_type, "unknown_raw.rds"))
  metadata <- readRDS(file.path(base_read_loc, d_type, "metadata.rds"))

  for(c_ in corpora){
    message("Working on: ", d_type, " - ", c_)
    
    base_write_loc <- file.path(base_save_loc, d_type, c_)
    write_loc <- file.path(base_write_loc, "lambdag_results.rds")

    dir.create(base_write_loc, recursive = TRUE, showWarnings = FALSE)
    
    # Skip this corpus if its results have already been saved
    if (file.exists(write_loc)){
      message("Skipping....")
      next
    }

    filtered_metadata <- metadata |>
      dplyr::filter(corpus == c_)

    results <- run_lambdag(filtered_metadata, known, unknown)

    joined_results <- cbind(data_type=d_type, results)

    saveRDS(joined_results, write_loc)
  }
}

Working on: test - Wiki
Skipping....
Working on: test - Enron
Skipping....
Working on: test - Perverted Justice
Skipping....
Working on: test - ACL
Skipping....
Working on: test - StackExchange
Row 1 of 228
successfully initialized (spaCy Version: 3.8.2, language model: en_core_web_sm)
successfully initialized (spaCy Version: 3.8.2, language model: en_core_web_sm)
successfully initialized (spaCy Version: 3.8.2, language model: en_core_web_sm)
  |                                                  | 0 % ~calculating   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s  
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Row 2 of 228
successfully initialized (spaCy Version: 3.8.2, language model: en_core_web_sm)
successfully initi

## Combine Files

In [88]:
overall_list <- list()

for (d_type in data_types) {
  
  d_type_list <- list()
  
  d_type_dir <- file.path(base_save_loc, d_type)
  dir.create(d_type_dir, recursive = TRUE, showWarnings = FALSE)
  d_type_save_loc <- file.path(d_type_dir, "lambdag_results.rds")
  
  for (c_ in corpora) {
    
    read_loc <- file.path(base_save_loc, d_type, c_, "lambdag_results.rds")
    
    if (!file.exists(read_loc)) {
      message("Skipping (missing): ", read_loc)
      next
    }
    
    data <- readRDS(read_loc)
    d_type_list[[length(d_type_list) + 1]] <- data
  }
  
  if (length(d_type_list) == 0) {
    message("No files found for data type: ", d_type)
    next
  }
  
  d_type_combined <- bind_rows(d_type_list)
  saveRDS(d_type_combined, d_type_save_loc)
  
  overall_list[[length(overall_list) + 1]] <- d_type_combined
}

if (length(overall_list) > 0) {
  overall_combined <- bind_rows(overall_list)
  overall_save_loc <- file.path(base_save_loc, "lambdag_results.rds")
  saveRDS(overall_combined, overall_save_loc)
} else {
  message("No files found across any data types.")
}
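
The combine step follows a simple pattern: build the expected paths, keep only those that exist, read each file, and row-bind the results. A minimal sketch of that pattern is shown below, using a temporary directory and made-up data frames so that it runs on its own. The directory name `lambdag_demo`, the file names, and the scores are hypothetical.

```r
library(dplyr)

# Hypothetical scratch directory standing in for base_save_loc
tmp_dir <- file.path(tempdir(), "lambdag_demo")
dir.create(tmp_dir, recursive = TRUE, showWarnings = FALSE)

# Write two small, made-up result files
saveRDS(data.frame(corpus = "Wiki",  score = 1.5),  file.path(tmp_dir, "wiki.rds"))
saveRDS(data.frame(corpus = "Enron", score = -0.5), file.path(tmp_dir, "enron.rds"))

# One expected path does not exist, mimicking a corpus that has not been run yet
paths <- file.path(tmp_dir, c("wiki.rds", "enron.rds", "missing.rds"))
existing <- paths[file.exists(paths)]

# Read every file that exists and stack the rows
combined <- bind_rows(lapply(existing, readRDS))

print(combined)
```

Filtering on `file.exists()` before reading plays the same role as the `next` statement in the loop above: corpora without saved results are left out rather than causing an error.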