# Documentation for Validation
Author: Paschalis Agapitos

---
## Principal Component Analysis and Bootstrap Consensus Tree
### Overview
The notebook `valid_PCA_BCT` validates the methods used in the paper "A Stylometric Analysis of Seneca's Disputed Plays: Authorship Verification of *Octavia* and *Hercules Oetaeus*." The analysis uses a dataset of 28 verse texts written by three authors (listed below). The methods applied are Principal Component Analysis (PCA) and Bootstrap Consensus Tree (BCT), both run on the 100 to 2,000 most frequent character 4-grams (MFCs).

### Setup

```{r, echo=FALSE}
# Install the 'stylo' package only if it is not already installed
if (!requireNamespace("stylo", quietly = TRUE)) {
  install.packages("stylo")
}
```

### Dataset Description

The dataset is stored in the `validation_corpus` directory and includes texts from:

- **Publius Ovidius Naso (Ovid)**:
  - *Ars Amatoria*
  - *Epistulae*
  - *Fasti*
  - *Ibis*
  - *Medicamina Faciei Femineae*
  - *Metamorphoses*
  - *Ex Ponto*
  - *Remedia Amoris*
  - *Tristia*
- **Aulus Persius Flaccus (Persius)**:
  - The six books of *Satires*
- **Publius Papinius Statius (Statius)**:
  - The 12 books of *Thebaid*

Three texts have been renamed **manually** so that they can serve as the disputed texts during validation:

- **Amores by Ovid**: `unknown0.txt`
- **Thebaid book 1 by Statius**: `unknown1.txt`
- **Satire 4 by Persius**: `unknown2.txt`

### Libraries

```{r}
library(stylo)
```

### Working Directory

Set the working directory to `validation_PCA_BCT/`.

```{r}
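# Use this directory as the root for all chunks when the notebook is knitted
knitr::opts_knit$set(root.dir = '../../validation_PCA_BCT/')
# Also change the directory for the current chunk or interactive session
setwd('../../validation_PCA_BCT/')
# Print the working directory to confirm the change
getwd()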
```

### Importing and Tokenizing the Corpus

Load and tokenize the corpus, converting uppercase letters to lowercase to reduce orthographic variations.

```{r}
# Load the corpus for PCA validation
raw_corpus <- load.corpus(files = "all", corpus.dir = "../../validation_corpora/validation_corpus_PCA/", encoding = "UTF-8")

# Tokenize the corpus
tokenized_corpus <- txt.to.words.ext(raw_corpus, corpus.lang = "Latin.corr", preserve.case = FALSE)
```
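As a rough illustration of what case folding and tokenization do to a line of verse, the base R sketch below uses a simplified splitting rule. The regular expression and the sample line are assumptions chosen for the example; the sketch does not reproduce the rule that `txt.to.words.ext()` applies for `Latin.corr`, and it ignores any language-specific handling that setting performs.

```r
# Toy illustration only: a simplified stand-in for a tokenizer.
# The splitting rule below is an assumption, not stylo's own rule.
line <- "Arma gravi numero violentaque bella parabam"

# Fold case so that "Arma" and "arma" count as the same word type
lowered <- tolower(line)

# Split on runs of non-letter characters and drop empty strings
tokens <- unlist(strsplit(lowered, "[^[:alpha:]]+"))
tokens <- tokens[nzchar(tokens)]

print(tokens)
```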

### Removing Pronouns

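Remove pronouns from the corpus: their frequencies tend to reflect genre and narrative perspective rather than authorial style, so keeping them could introduce genre-specific biases.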

```{r}
# Remove pronouns from the tokenized corpus
corpus_no_pronouns <- delete.stop.words(tokenized_corpus, stop.words = stylo.pronouns(corpus.lang = "Latin.corr"))

# Display list of removed pronouns
stylo.pronouns(corpus.lang = "Latin.corr")
```
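The filtering step can be pictured with a small base R sketch. The stop list `toy_pronouns` and the two toy texts are hypothetical and exist only for this example; the real list is the one returned by `stylo.pronouns(corpus.lang = "Latin.corr")`.

```r
# Hypothetical stop list for illustration; not the list used by stylo
toy_pronouns <- c("ego", "tu", "me", "te", "nos")

# Hypothetical tokenized corpus: a named list with one token vector per text
toy_corpus <- list(
  text_a = c("ego", "te", "amo"),
  text_b = c("nos", "bella", "canimus")
)

# For every text, keep only the tokens that are not in the stop list
filtered <- lapply(toy_corpus, function(tokens) {
  tokens[!tokens %in% toy_pronouns]
})

print(filtered)
```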

### Extracting Character 4-grams

Extract character 4-grams and create a frequency table.

```{r}
# Extract character 4-grams
corpus_char_4grams <- txt.to.features(corpus_no_pronouns, features = "c", ngram.size = 4)

# Create a frequency list limited to the 2000 most frequent 4-grams
frequent_features_4grams <- make.frequency.list(corpus_char_4grams, head = 2000)

# Create a table of frequencies for the 4-grams
freqs_4grams <- make.table.of.frequencies(corpus_char_4grams, features = frequent_features_4grams, relative = TRUE)
```
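To make the idea of character 4-grams concrete, here is a minimal base R sketch, assuming that tokens are joined with single spaces so that n-grams can span word boundaries. The helper `char_ngrams()` is hypothetical and is not part of `stylo`; the way `txt.to.features()` formats its n-grams may differ.

```r
# Hypothetical helper for illustration; not a stylo function
char_ngrams <- function(tokens, n = 4) {
  # Join tokens with single spaces, then split into individual characters
  chars <- strsplit(paste(tokens, collapse = " "), "")[[1]]
  if (length(chars) < n) return(character(0))
  # Slide a window of width n over the character sequence
  starts <- seq_len(length(chars) - n + 1)
  vapply(starts,
         function(i) paste(chars[i:(i + n - 1)], collapse = ""),
         character(1))
}

grams <- char_ngrams(c("amor", "amor"), n = 4)
print(grams)

# Relative frequency of each distinct 4-gram, expressed as a percentage
rel_freqs <- 100 * table(grams) / length(grams)
print(sort(rel_freqs, decreasing = TRUE))
```

In this toy input the 4-gram `amor` occurs twice among six windows, so it receives the highest relative frequency.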

### Method Validation - Character 4-grams

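#### Principal Component Analysis - Correlation Matrix (MFC 4-grams)

Apply PCA based on a correlation matrix (`analysis.type = "PCR"`) to visualize how the texts group on the MFC 4-grams. Cosine Delta (`distance.measure = "wurzburg"`) is also passed to the call, although PCA itself is computed from the correlation matrix of the frequency table rather than from a distance matrix.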

```{r}
# PCA (correlation matrix): 100 to 2000 MFC 4-grams, in increments of 100
results_pca_4grams_cor <- stylo(frequencies = freqs_4grams, 
                                analysis.type = "PCR",
                                mfw.min = 100, mfw.max = 2000, increment = 100, 
                                distance.measure = "wurzburg", # Cosine Delta
                                custom.graph.title = "Who is the author?", 
                                pca.visual.flavour = "classic", 
                                write.pdf.file = TRUE, 
                                gui = TRUE) # GUI is used to double-check the parameters.
```
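For readers unfamiliar with correlation-matrix PCA, the following base R sketch shows the general technique on made-up data. It assumes nothing about how `stylo` implements the `"PCR"` option internally; `toy_freqs` is a hypothetical frequency table with random values.

```r
set.seed(1)

# Hypothetical frequency table: 8 texts (rows) by 4 features (columns)
toy_freqs <- matrix(runif(32), nrow = 8,
                    dimnames = list(paste0("text", 1:8),
                                    paste0("gram", 1:4)))

# scale. = TRUE standardizes every column first, which makes the result
# equivalent to a PCA of the correlation matrix
pca <- prcomp(toy_freqs, center = TRUE, scale. = TRUE)

# Share of the total variance captured by each component
explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(explained, 3))

# Coordinates of the texts on the first two components (what a PCA plot shows)
print(pca$x[, 1:2])
```

Standardizing the columns gives every 4-gram the same weight, so very frequent 4-grams do not dominate the components simply because their raw variance is larger.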

#### Bootstrap Consensus Tree - MFC 4-grams

Load and prepare the corpus for BCT validation, following the same preprocessing steps.

```{r}
# Load the corpus for BCT validation
raw_corpus_bct <- load.corpus(files = "all", corpus.dir = "../../../validation/validation_corpora/validation_corpus_BCT/", encoding = "UTF-8")

# Tokenize the corpus
tokenized_corpus_bct <- txt.to.words.ext(raw_corpus_bct, corpus.lang = "Latin.corr", preserve.case = FALSE)
```

Remove pronouns from the corpus.

```{r}
# Remove pronouns from the tokenized corpus
corpus_no_pronouns_bct <- delete.stop.words(tokenized_corpus_bct, stop.words = stylo.pronouns(corpus.lang = "Latin.corr"))

# Display list of removed pronouns
stylo.pronouns(corpus.lang = "Latin.corr")
```

Extract character 4-grams and create a frequency table.

```{r}
# Extract character 4-grams
corpus_char_4grams_bct <- txt.to.features(corpus_no_pronouns_bct, features = "c", ngram.size = 4)

# Create a frequency list limited to the 5000 most frequent 4-grams
frequent_features_4grams_bct <- make.frequency.list(corpus_char_4grams_bct, head = 5000)

# Create a table of frequencies for the 4-grams
freqs_4grams_bct <- make.table.of.frequencies(corpus_char_4grams_bct, features = frequent_features_4grams_bct, relative = TRUE)
```

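Build the Bootstrap Consensus Tree from 100 to 2,000 MFC 4-grams, in increments of 100, using Cosine Delta as the distance measure.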

```{r}
# BCT: 100 to 2000 MFC 4-grams, in increments of 100; consensus strength 0.5
bct_results_4grams <- stylo(frequencies = freqs_4grams_bct,
                            distance.measure = "wurzburg", # Cosine Delta
                            analysis.type = "BCT",
                            mfw.min = 100, mfw.max = 2000, increment = 100,
                            consensus.strength = 0.5, # stated explicitly to match the comment above
                            custom.graph.title = "Who is the author?",
                            write.pdf.file = TRUE,
                            gui = TRUE) # GUI is used to double-check the parameters.
```
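Cosine Delta, selected above with `distance.measure = "wurzburg"`, is commonly described as the cosine distance between z-scored frequency vectors. The base R sketch below follows that description under stated assumptions: `cosine_delta()` is a hypothetical helper, the frequencies are invented, and the code is not taken from `stylo`.

```r
# Hypothetical helper for illustration; not a stylo function
cosine_delta <- function(freqs) {
  # z-score every feature (column) across all texts
  z <- scale(freqs, center = TRUE, scale = TRUE)
  # Cosine similarity between the z-score vectors of each pair of texts
  norms <- sqrt(rowSums(z^2))
  sim <- (z %*% t(z)) / (norms %o% norms)
  # Turn similarity into a distance
  as.dist(1 - sim)
}

# Invented relative frequencies for four texts and three features
toy_freqs <- rbind(
  ovid_a    = c(2.0, 1.0, 0.5),
  ovid_b    = c(2.1, 1.1, 0.4),
  statius_a = c(1.0, 2.0, 1.5),
  statius_b = c(0.9, 2.2, 1.4)
)
colnames(toy_freqs) <- c("gram1", "gram2", "gram3")

d <- cosine_delta(toy_freqs)
print(round(d, 3))
```

In this toy table the two `ovid_*` rows end up much closer to each other than to either `statius_*` row. The `stylo()` call above requests such a comparison for each of the MFC counts in `seq(100, 2000, by = 100)`, that is, 20 settings.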