SupervisedStylometry/SuperStyl

SUPERvised STYLometry

Installing

You will need Python 3.9 or later, pip, and optionally virtualenv.

git clone https://github.com/SupervisedStylometry/SuperStyl.git
cd SuperStyl
virtualenv -p python3.9 env  # or later
source env/bin/activate
pip install -r requirements.txt

Basic usage

To use Superstyl, you have two options:

  1. Use the provided command-line interface from your OS terminal (tested on Linux)
  2. Import Superstyl in a Python script or notebook, and use the API commands

You also need a collection of files containing the text that you wish to analyse. Source files in Superstyl follow this naming convention:

Class_anythingthatyouwant

For instance:

Moliere_Amphitryon.txt

The text before the first underscore will be used as the class for training models.
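As a small illustration of this convention (a sketch only, not Superstyl's own code), the class label can be read off a file name as follows. The helper name `class_from_filename` is hypothetical, and raising an error when no underscore is present is an assumption made for the example.

```python
from pathlib import Path


def class_from_filename(path):
    """Return the text before the first underscore of the file name.

    Hypothetical helper for illustration; Superstyl's own handling
    of file names may differ.
    """
    name = Path(path).name
    label, sep, _rest = name.partition("_")
    if not sep:
        # Assumption: a file name without an underscore has no usable class.
        raise ValueError(f"No underscore in file name: {name!r}")
    return label


print(class_from_filename("data/train/Moliere_Amphitryon.txt"))  # Moliere
```

Note that only the first underscore matters: a file named `Corneille_Le_Cid.txt` would get the class `Corneille`.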

Command-Line Interface

A minimal example, which builds a corpus of character 3-gram frequencies, trains an SVM model with leave-one-out cross-validation, and predicts the class of unknown texts, would be:

# Creating the corpus and extracting character 3-grams from text files
python load_corpus.py -s data/train/*.txt -t chars -n 3 -o train
python load_corpus.py -s data/test/*.txt -t chars -n 3 -o unknown -f train_feats.json
# Training an SVM, with cross-validation, and using it to predict the class of unknown samples
python train_svm.py train.csv --test_path unknown.csv --cross_validate leave-one-out --final

The first two commands write the files train.csv and unknown.csv to disk, containing the metadata and feature frequencies for both sets of files, along with a file train_feats.json containing the list of features used.

The last command prints the cross-validation scores, and then writes to disk a file FINAL_PREDICTIONS.csv, containing the class predictions for the unknown texts.
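To make "character 3-gram frequencies" concrete, here is a minimal sketch in plain Python. It is not the code used by load_corpus.py: the function name `char_ngram_freqs` is hypothetical, and the choice of relative frequencies with no text preprocessing is an assumption for the example.

```python
from collections import Counter


def char_ngram_freqs(text, n=3):
    """Relative frequencies of overlapping character n-grams in `text`.

    Illustrative only: Superstyl may normalise or preprocess differently.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}


print(char_ngram_freqs("abcabc"))  # {'abc': 0.5, 'bca': 0.25, 'cab': 0.25}
```

Each text then becomes one row of such frequencies, restricted to a shared feature list so that the training and unknown sets have the same columns.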

This is just a small sample of the available corpus and training options.

To learn more, run:

python load_corpus.py --help
python train_svm.py --help

Python API

A minimal example, which builds a corpus, trains an SVM model with cross-validation, and predicts the class of an unknown text, would be:

import superstyl as sty
import glob
# Creating the corpus and extracting character 3-grams from text files
train, train_feats = sty.load_corpus(glob.glob("data/train/*.txt"),
                                     feats="chars", n=3)
unknown, unknown_feats = sty.load_corpus(glob.glob("data/test/*.txt"),
                                         feat_list=train_feats,
                                         feats="chars", n=3)
# Training an SVM, with cross-validation, and using it
# to predict the class of unknown samples
sty.train_svm(train, unknown, cross_validate="leave-one-out",
              final_pred=True)

This is just a small sample of the available corpus and training options.

To learn more, run:

help(sty.load_corpus)
help(sty.train_svm)

Advanced usage

FIXME: look inside the scripts, or do

python load_corpus.py --help
python train_svm.py --help

for full documentation on the main functionalities of the CLI, regarding data generation (load_corpus.py) and SVM training (train_svm.py).

For more specific data-processing tasks (splitting and merging datasets), see also:

python split.py --help
python merge_datasets.csv.py --help

Get feats

With or without a preexisting feature list:

python load_corpus.py -s path/to/docs/* -t chars -n 3
# with it
python load_corpus.py -s path/to/docs/* -f feature_list.json -t chars -n 3
# There are several other available options
# See --help

Alternatively, you can build samples out of the data, for a given number of verses or words:

# words from txt
python load_corpus.py -s data/psyche/train/* -t chars -n 3 -x txt --sampling --sample_units words --sample_size 1000
# verses from TEI encoded docs
python load_corpus.py -s data/psyche/train/* -t chars -n 3 -x tei --sampling --sample_units verses --sample_size 200

Many more options are available (feature extraction, whether to include punctuation and symbols, sampling, source file formats, …); they are all listed in the help.
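The word-based sampling mentioned above can be pictured with the following sketch. It is not the implementation behind `--sampling`: the function name `sample_words` is hypothetical, and both splitting on whitespace and dropping a trailing incomplete sample are assumptions made for the example.

```python
def sample_words(text, sample_size=1000):
    """Cut `text` into consecutive samples of `sample_size` words.

    Assumptions for illustration: words are whitespace-separated, and a
    final chunk shorter than `sample_size` is discarded.
    """
    words = text.split()
    last_start = len(words) - sample_size
    return [words[i:i + sample_size]
            for i in range(0, last_start + 1, sample_size)]


text = " ".join(f"w{i}" for i in range(2500))
samples = sample_words(text, sample_size=1000)
print(len(samples))  # 2
```

Under these assumptions, a 2500-word text yields two samples of 1000 words each, and the remaining 500 words are left out.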

Optional: Merge different features

You can merge several sets of features, extracted as CSV files with the previous commands, by running:

python merge_datasets.csv.py -o merged.csv char3grams.csv words.csv affixes.csv
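Conceptually, merging feature sets means putting the columns of several tables side by side for the same documents. The sketch below shows that idea with plain dictionaries; it is not the logic of the merge script. The function name `merge_feature_tables`, the in-memory layout (document name mapped to a feature dictionary), and the choice to keep only documents present in every table are all assumptions.

```python
def merge_feature_tables(*tables):
    """Merge tables of the form {document: {feature: value}} column-wise.

    Assumptions for illustration: only documents present in every table
    are kept, and on a feature-name clash the later table wins.
    """
    if not tables:
        return {}
    shared_docs = set(tables[0]).intersection(*tables[1:])
    merged = {doc: {} for doc in sorted(shared_docs)}
    for table in tables:
        for doc in merged:
            merged[doc].update(table[doc])
    return merged


char3grams = {"Moliere_Amphitryon.txt": {"amp": 0.01}}
words = {"Moliere_Amphitryon.txt": {"le": 0.05}}
print(merge_feature_tables(char3grams, words))
# {'Moliere_Amphitryon.txt': {'amp': 0.01, 'le': 0.05}}
```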

Optional: Do a fixed split

You can either perform k-fold cross-validation (including leave-one-out), in which case this step is unnecessary, or do a classical random train/test split.

If you want to do an initial random split,

python split.py feats_tests.csv
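For readers unfamiliar with the idea, a random train/test split can be sketched as below. This is not the logic of split.py: the function name `random_split`, the default test fraction, and the fixed seed are hypothetical choices made for the example.

```python
import random


def random_split(names, test_fraction=0.1, seed=42):
    """Shuffle `names` and return (train, test) lists.

    `test_fraction` and `seed` are illustrative defaults, not values
    taken from Superstyl.
    """
    rng = random.Random(seed)
    shuffled = list(names)
    rng.shuffle(shuffled)
    # Keep at least one item in the test set.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


docs = [f"doc{i}" for i in range(10)]
train, test = random_split(docs, test_fraction=0.2)
print(len(train), len(test))  # 8 2
```

Fixing the seed makes the split reproducible, which plays a role similar to saving the split so that it can be reused later.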

If you want to split according to an existing JSON file,

python split.py feats_tests.csv -s split.json

There are other available options, see --help, e.g.

python split.py feats_tests.csv -m langcert_revised.csv -e wilhelmus_train.csv

Train svm

It's quite simple really,

python train_svm.py path-to-train-data.csv [--test_path TEST_PATH] [--cross_validate {leave-one-out,k-fold}] [--k K] [--dim_reduc {pca}] [--norms] [--balance {class_weight,downsampling,Tomek,upsampling,SMOTE,SMOTETomek}] [--class_weights] [--kernel {LinearSVC,linear,polynomial,rbf,sigmoid}] [--final] [--get_coefs]

For instance, using leave-one-out or 10-fold cross-validation

# e.g.
python train_svm.py data/feats_tests_train.csv --norms --cross_validate leave-one-out
python train_svm.py data/feats_tests_train.csv --norms --cross_validate k-fold --k 10
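The difference between the two cross-validation modes can be illustrated with index lists. The sketch below is generic and is not how train_svm.py builds its folds: the function names are hypothetical, and the samples are assigned to folds in a simple interleaved way without shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Illustrative assumption: sample i goes to fold i % k, no shuffling.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        yield train, held_out


def leave_one_out_indices(n_samples):
    """Leave-one-out is k-fold with as many folds as samples."""
    return k_fold_indices(n_samples, n_samples)


for train, test in leave_one_out_indices(3):
    print(train, test)
# [1, 2] [0]
# [0, 2] [1]
# [0, 1] [2]
```

With leave-one-out, the model is trained once per sample, so it costs more on large corpora than 10-fold cross-validation, which trains only ten models.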

Or a train/test split

# e.g.
python train_svm.py data/feats_tests_train.csv --test_path test_feats.csv --norms

And for a final analysis, applied on unseen data:

# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --final

With a few more options,

# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --class_weights --final --get_coefs

Sources

Cite this repository

You can cite it using the CITATION.cff file (and GitHub's citation feature), as follows:

@software{Camps_SUPERvised_STYLometry_SuperStyl_2021,
  author = {Camps, Jean-Baptiste},
  doi = {...},
  month = {...},
  title = {{SUPERvised STYLometry (SuperStyl)}},
  version = {...},
  year = {2021}
}