Releases: Hebaelfiqi/PWC4.5

PWC4.5 v1.0.0

@Hebaelfiqi Hebaelfiqi released this 02 Jul 15:33

This release accompanies the doctoral research of Heba El-Fiqi (PhD thesis, Detection of Translator Stylometry using Pair-wise Comparative Classification and Network Motif Mining, UNSW Canberra, 2013) and its associated publications. It provides the reference implementation of the PWC4.5 algorithm together with the processed datasets used in that research.

Algorithm

PWC4.5 is a decision tree algorithm for Pairwise Comparative Classification Problems (PWCCP) that extends the C4.5 induction algorithm. In PWCCP, instances are organised in pairs and the discriminative signal resides in the relationship between the paired feature values rather than in individual instances; PWC4.5 selects splits on induced within-pair relations (minimum / equal / maximum) using C4.5's gain-ratio criterion. The algorithm is introduced in the 2013 thesis and described in detail in El-Fiqi, Petraki, and Abbass (ACM TALLIP, 2016).
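As a rough illustration of the idea, the sketch below induces a minimum / equal / maximum relation from a pair of feature values and scores a split on that relation with the standard C4.5 gain ratio. It is a minimal Python sketch under stated assumptions, not the code shipped in the jar: the function names (`relation`, `entropy`, `gain_ratio`), the toy data, and the meaning of the labels are all hypothetical, and the reference implementation may differ in its details.

```python
from collections import Counter
from math import log2


def relation(a, b):
    """Relation of the first value to the second within a pair (assumed encoding)."""
    if a < b:
        return "min"
    if a > b:
        return "max"
    return "equal"


def entropy(values):
    """Shannon entropy, in bits, of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(relations, labels):
    """Standard C4.5 gain ratio for a split on a categorical attribute."""
    total = len(labels)
    groups = {}
    for r, y in zip(relations, labels):
        groups.setdefault(r, []).append(y)
    # Expected entropy of the labels after splitting on the relation.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalises attributes with many small branches.
    split_info = entropy(relations)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: (value in first instance, value in second instance, label).
pairs = [(0.2, 0.9, "A"), (0.1, 0.4, "A"), (0.8, 0.3, "B"), (0.7, 0.7, "B")]
relations = [relation(a, b) for a, b, _ in pairs]
labels = [y for _, _, y in pairs]
print(relations, gain_ratio(relations, labels))
```

The point of the sketch is that the split attribute is the relation between the two paired values, not either raw value on its own.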

Datasets

The release contains the derived feature representations used in the experiments; it does not redistribute any source texts.

  • Synthetic benchmarks — 2D and 5D XOR-based datasets across eight noise levels, each with ten independently sampled replications (as used in the thesis and the 2016 TALLIP article).
  • Translator stylometry — 21 pairwise datasets of network-motif frequency features for the seven-translator Arabic-to-English Holy Qur'an corpus (74 chapters; the final six parts / juz'). Each instance is a numeric feature vector — the frequencies of the 13 size-three and 199 size-four directed word-adjacency network motifs — labelled by translator. This feature representation follows the network-motif approach of the 2013 thesis and El-Fiqi, Petraki, and Abbass (PLOS ONE, 2019).
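The figure of 21 datasets matches the number of unordered pairs that can be formed from seven translators. The short Python sketch below checks that arithmetic, assuming (this is not stated explicitly above) that there is one dataset per translator pair and that the size-three and size-four motif frequencies are concatenated into a single vector. The shortened translator names are illustrative labels only.

```python
from itertools import combinations

# Shortened labels for the seven translators named in the acknowledgements.
translators = [
    "Asad", "Daryabadi", "Maududi", "Pickthall",
    "Raza Khan", "Sarwar", "Yusuf Ali",
]

# Assumption: one pairwise dataset per unordered pair of translators.
pairs = list(combinations(translators, 2))

# Assumption: size-three and size-four motif frequencies form one vector.
n_features = 13 + 199

print(len(pairs), n_features)
```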

Data provenance and acknowledgements

The features were derived from third-party resources, gratefully acknowledged:

  • Translations — obtained from the Tanzil Qur'an project (translators: Muhammad Asad, Abdul Majid Daryabadi, Abul Ala Maududi, Mohammed Marmaduke Pickthall, Ahmed Raza Khan, Muhammad Sarwar, Abdullah Yusuf Ali).
  • Preprocessing (lemmatization) — the Natural Language Toolkit (NLTK).
  • Network-motif counting (size 3 and 4) — the Mfinder motif-detection tool.

Only the resulting numeric feature vectors are distributed here; the underlying translation texts are not included.

Usage

The attached pwc45-1.0.0.jar is self-contained (Apache Commons CLI bundled) and requires only a Java runtime (Java 7 or later; verified with Temurin 17):

java -jar pwc45-1.0.0.jar -ip data//2d_data//1st_exp// -f 2D_Noise_0.0 -u

License and citation

The source code is released under the MIT License and the datasets under CC BY 4.0. Please cite the publications listed in the repository README / CITATION.cff. Primary references:

  • PWC4.5 algorithm: El-Fiqi, Petraki, and Abbass, ACM TALLIP 16(1), art. 2, 2016 (doi:10.1145/2898997).
  • Network-motif translator features: El-Fiqi, Petraki, and Abbass, PLOS ONE 14(2):e0211809, 2019 (doi:10.1371/journal.pone.0211809).