Home
Kody Moodley edited this page Apr 13, 2021
A Python script for measuring the overlap, or agreement, between text similarity and citation/reference relationships within an input set of documents.
DoConA expects the following inputs, placed together in one folder:

- A set of textual documents in `.txt` format. Each file should be named with a unique identifier representing that document, e.g. `10345.txt`.
- A CSV file called `citations.csv` with two columns: the first column contains the unique ID of a citing or referring document; the second column contains the unique ID of the document that is cited or referred to by the document in the first column.
- A CSV file called `sample.csv` with one column containing the unique IDs of a sample drawn from the full set of documents. DoConA will be run on these sampled documents.
- Optional: a CSV file called `stopwords.csv` containing a list of words to remove from each text document during the preprocessing phase of the DoConA pipeline. The file should contain exactly one column with no header, with each word on its own line.
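As a rough illustration of the input layout described above, the following sketch loads the four inputs from one folder using only the Python standard library. The helper names `read_column` and `load_inputs` are hypothetical and are not part of DoConA. The sketch also assumes that `citations.csv` and `sample.csv` have no header row, which the description above does not state.

```python
import csv
from pathlib import Path


def read_column(path, column=0):
    """Return one column of a header-less CSV file as a list of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[column].strip() for row in csv.reader(handle) if row]


def load_inputs(folder):
    """Load DoConA-style inputs from `folder` (hypothetical helper, not DoConA's API)."""
    folder = Path(folder)

    # Map each document ID (the file name without ".txt") to its text.
    documents = {p.stem: p.read_text(encoding="utf-8") for p in folder.glob("*.txt")}

    # Each citation is stored as a (citing_id, cited_id) pair.
    # Assumption: citations.csv has no header row.
    with open(folder / "citations.csv", newline="", encoding="utf-8") as handle:
        citations = {(row[0].strip(), row[1].strip())
                     for row in csv.reader(handle) if len(row) >= 2}

    # Assumption: sample.csv has no header row.
    sample = read_column(folder / "sample.csv")

    # stopwords.csv is optional, so fall back to an empty set.
    stop_path = folder / "stopwords.csv"
    stopwords = set(read_column(stop_path)) if stop_path.exists() else set()

    # Collect sampled IDs that have no matching .txt file.
    missing = [doc_id for doc_id in sample if doc_id not in documents]
    return documents, citations, sample, stopwords, missing
```

Returning the `missing` list makes it easy to spot IDs in `sample.csv` that do not correspond to any text file before running a longer analysis.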
The arrangement of the files in your folder should look like the following image:
Text similarity methods used by DoConA:

- TF-IDF
- Jaccard distance
- N-grams
- Pre-trained and custom word2vec word embeddings and doc2vec document embeddings
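
To make two of the listed methods concrete, here is a minimal sketch that combines word n-grams with the Jaccard measure. It is not DoConA's implementation: the function names, whitespace tokenisation, and the choice of bigrams are assumptions made for illustration. It computes Jaccard similarity, which is 1 minus the Jaccard distance.

```python
def word_ngrams(text, n=2, stopwords=frozenset()):
    """Return the set of word n-grams in `text`, after dropping stopwords."""
    # Assumption: simple lowercasing and whitespace tokenisation.
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch: two empty sets score 0
    return len(a & b) / len(a | b)


doc_a = "the court dismissed the appeal"
doc_b = "the court allowed the appeal"
score = jaccard_similarity(word_ngrams(doc_a), word_ngrams(doc_b))
print(round(score, 3))  # prints 0.333 (2 shared bigrams out of 6 distinct)
```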
DoConA will generate a `results.csv` file which looks like this:
The file contains the following five columns:

- `source_document`: a unique document identifier.
- `similar_document`: another unique document identifier, representing a document which is similar to `source_document`.
- `similarity_score`: a floating point number between 0 and 1 representing the degree to which `source_document` and `similar_document` are similar according to a particular text similarity measure.
- `method`: a name for the text similarity measure used to generate the number in `similarity_score`.
- `citation_link`: a boolean value which specifies whether `source_document` also cites `similar_document` (or vice versa) in the citation network of the input documents.
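
One way to read the overlap between similarity and citations from a file with these five columns is to compute, per method, the share of similar pairs that also have a citation link. The sketch below does this with the standard library. The example rows are made up for illustration and are not real DoConA output. The method names, the header row, and the `True`/`False` text encoding of `citation_link` are all assumptions.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the column layout described above; not real DoConA output.
EXAMPLE_RESULTS = """source_document,similar_document,similarity_score,method,citation_link
10345,10346,0.82,tfidf,True
10345,10399,0.40,tfidf,False
10345,10346,0.55,jaccard,True
10345,10400,0.51,jaccard,True
"""


def agreement_by_method(handle):
    """Return {method: share of similar pairs that are also citation-linked}."""
    linked = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(handle):
        total[row["method"]] += 1
        # Assumption: booleans are serialised as text such as "True" / "False".
        if row["citation_link"].strip().lower() == "true":
            linked[row["method"]] += 1
    return {method: linked[method] / total[method] for method in total}


print(agreement_by_method(io.StringIO(EXAMPLE_RESULTS)))
# prints {'tfidf': 0.5, 'jaccard': 1.0}
```

To apply the same function to an actual output file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open("results.csv", newline="", encoding="utf-8")`.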
Please see the main README of this repository for instructions on how to run the pipeline.