
LM-plagiarism

Repository of the paper "Do Language Models Plagiarize?", published in the Proceedings of the ACM Web Conference 2023 (WWW '23). The full paper can be found here.

Due to the large size of the data, we uploaded everything to Zenodo.

Training Corpus

GPT-2's pre-training data is accessible via this link. All fine-tuning datasets used in our experiments can be found in the finetuning_data.zip file. Special thanks to the original creators of the fine-tuning datasets: ArXiv, Cord-19, and Patent Claims.

GPT-2 Generated Text

All GPT-2 generated texts can be found in the GPT_generated_text.zip file. Documents in each txt file are separated by '===================='. Note that texts generated by the pre-trained GPT-2 originally come from OpenAI. If a file name includes '_v2', it is a secondary file that fills out the document count. For example, the PatentGPT folder contains patentGPT_10000_temp.txt and PatentGPT_10000_temp_v2.txt; both were generated by PatentGPT with the temperature decoding setting, and together they sum to 10,000 documents. We also uploaded the fine-tuned models' checkpoints in checkpoints.zip.
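As a quick illustration, the following Python sketch splits one of the generated-text files into individual documents using that separator; the file name is taken from the example above.

# Split a generated-text file into individual documents using the
# '====================' separator documented above.
SEPARATOR = "===================="

with open("patentGPT_10000_temp.txt", encoding="utf-8") as f:
    raw = f.read()

documents = [doc.strip() for doc in raw.split(SEPARATOR) if doc.strip()]
print(f"Loaded {len(documents)} documents")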

Automatic Plagiarism Detection

STEP 1. Finding Top-𝑛′ Candidate Documents

In our work, we leverage Elasticsearch to store the training corpus and retrieve candidate documents for plagiarism detection. A detailed description of the Elasticsearch setup can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html.
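For reference, below is a minimal sketch of this step using the official elasticsearch Python client. The index name (training_corpus), field name (text), and the size value standing in for n′ are illustrative assumptions, not the paper's exact configuration.

# Sketch of STEP 1: index training samples, then retrieve the
# top-n' candidates for one machine-generated text.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

training_samples = ["...", "..."]  # your training corpus, one string per document
generated_text = "..."             # one machine-generated document

# Index each training sample as its own Elasticsearch document.
for i, sample in enumerate(training_samples):
    es.index(index="training_corpus", id=i, document={"text": sample})

# BM25 match query returns the most similar training samples.
resp = es.search(
    index="training_corpus",
    query={"match": {"text": generated_text}},
    size=10,  # n' = 10 is an illustrative choice, not the paper's setting
)
candidate_ids = [hit["_id"] for hit in resp["hits"]["hits"]]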

STEP 2. Finding Plagiarized Text Pairs and Plagiarism Type

To extract plagiarism types and locations, unzip the plagiarism_detection_tool.zip file and run PAN2015_plagiarism_detection.py using this command:

python PAN2015_plagiarism_detection.py

Make sure to replace the values OUTPUT_DIRECTORY, FILE, and INDEX with your own configuration in the source code before running. Many thanks to the original creator of this tool, which is available at https://www.gelbukh.com/plagiarism-detection/PAN-2015/.
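For illustration only, the edited constants might look like the following; the actual variable layout inside PAN2015_plagiarism_detection.py may differ, so check the script itself.

# Hypothetical example of the configuration values referenced above.
OUTPUT_DIRECTORY = "/path/to/output"                   # where detection results are written
FILE = "GPT_generated_text/patentGPT_10000_temp.txt"   # generated texts to check
INDEX = "training_corpus"                              # Elasticsearch index holding the corpus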

Analysis

You can refer to analysis.ipynb to reproduce the reported plagiarism results. We uploaded all identified plagiarism cases in the plagiarism_cases.zip file. Please note that source_id represents the index of machine-generated texts and susp_id represents the index of training samples. These numbers may not be reproducible in your own experiments because they depend on how the training samples are indexed in Elasticsearch.
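As a hedged sketch, assuming the identified cases are shipped as a CSV with source_id, susp_id, and a plagiarism-type column (check the actual file names and schema inside plagiarism_cases.zip), they can be inspected like this:

# Load and summarize the identified plagiarism cases; the file name
# and column names here are assumptions, not the archive's actual schema.
import pandas as pd

cases = pd.read_csv("plagiarism_cases/cases.csv")

# source_id indexes the machine-generated texts, susp_id the training
# samples (per the note above).
print(cases["type"].value_counts())  # counts per plagiarism type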


Please cite our paper with the following BibTeX:

@inproceedings{10.1145/3543507.3583199,
author = {Lee, Jooyoung and Le, Thai and Chen, Jinghui and Lee, Dongwon},
title = {Do Language Models Plagiarize?},
year = {2023},
isbn = {9781450394161},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3543507.3583199},
doi = {10.1145/3543507.3583199},
abstract = {Past literature has illustrated that language models (LMs) often memorize parts of training instances and reproduce them in natural language generation (NLG) processes. However, it is unclear to what extent LMs “reuse” a training corpus. For instance, models can generate paraphrased sentences that are contextually similar to training samples. In this work, therefore, we study three types of plagiarism (i.e., verbatim, paraphrase, and idea) among GPT-2 generated texts, in comparison to its training data, and further analyze the plagiarism patterns of fine-tuned LMs with domain-specific corpora which are extensively used in practice. Our results suggest that (1) three types of plagiarism widely exist in LMs beyond memorization, (2) both size and decoding methods of LMs are strongly associated with the degrees of plagiarism they exhibit, and (3) fine-tuned LMs’ plagiarism patterns vary based on their corpus similarity and homogeneity. Given that a majority of LMs’ training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications. Their patterns are likely to exacerbate as both the size of LMs and their training data increase, raising concerns about indiscriminately pursuing larger models with larger training corpora. Plagiarized content can also contain individuals’ personal and sensitive information. These findings overall cast doubt on the practicality of current LMs in mission-critical writing tasks and urge more discussions around the observed phenomena. Data and source code are available at https://github.com/Brit7777/LM-plagiarism.},
booktitle = {Proceedings of the ACM Web Conference 2023},
pages = {3637–3647},
numpages = {11},
keywords = {Language Models, Plagiarism, Natural Language Generation},
location = {Austin, TX, USA},
series = {WWW '23}
}
