COLX_585_Trends_in_computational_linguistics_project

Abstract

In the field of Natural Language Processing (NLP), there has been a dramatic shift towards using pre-trained deep language models. This study performs text classification with state-of-the-art neural network models, Bidirectional Encoder Representations from Transformers (BERT) and a convolutional neural network (CNN), as well as a non-neural classifier, Logistic Regression (LR), as a baseline. Specifically, the task is to distinguish human-written text from fake text generated by the GPT-2 language model.
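
As a rough illustration of the baseline only, the sketch below shows what a TF-IDF + Logistic Regression classifier for this task could look like; the file name, column names, and feature settings here are assumptions rather than the study's actual configuration (see the Logistic Regression notebook for the real pipeline).

# Minimal sketch of an LR baseline for human vs. GPT-2 text classification.
# Assumes a hypothetical CSV with "text" and "label" columns (1 = human, 0 = GPT-2).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("documents.csv")  # hypothetical path
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# Unigram/bigram TF-IDF features feed a plain logistic regression classifier.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

predictions = clf.predict(vectorizer.transform(X_test))
print(f"F-score: {f1_score(y_test, predictions):.4f}")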

Results show that BERT beat the baseline, achieving an F-score of 90.34% compared to LR's 88.23%. Both BERT and LR outperformed the CNN, which attained an F-score of 80.41%. The error analysis further confirms previous findings that sequence length affects the performance of neural network models; for example, truncating input documents has a detrimental effect on BERT's performance. The primary contribution of this study is to introduce a simple but effective model for fake text detection.
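
To make the truncation point concrete, the snippet below is a small sketch (not the study's code) of how a long document loses its tail when encoded for BERT; the 512-token cap is BERT's architectural maximum, and the Hugging Face transformers tokenizer is assumed here purely for illustration.

# Sketch: tokens beyond the max_length cap are discarded before BERT ever sees them.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
long_document = " ".join(["word"] * 2000)  # stand-in for a long input document

encoded = tokenizer(long_document, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512 -- everything after the cap is dropped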

Final Report and Presentation Links

Presentation Link

Report Link

Jupyter Notebooks

Logistic Regression

Convolutional Neural Network

BERT

About

This was a project I worked on for my coursework at UBC Vancouver, Master of Data Science in Computational Linguistics.
