# Detect LLM text #

### This notebook serves two purposes: ###

1. My Capstone project for the course "Data Science & AI (2024)" at UTS Institute of Data 
2. My submission to the Kaggle competition "LLM - Detect AI Generated Text" found at https://www.kaggle.com/competitions/llm-detect-ai-generated-text 

The Kaggle competition was posted just over a year ago (Nov 2023) and has had a very large response from the community: 

- 19,362 Entrants
- 5,264 Participants
- 4,358 Teams
- 110,052 Submissions

My goal is to achieve the best possible result in the competition by trying different cutting-edge approaches to text classification, and by making the best use of the training data itself, e.g. through synthesis, additional sources, and cleaning strategies.

## Data analysis ##

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Data from the competition

train_essays = pd.read_csv('data/comp_data/train_essays.csv')

In [6]:
train_essays.sample(10)

Unnamed: 0,id,prompt_id,text,generated
1153,d55ac878,1,"Dear, We dont need another voting crisis over ...",0
1020,b6aa5fd9,1,The Electoral College is a process that has be...,0
349,44e00070,0,The advantages of limiting car use can help th...,0
633,75383a8e,1,"Dear Senator, I think that Electoral College i...",0
644,766d1c26,1,"Dear state senator, After researching the Elec...",0
594,7014633b,1,"Dear Governor, I believe we need to keep the E...",0
732,85f97618,1,"Dear dumb Republican , The Electoral College i...",0
24,05665390,1,"Dear Florida State Senator, Although many coul...",0
493,606ec542,0,I think limiting car usage is a great idea for...,0
371,48cd2f1e,1,Americans throughout the country believe that ...,0


In [7]:
train_essays.shape

(1378, 4)

In [12]:
train_essays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1378 entries, 0 to 1377
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         1378 non-null   object
 1   prompt_id  1378 non-null   int64 
 2   text       1378 non-null   object
 3   generated  1378 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 43.2+ KB


In [16]:
# Prompts relevant to the competition

prompts = pd.read_csv('data/comp_data/train_prompts.csv')
prompts


Unnamed: 0,prompt_id,prompt_name,instructions,source_text
0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
1,1,Does the electoral college work?,Write a letter to your state senator in which ...,# What Is the Electoral College? by the Office...


In [13]:
# Number of responses to each prompt in competition data 

train_essays['prompt_id'].value_counts()

prompt_id
0    708
1    670
Name: count, dtype: int64

So the competition data contains:

- 708 essays for "Car-free cities" (prompt_id 0)
- 670 essays for "Does the electoral college work?" (prompt_id 1)
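
The mapping from `prompt_id` to prompt name can also be done in code rather than by eye. A minimal sketch, assuming the two frames share a `prompt_id` column as shown above; the toy frames and the names `toy_essays`, `toy_prompts`, `named` and `counts_by_name` are illustrative stand-ins, not the real data:

```python
import pandas as pd

# Toy stand-ins for train_essays and prompts (illustrative values, not the real data)
toy_essays = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4", "e5"],
    "prompt_id": [0, 1, 1, 0, 1],
})
toy_prompts = pd.DataFrame({
    "prompt_id": [0, 1],
    "prompt_name": ["Car-free cities", "Does the electoral college work?"],
})

# Attach the prompt name to each essay, then count essays per prompt name
named = toy_essays.merge(
    toy_prompts[["prompt_id", "prompt_name"]], on="prompt_id", how="left"
)
counts_by_name = named["prompt_name"].value_counts()
print(counts_by_name)
```

On the real frames, the same merge would be applied to `train_essays` and `prompts`.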

In [17]:
# Additional data from https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset/data

drcat_02 = pd.read_csv('data/train_v2_drcat_02.csv')
drcat_02.sample(10)

Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
7164,"From ballet to yearbook, there are many extrac...",0,Mandatory extracurricular activities,persuade_corpus,False
7863,Have you ever considered the fact that not man...,0,Mandatory extracurricular activities,persuade_corpus,False
14042,Picking up trash will be good for the communit...,0,Community service,persuade_corpus,False
36586,Inactivity has long been associated with negat...,1,Distance learning,mistral7binstruct_v1,False
28837,"When it comes to working with a group, there a...",1,Distance learning,chat_gpt_moth,False
3139,Certain schools make students complete a summe...,0,Summer projects,persuade_corpus,False
18504,The idea of driveless cars would be a brillian...,0,Driverless cars,persuade_corpus,True
3758,"As a high school student, school is very stres...",0,Summer projects,persuade_corpus,False
14357,"Dear Principal,\n\nEvery student should perfor...",0,Community service,persuade_corpus,False
6660,Students should be in at least one or more ext...,0,Mandatory extracurricular activities,persuade_corpus,False


In [10]:
drcat_02['prompt_name'].value_counts()

prompt_name
Distance learning                        5554
Seeking multiple opinions                5176
Car-free cities                          4717
Does the electoral college work?         4434
Facial action coding system              3084
Mandatory extracurricular activities     3077
Summer projects                          2701
Driverless cars                          2250
Exploring Venus                          2176
Cell phones at school                    2119
Grades for extracurricular activities    2116
Community service                        2092
"A Cowboy Who Rode the Waves"            1896
The Face on Mars                         1893
Phones and driving                       1583
Name: count, dtype: int64

In [25]:
# Isolating only responses relevant to the 2 prompts from the competition 

comp_prompts = ['Car-free cities', 'Does the electoral college work?']

daigt_data = drcat_02[drcat_02['prompt_name'].isin(comp_prompts)]

daigt_data.head()

Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
1168,Cars have been around for awhile and they have...,0,Car-free cities,persuade_corpus,True
1169,Have you ever thought what it would be like no...,0,Car-free cities,persuade_corpus,True
1170,What you are about to read is going to give yo...,0,Car-free cities,persuade_corpus,True
1171,cars have many flaws nd and in this day and ag...,0,Car-free cities,persuade_corpus,True
1172,There are many advantages of limiting car usag...,0,Car-free cities,persuade_corpus,True


In [27]:
daigt_data['prompt_name'].value_counts()

prompt_name
Car-free cities                     4717
Does the electoral college work?    4434
Name: count, dtype: int64

In [28]:
daigt_data['source'].value_counts()

source
persuade_corpus                       4005
kingki19_palm                         1384
train_essays                          1378
radek_500                              500
NousResearch/Llama-2-7b-chat-hf        400
mistralai/Mistral-7B-Instruct-v0.1     399
llama_70b_v1                           219
radekgpt4                              200
falcon_180b_v1                         181
darragh_claude_v7                      123
darragh_claude_v6                      104
cohere-command                          99
palm-text-bison1                        98
mistral7binstruct_v1                    18
mistral7binstruct_v2                    16
chat_gpt_moth                           15
llama2_chat                             12
Name: count, dtype: int64
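
The `train_essays` source accounts for 1,378 rows, the same number of rows as the competition file loaded earlier, so combining the two sets would likely count those essays twice. A minimal sketch of one way to handle this, assuming duplicates can be identified by an exact match on `text`; the toy frames and the `source="competition"` tag are illustrative and not part of the original data:

```python
import pandas as pd

# Toy stand-ins (illustrative): competition essays, and an external set that already contains them
toy_comp = pd.DataFrame({
    "text": ["essay A", "essay B"],
    "generated": [0, 0],
})
toy_external = pd.DataFrame({
    "text": ["essay A", "essay B", "essay C", "essay D"],
    "label": [0, 0, 1, 1],
    "source": ["train_essays", "train_essays", "radek_500", "kingki19_palm"],
})

# Align column names, stack both sets, and keep the first copy of each distinct text
combined = pd.concat(
    [
        toy_comp.rename(columns={"generated": "label"}).assign(source="competition"),
        toy_external,
    ],
    ignore_index=True,
)
deduped = combined.drop_duplicates(subset="text", keep="first").reset_index(drop=True)
print(len(combined), len(deduped))
```

Exact matching will miss near-duplicates (for example, texts that differ only in whitespace), so this is a starting point rather than a complete cleaning step.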

In [None]:
daigt_data

Sources of the DAIGT V2 dataset, as credited by its author (please upvote the original datasets!):

- Text generated with ChatGPT by MOTH (https://www.kaggle.com/datasets/alejopaullier/daigt-external-dataset)
- Persuade corpus contributed by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/persaude-corpus-2/)
- Text generated with Llama-70b and Falcon180b by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b)
- Text generated with ChatGPT and GPT4 by Radek (https://www.kaggle.com/datasets/radek1/llm-generated-essays)
- 2000 Claude essays generated by @darraghdog (https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic)
- LLM-generated essay using PaLM from Google Gen-AI by @kingki19 (https://www.kaggle.com/datasets/kingki19/llm-generated-essay-using-palm-from-google-gen-ai)
- Official train essays
- Essays generated by the dataset author with various LLMs

License: MIT for the data generated by the dataset author. Check source datasets for the other sources mentioned above.

In [29]:
daigt_data['label'].value_counts()

label
0    5380
1    3771
Name: count, dtype: int64
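
In relative terms this is roughly 59% human-written to 41% LLM-generated. A quick way to get the proportions, using the counts printed above (the names `label_counts` and `label_share` are illustrative):

```python
import pandas as pd

# Label counts copied from the output above (0 = human, 1 = LLM)
label_counts = pd.Series({0: 5380, 1: 3771}, name="count")

# Share of each class; on the real frame,
# daigt_data['label'].value_counts(normalize=True) gives the same proportions
label_share = label_counts / label_counts.sum()
print(label_share.round(3))
```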

In [None]:
# Summarise label counts by source (0 = human, 1 = LLM)

source_counts = pd.pivot_table(daigt_data, index='source', columns='label', aggfunc='size', fill_value=0)

print(source_counts)


label                                  0     1
source                                        
NousResearch/Llama-2-7b-chat-hf        0   400
chat_gpt_moth                          0    15
cohere-command                         0    99
darragh_claude_v6                      0   104
darragh_claude_v7                      0   123
falcon_180b_v1                         0   181
kingki19_palm                          0  1384
llama2_chat                            0    12
llama_70b_v1                           0   219
mistral7binstruct_v1                   0    18
mistral7binstruct_v2                   0    16
mistralai/Mistral-7B-Instruct-v0.1     0   399
palm-text-bison1                       0    98
persuade_corpus                     4005     0
radek_500                              0   500
radekgpt4                              0   200
train_essays                        1375     3
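
The table shows that `train_essays` is the only source containing both labels (1,375 human, 3 LLM); every other source is entirely one class. A minimal sketch of how such mixed sources could be picked out programmatically, using `pd.crosstab` as an alternative to the pivot table above; the toy frame and the names `toy`, `toy_counts` and `mixed_sources` are illustrative:

```python
import pandas as pd

# Toy stand-in for daigt_data (illustrative rows only)
toy = pd.DataFrame({
    "source": ["persuade_corpus", "persuade_corpus", "radek_500",
               "train_essays", "train_essays"],
    "label": [0, 0, 1, 0, 1],
})

# Same kind of source-by-label count table as the pivot table, built with pd.crosstab
toy_counts = pd.crosstab(toy["source"], toy["label"])

# A source is "mixed" when it has at least one row under every label
mixed_sources = toy_counts.index[(toy_counts > 0).all(axis=1)].tolist()
print(mixed_sources)
```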

