# Experiments 1 & 2 - Dataset 1

Experiments 1 and 2 are run on the [GPT vs. Human: A Corpus of Research Abstracts](https://www.kaggle.com/datasets/heleneeriksen/gpt-vs-human-a-corpus-of-research-abstracts) dataset.

Both experiments consist of training four ML models:
- [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- [Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
- [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)

Each model is tuned with a [Cross-Validation Grid Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find its best hyperparameters.

The only difference between the two experiments is the text representation:
- Experiment 1 uses Bag of Words features ([CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html))
- Experiment 2 uses TF-IDF features ([TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html))

### Notebook setup

In [None]:
# Core libraries
import os
import sys
import pickle

import pandas as pd
import kagglehub

import numpy as np
import torch

# ML libraries
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Plotting libraries
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.colors
plotly_colors = plotly.colors.qualitative.Plotly

## Data prepping

### Dataset download

In [None]:
path = kagglehub.dataset_download("heleneeriksen/gpt-vs-human-a-corpus-of-research-abstracts")
dataset_path = os.path.join(path, "data_set.csv")

data = pd.read_csv(dataset_path)
data.drop(columns=['title', 'ai_generated'], inplace=True)

# Size of the larger class; with the 10% margin added below it sets the shared y-axis limit
data_size = max(len(data[data['is_ai_generated'] == 0]), len(data[data['is_ai_generated'] == 1]))
data_size += 0.1*data_size

# Peek into the data
print("\nPeek into the dataset: heleneeriksen/gpt-vs-human-a-corpus-of-research-abstracts\n")
display(data)


fig = make_subplots(rows=1, cols=2, subplot_titles=('Original Dataset', 'Balanced Dataset'),
                    horizontal_spacing=0.3)

fig.add_trace(go.Histogram(x=data['is_ai_generated'], name='Original Dataset', marker_color=[plotly_colors[2], plotly_colors[1]]), row=1, col=1)

# Remove data to balance the dataset and speed up training
data = data.drop(data[data['is_ai_generated'] == 1].sample(53).index)
data = data.drop(data[data['is_ai_generated'] == 0].sample(200).index)


fig.add_trace(go.Histogram(x=data['is_ai_generated'], name='Balanced Dataset', marker_color=[plotly_colors[2], plotly_colors[1]]), row=1, col=2)

fig.update_layout(showlegend=False, width=700, bargap=0.4,
                  plot_bgcolor="rgba(0,0,0,0)", paper_bgcolor="rgba(0,0,0,0)", font_color="white",
                  xaxis=dict(tickmode='array', tickvals=[0, 1], ticktext=['human', 'ai']), xaxis2=dict(tickmode='array', tickvals=[0, 1], ticktext=['human', 'ai']),
                  yaxis=dict(title='Count'), yaxis2=dict(title='Count'), yaxis_range=[0, data_size], yaxis2_range=[0, data_size])
fig.show()


Peek into the dataset: heleneeriksen/gpt-vs-human-a-corpus-of-research-abstracts



| | abstract | is_ai_generated |
|---|---|---|
| 0 | Advanced electromagnetic potentials are indi... | 0 |
| 1 | This research paper investigates the question ... | 1 |
| 2 | We give an algorithm for finding network enc... | 0 |
| 3 | The paper presents an efficient centralized bi... | 1 |
| 4 | We introduce an exponential random graph mod... | 0 |
| ... | ... | ... |
| 4048 | This research paper investigates the vortex dy... | 1 |
| 4049 | Given a remarkable representation of the gen... | 0 |
| 4050 | The Veldkamp space of two-qubits is a mathemat... | 1 |
| 4051 | The equilibration of macroscopic degrees of ... | 0 |
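
The balancing step in the cell above drops a fixed number of rows per class. As an alternative, the sketch below downsamples every class to the size of the smallest one with a fixed seed, assuming the goal is equal class counts. The helper name `undersample` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def undersample(df, label_col, random_state=0):
    """Downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    # Draw n_min rows from each class; the seed makes the draw repeatable
    return df.groupby(label_col, group_keys=False).sample(
        n=n_min, random_state=random_state
    )


# Toy frame (hypothetical): five human rows and three AI rows
toy = pd.DataFrame({
    "abstract": [f"text {i}" for i in range(8)],
    "is_ai_generated": [0, 0, 0, 0, 0, 1, 1, 1],
})

balanced = undersample(toy, "is_ai_generated")
print(balanced["is_ai_generated"].value_counts().to_dict())
```

Because the class sizes are computed from the data, this version keeps working if the dataset is updated, whereas hard-coded drop counts would need to be recomputed.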


In [None]:
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(data['abstract'], data['is_ai_generated'], test_size=0.2, random_state=0)

# Plot the distribution of the split
fig = go.Figure()

fig.add_trace(go.Histogram(x=y_train, name="training", marker_color='lightslategray'))
fig.add_trace(go.Histogram(x=y_test, name="testing", marker_color='crimson'))

fig.update_layout(title_text='Split Dataset', xaxis_title_text='Labels', yaxis_title_text='Count',
                  barmode='overlay', bargap=0.4,
                  plot_bgcolor="rgba(0,0,0,0)", paper_bgcolor="rgba(0,0,0,0)", font_color="white",
                  xaxis=dict(tickmode='array', tickvals=[0, 1], ticktext=['human', 'ai']),
                  width=500, height=500)
fig.show()
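
The split above is purely random, so the class proportions in the two subsets can drift slightly. If identical proportions are wanted, `train_test_split` accepts a `stratify` argument. The sketch below shows this on toy labels; it is an optional variation, not what the notebook ran.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 human (0) and 10 AI (1) samples
texts = [f"abstract {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# stratify=labels keeps the 50/50 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

print(Counter(y_tr), Counter(y_te))
```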

## Methodology

### Libraries and experiments setup

In [None]:
# Cross-validation folds
cv = KFold(n_splits=5)

# Parameter grid for GridSearchCV
rf_param = {'classifier__n_estimators': [10, 50, 100, 200], 'classifier__max_depth': [None, 10, 50, 100]}
nb_param = {'classifier__alpha': [0.1, 0.5, 1.0, 2.0]}
lr_param = {'classifier__C': [0.1, 0.5, 1.0, 2.0], 'classifier__max_iter': [100, 200, 300, 400]}
svc_param = {'classifier__C': [0.1, 0.5, 1.0, 2.0], 'classifier__kernel': ['linear', 'poly', 'rbf', 'sigmoid']}

# Models to use
models = {'RandomForest': RandomForestClassifier(), 'NaiveBayes': MultinomialNB(), 'LogisticRegression': LogisticRegression(), 'SVC': SVC()}
parameters = {'RandomForest': rf_param, 'NaiveBayes': nb_param, 'LogisticRegression': lr_param, 'SVC': svc_param}
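
The `classifier__` prefix in the grids above follows scikit-learn's `<step name>__<parameter>` convention: it routes each value to the pipeline step named `classifier`. The sketch below shows the mechanism on a small made-up corpus with a reduced grid and two folds, so the toy texts, labels, and grid values are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): 0 = human, 1 = ai
texts = [
    "we prove a theorem", "this paper investigates graphs",
    "we give an algorithm", "this paper presents a method",
    "we introduce a model", "this research paper investigates vortices",
    "we study equilibration", "this paper explores networks",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# "classifier__alpha" means: parameter `alpha` of the step named "classifier"
grid = {"classifier__alpha": [0.1, 1.0]}

search = GridSearchCV(pipe, grid, cv=KFold(n_splits=2), scoring="accuracy")
search.fit(texts, labels)

# After fitting, the search exposes the winning combination and its mean CV score
print(search.best_params_)
print(round(search.best_score_, 2))
```

With the default `refit=True`, the best combination is refitted on the full training data, which is why calling `predict` directly on the fitted search object works.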

### Experiment 1

In [None]:
experiment_1 = dict()
predictions_1 = dict()

for model in models:
  # Create the model pipeline
  experiment_1[model+'_pip'] = Pipeline([('vectorizer', CountVectorizer()), ('classifier', models[model])])

  # Create the grid search model with the pipeline and parameters
  experiment_1[model] = GridSearchCV(experiment_1[model+'_pip'], parameters[model], cv=cv, scoring='accuracy', verbose=2)

  # Train & predict
  print(f"\n\tTraining the '{model}' model...")
  experiment_1[model].fit(X_train, y_train)
  predictions_1[model] = experiment_1[model].predict(X_test)


	Training the 'RandomForest' model...
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   2.1s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   1.1s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   1.1s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.4s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.4s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.0s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.0s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   0.9s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   0.9s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.3s
[CV] END classifier__max_depth=None, classifier__n

### Experiment 2

In [None]:
experiment_2 = dict()
predictions_2 = dict()

for model in models:
  # Create the model pipeline
  experiment_2[model+'_pip'] = Pipeline([('vectorizer', TfidfVectorizer()), ('classifier', models[model])])

  # Create the grid search model with the pipeline and parameters
  experiment_2[model] = GridSearchCV(experiment_2[model+'_pip'], parameters[model], cv=cv, scoring='accuracy', verbose=2)

  # Train & predict
  print(f"\n\tTraining the '{model}' model...")
  experiment_2[model].fit(X_train, y_train)
  predictions_2[model] = experiment_2[model].predict(X_test)


	Training the 'RandomForest' model...
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.7s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   1.0s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   1.0s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   1.2s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.8s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.1s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.0s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.0s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.0s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.0s
[CV] END classifier__max_depth=None, classifier__n

## Metrics and reports

### Classification Reports

In [None]:
cm_1 = dict()
cm_2 = dict()

for model in models:
  # Print the classification report
  print(f"\tExperiment 1 - '{model}' Classifier:")
  print(classification_report(y_test, predictions_1[model]))

  print(f"\n\tExperiment 2 - '{model}' Classifier:")
  print(classification_report(y_test, predictions_2[model]), "\n\n\n")

  # Compute the confusion matrices (plotted in the sections below)
  cm_1[model] = confusion_matrix(y_test, predictions_1[model])
  cm_2[model] = confusion_matrix(y_test, predictions_2[model])

	Experiment 1 - 'RandomForest' Classifier:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       377
           1       1.00      0.99      1.00       383

    accuracy                           1.00       760
   macro avg       1.00      1.00      1.00       760
weighted avg       1.00      1.00      1.00       760


	Experiment 2 - 'RandomForest' Classifier:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       377
           1       0.99      0.99      0.99       383

    accuracy                           0.99       760
   macro avg       0.99      0.99      0.99       760
weighted avg       0.99      0.99      0.99       760
 



	Experiment 1 - 'NaiveBayes' Classifier:
              precision    recall  f1-score   support

           0       0.98      0.97      0.98       377
           1       0.97      0.98      0.98       383

    accuracy                           0.98       
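
When reading the confusion matrices, it helps to recall the layout returned by `confusion_matrix`: rows are true labels, columns are predicted labels, and both follow the sorted label order (0 = human, then 1 = AI). A minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = human, 1 = ai
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]

# Rows = true labels, columns = predicted labels, both ordered [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# For a binary problem the four cells unpack in this order
tn, fp, fn, tp = cm.ravel()
print(f"human kept as human: {tn}, human flagged as AI: {fp}")
print(f"AI missed as human: {fn}, AI caught as AI: {tp}")
```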

### Experiment 1 - Confusion matrices

In [None]:
fig = make_subplots(rows=2, cols=2, subplot_titles=('Random Forest', 'Naive Bayes', 'Logistic Regression', 'Support Vector'),
                    horizontal_spacing=0.3, vertical_spacing=0.25)

for model in models:
  match model:
    case 'RandomForest':
      pos = [1, 1]
    case 'NaiveBayes':
      pos = [1, 2]
    case 'LogisticRegression':
      pos = [2, 1]
    case 'SVC':
      pos = [2, 2]

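  # confusion_matrix orders labels as [0, 1] = [human, ai]; rows (y) are true labels, columns (x) are predictions
  fig.add_trace(go.Heatmap(z=cm_1[model], x=['Human', 'AI'], y=['Human', 'AI'], coloraxis='coloraxis', text=cm_1[model], texttemplate="%{text}"),
                row=pos[0], col=pos[1])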

fig.update_layout(title="Experiment 1 - Bag of words", coloraxis=dict(colorscale='Burgyl'), showlegend=False, plot_bgcolor="rgba(0,0,0,0)", paper_bgcolor="rgba(0,0,0,0)", font_color="white", width=700)
fig.show()

### Experiment 2 - Confusion matrices

In [None]:
fig = make_subplots(rows=2, cols=2, subplot_titles=('Random Forest', 'Naive Bayes', 'Logistic Regression', 'Support Vector'),
                    horizontal_spacing=0.3, vertical_spacing=0.25)

for model in models:
  match model:
    case 'RandomForest':
      pos = [1, 1]
    case 'NaiveBayes':
      pos = [1, 2]
    case 'LogisticRegression':
      pos = [2, 1]
    case 'SVC':
      pos = [2, 2]

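  # confusion_matrix orders labels as [0, 1] = [human, ai]; rows (y) are true labels, columns (x) are predictions
  fig.add_trace(go.Heatmap(z=cm_2[model], x=['Human', 'AI'], y=['Human', 'AI'], coloraxis='coloraxis', text=cm_2[model], texttemplate="%{text}"),
                row=pos[0], col=pos[1])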

fig.update_layout(title="Experiment 2 - Term Frequency-Inverse Document Frequency", coloraxis=dict(colorscale='Burgyl'), showlegend=False, plot_bgcolor="rgba(0,0,0,0)", paper_bgcolor="rgba(0,0,0,0)", font_color="white", width=700)
fig.show()

## Custom input test

In [None]:
test_abstract = ["This study investigates the efficacy of transformer-based models in generating coherent and contextually relevant research abstracts. We analyze a corpus of AI-generated abstracts against a benchmark dataset of human-authored abstracts, focusing on metrics such as lexical diversity, semantic coherence, and adherence to established academic writing conventions. Preliminary findings indicate that while AI models can produce syntactically sound abstracts, challenges remain in capturing the nuanced argumentation and critical insights characteristic of human scholarship. We explore potential avenues for refining these models to bridge this gap, including the integration of domain-specific knowledge and improved contextual understanding."]
results = []

for model in models:
  prediction_1 = experiment_1[model].predict(test_abstract)[0]
  prediction_2 = experiment_2[model].predict(test_abstract)[0]
  results.append([model, prediction_1, prediction_2])

df = pd.DataFrame(results, columns=['Model', 'Experiment 1', 'Experiment 2'])
df.replace({0: 'Human', 1: 'AI'}, inplace=True)
display(df)

| | Model | Experiment 1 | Experiment 2 |
|---|---|---|---|
| 0 | RandomForest | AI | AI |
| 1 | NaiveBayes | AI | AI |
| 2 | LogisticRegression | AI | AI |
| 3 | SVC | AI | AI |
