# Experiments 6 & 7 - Dataset 1

This notebook runs experiments 6 and 7 on the dataset [GPT vs. Human: A Corpus of Research Abstracts](https://www.kaggle.com/datasets/heleneeriksen/gpt-vs-human-a-corpus-of-research-abstracts).

Each experiment trains three ML models:
- [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)

The Multinomial Naive Bayes model was discarded because it is incompatible with embeddings: it expects non-negative features such as counts, while embedding vectors can contain negative values.
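
The incompatibility can be reproduced on toy data. The sketch below assumes a recent scikit-learn release, in which `MultinomialNB.fit` raises a `ValueError` for negative feature values. The array `X_toy` and the helper `fits_multinomial_nb` are made up for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical 2-dimensional "embeddings"; real document vectors also mix signs.
X_toy = np.array([[0.3, -0.1], [-0.2, 0.4], [0.5, 0.2], [-0.6, -0.3]])
y_toy = np.array([0, 1, 0, 1])

def fits_multinomial_nb(X, y):
    """Return True if MultinomialNB accepts the features, False otherwise."""
    try:
        MultinomialNB().fit(X, y)
        return True
    except ValueError:
        return False

negative_ok = fits_multinomial_nb(X_toy, y_toy)
shifted_ok = fits_multinomial_nb(X_toy - X_toy.min(), y_toy)
print(negative_ok, shifted_ok)
```

Shifting the features so that they are non-negative makes the call succeed, but the values are still not counts, so this is probably not a meaningful workaround.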

Each model is tuned with a [Cross-Validation Grid Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find its best hyperparameters.


The main difference between the two experiments is the embedding:
- Experiment 6 uses word2vec embeddings trained on this dataset
- Experiment 7 uses pre-trained GloVe embeddings

### Notebook setup

In [None]:
# Remember to restart session after installing gensim: Runtime > Restart session
!pip install gensim -U

Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)

In [None]:
# Pre-trained GloVe Embedding
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2025-03-20 02:38:40--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-03-20 02:38:40--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-03-20 02:38:40--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [None]:
# Core libraries
import os
import sys
import pickle

import pandas as pd
import kagglehub

import numpy as np
import torch

# NLP libraries
import nltk
from nltk.tokenize import word_tokenize

import gensim
from gensim.models import KeyedVectors

# ML libraries
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Plotting libraries
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.colors
plotly_colors = plotly.colors.qualitative.Plotly

# Tokenizer
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## Data prepping

### Dataset download

In [None]:
path = kagglehub.dataset_download("heleneeriksen/gpt-vs-human-a-corpus-of-research-abstracts")
dataset_path = os.path.join(path, "data_set.csv")

data = pd.read_csv(dataset_path)
data.drop(columns=['title', 'ai_generated'], inplace=True)

# Size of the larger class, plus 10% headroom, used as the shared y-axis limit
data_size = max(len(data[data['is_ai_generated'] == 0]), len(data[data['is_ai_generated'] == 1]))
data_size += 0.1*data_size

# Peek into the data
print("\nPeek into the dataset: heleneeriksen/gpt-vs-human-a-corpus-of-research-abstracts\n")
display(data)


fig = make_subplots(rows=1, cols=2, subplot_titles=('Original Dataset', 'Balanced Dataset'),
                    horizontal_spacing=0.3)

fig.add_trace(go.Histogram(x=data['is_ai_generated'], name='Original Dataset', marker_color=[plotly_colors[2], plotly_colors[1]]), row=1, col=1)

# Remove data to balance the dataset and speed up training
data = data.drop(data[data['is_ai_generated'] == 1].sample(53).index)
data = data.drop(data[data['is_ai_generated'] == 0].sample(200).index)


fig.add_trace(go.Histogram(x=data['is_ai_generated'], name='Balanced Dataset', marker_color=[plotly_colors[2], plotly_colors[1]]), row=1, col=2)

fig.update_layout(showlegend=False, width=700, bargap=0.4,
                  plot_bgcolor="rgba(0,0,0,0)", paper_bgcolor="rgba(0,0,0,0)", font_color="white",
                  xaxis=dict(tickmode='array', tickvals=[0, 1], ticktext=['human', 'ai']), xaxis2=dict(tickmode='array', tickvals=[0, 1], ticktext=['human', 'ai']),
                  yaxis=dict(title='Count'), yaxis2=dict(title='Count'), yaxis_range=[0, data_size], yaxis2_range=[0, data_size])
fig.show()

Downloading from https://www.kaggle.com/api/v1/datasets/download/heleneeriksen/gpt-vs-human-a-corpus-of-research-abstracts?dataset_version_number=1...


100%|██████████| 1.10M/1.10M [00:00<00:00, 73.7MB/s]

Extracting files...

Peek into the dataset: heleneeriksen/gpt-vs-human-a-corpus-of-research-abstracts






| | abstract | is_ai_generated |
|---|---|---|
| 0 | Advanced electromagnetic potentials are indi... | 0 |
| 1 | This research paper investigates the question ... | 1 |
| 2 | We give an algorithm for finding network enc... | 0 |
| 3 | The paper presents an efficient centralized bi... | 1 |
| 4 | We introduce an exponential random graph mod... | 0 |
| ... | ... | ... |
| 4048 | This research paper investigates the vortex dy... | 1 |
| 4049 | Given a remarkable representation of the gen... | 0 |
| 4050 | The Veldkamp space of two-qubits is a mathemat... | 1 |
| 4051 | The equilibration of macroscopic degrees of ... | 0 |


### Data prepping for word2vec

In [None]:
# Tokenize paragraphs (and lowercase)
data["tokens"] = data["abstract"].apply(lambda x: word_tokenize(x.lower()))

# Train Word2Vec model
w2v_model = gensim.models.Word2Vec(data["tokens"], vector_size=300, window=10, min_count=40, workers=4) # Adjust parameters as needed.

def get_document_vector_w2v(tokens, model, vector_size):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if not vectors:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

data["doc_vector_w2v"] = data["tokens"].apply(lambda x: get_document_vector_w2v(x, w2v_model, w2v_model.vector_size))

# Stack document vectors into a NumPy array
X = np.stack(data["doc_vector_w2v"].values)

# Get labels
y = data["is_ai_generated"].values

In [None]:
# Split the data into training (80%) and testing (20%) sets
X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(X, y, test_size=0.2, random_state=0)

# Plot the distribution of the split
fig = go.Figure()

fig.add_trace(go.Histogram(x=y_train_w2v, name="training", marker_color='lightslategray'))
fig.add_trace(go.Histogram(x=y_test_w2v, name="testing", marker_color='crimson'))

fig.update_layout(title_text='Split Dataset', xaxis_title_text='Labels', yaxis_title_text='Count',
                  barmode='overlay', bargap=0.4,
                  plot_bgcolor="rgba(0,0,0,0)", paper_bgcolor="rgba(0,0,0,0)", font_color="white",
                  xaxis=dict(tickmode='array', tickvals=[0, 1], ticktext=['human', 'ai']),
                  xaxis2=dict(tickmode='array', tickvals=[0, 1], ticktext=['human', 'ai']),
                  width=500, height=500)
fig.show()
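
The document vectors built in the word2vec cell are a plain average of the token vectors, with a zero vector as fallback when no token is in the vocabulary. A minimal NumPy sketch of that pooling step follows. The dictionary `toy_vectors` and the function `mean_pool` are hypothetical stand-ins for a trained embedding model and for `get_document_vector_w2v`.

```python
import numpy as np

# Hypothetical 3-dimensional vocabulary standing in for a trained embedding model.
toy_vectors = {
    "graph": np.array([1.0, 0.0, 2.0]),
    "model": np.array([3.0, 2.0, 0.0]),
}

def mean_pool(tokens, vectors, vector_size):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)

doc = mean_pool(["a", "graph", "model"], toy_vectors, 3)
empty = mean_pool(["unseen", "words"], toy_vectors, 3)
print(doc)    # [2. 1. 1.]
print(empty)  # [0. 0. 0.]
```

Averaging discards word order, so two abstracts with the same words in a different order map to the same vector.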

### Data prepping for GloVe

In [None]:
# Tokenize paragraphs (and lowercase)
data["tokens"] = data["abstract"].apply(lambda x: word_tokenize(x.lower()))

# Load the pre-trained GloVe embeddings using KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("glove.6B.100d.txt", binary=False, no_header=True) # Adjust path and parameters as needed

def get_document_vector_glove(tokens, model):
    vectors = [model[word] for word in tokens if word in model]
    if not vectors:
        return np.zeros(model.vector_size)  # Use model.vector_size for dynamic dimension
    return np.mean(vectors, axis=0)

data["doc_vector_glove"] = data["tokens"].apply(lambda x: get_document_vector_glove(x, glove_model))

# Stack document vectors into a NumPy array
X = np.stack(data["doc_vector_glove"].values)

# Get labels
y = data["is_ai_generated"].values

In [None]:
# Split the data into training (80%) and testing (20%) sets
X_train_glove, X_test_glove, y_train_glove, y_test_glove = train_test_split(X, y, test_size=0.2, random_state=0)

# Plot the distribution of the split
fig = go.Figure()

fig.add_trace(go.Histogram(x=y_train_glove, name="training", marker_color='lightslategray'))
fig.add_trace(go.Histogram(x=y_test_glove, name="testing", marker_color='crimson'))

fig.update_layout(title_text='Split Dataset', xaxis_title_text='Labels', yaxis_title_text='Count',
                  barmode='overlay', bargap=0.4,
                  plot_bgcolor="rgba(0,0,0,0)", paper_bgcolor="rgba(0,0,0,0)", font_color="white",
                  xaxis=dict(tickmode='array', tickvals=[0, 1], ticktext=['human', 'ai']),
                  xaxis2=dict(tickmode='array', tickvals=[0, 1], ticktext=['human', 'ai']),
                  width=500, height=500)
fig.show()
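
The GloVe cell passes `no_header=True` because GloVe text files have no header line: each line is a token followed by its vector components, separated by spaces. The sketch below parses that layout by hand on two made-up lines. The function `read_glove_text` is hypothetical and only illustrates the file format; the notebook itself relies on `KeyedVectors.load_word2vec_format`.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a token followed by its vector.
sample = io.StringIO("the 0.1 -0.2 0.3\nof 0.4 0.5 -0.6\n")

def read_glove_text(handle):
    """Parse 'token v1 v2 ...' lines into a dict of NumPy vectors."""
    table = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        table[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return table

toy_glove = read_glove_text(sample)
print(len(toy_glove), toy_glove["the"].shape)
```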

## Methodology

### Libraries and experiments setup

In [None]:
# Cross-validation folds
cv = KFold(n_splits=5)

# Parameter grid for GridSearchCV
rf_param = {'classifier__n_estimators': [10, 50, 100, 200], 'classifier__max_depth': [None, 10, 50, 100]}
lr_param = {'classifier__C': [0.1, 0.5, 1.0, 2.0], 'classifier__max_iter': [100, 200, 300, 400]}
svc_param = {'classifier__C': [0.1, 0.5, 1.0, 2.0], 'classifier__kernel': ['linear', 'poly', 'rbf', 'sigmoid']}

# Models to use
models = {'RandomForest': RandomForestClassifier(), 'LogisticRegression': LogisticRegression(), 'SVC': SVC()}
parameters = {'RandomForest': rf_param, 'LogisticRegression': lr_param, 'SVC': svc_param}
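
Each grid above has two hyperparameters with four values each, so the search evaluates 16 candidates per model, each on 5 folds. The sketch below redoes that count with `itertools.product`. The names `rf_grid`, `n_folds` and `candidates` are illustrative only.

```python
from itertools import product

# Same shape as the random-forest grid above: two hyperparameters, four values each.
rf_grid = {
    "classifier__n_estimators": [10, 50, 100, 200],
    "classifier__max_depth": [None, 10, 50, 100],
}
n_folds = 5

# Every combination of values is one candidate.
candidates = [dict(zip(rf_grid, values)) for values in product(*rf_grid.values())]
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)  # 16 80
```

This matches the "Fitting 5 folds for each of 16 candidates, totalling 80 fits" line in the training logs. With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the whole training set.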

### Experiment 6

In [None]:
experiment_6 = dict()
predictions_6 = dict()

for model in models:
  # Wrap the classifier in a pipeline (the features are precomputed word2vec document vectors)
  experiment_6[model+'_pip'] = Pipeline([('classifier', models[model])])

  # Create the grid search model for each pipeline and parameters
  experiment_6[model] = GridSearchCV(experiment_6[model+'_pip'], parameters[model], cv=cv, scoring='accuracy', verbose=2)

  # Train & predict
  print(f"\n\tTraining the '{model}' model on word2vec embeddings...")
  experiment_6[model].fit(X_train_w2v, y_train_w2v)
  predictions_6[model] = experiment_6[model].predict(X_test_w2v)


	Training the 'RandomForest' model on word2vec embeddings...
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.4s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.4s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.4s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.3s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.4s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.7s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.8s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   2.6s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   2.2s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   2.4s
[CV] END classifier__max_depth=None, c

### Experiment 7

In [None]:
experiment_7 = dict()
predictions_7 = dict()

for model in models:
  # Wrap the classifier in a pipeline (the features are precomputed GloVe document vectors)
  experiment_7[model+'_pip'] = Pipeline([('classifier', models[model])])

  # Create the grid search model for each pipeline and parameters
  experiment_7[model] = GridSearchCV(experiment_7[model+'_pip'], parameters[model], cv=cv, scoring='accuracy', verbose=2)

  # Train & predict
  print(f"\n\tTraining the '{model}' model on GloVe embeddings...")
  experiment_7[model].fit(X_train_glove, y_train_glove)
  predictions_7[model] = experiment_7[model].predict(X_test_glove)


	Training the 'RandomForest' model on GloVe embeddings...
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.4s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.4s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.3s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.3s
[CV] END classifier__max_depth=None, classifier__n_estimators=10; total time=   0.3s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.3s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.3s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.3s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.4s
[CV] END classifier__max_depth=None, classifier__n_estimators=50; total time=   1.7s
[CV] END classifier__max_depth=None, c

## Metrics and reports

### Classification Reports

In [None]:
cm_6 = dict()
cm_7 = dict()


for model in models:
  # Print the classification report
  print(f"\tExperiment 6 - '{model}':")
  print(classification_report(y_test_w2v, predictions_6[model]))

  print(f"\tExperiment 7 - '{model}':")
  print(classification_report(y_test_glove, predictions_7[model]))

  # Compute the confusion matrices (plotted in the following cells)
  cm_6[model] = confusion_matrix(y_test_w2v,  predictions_6[model])
  cm_7[model] = confusion_matrix(y_test_glove,  predictions_7[model])

	Experiment 6 - 'RandomForest':
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       365
           1       0.99      0.99      0.99       395

    accuracy                           0.99       760
   macro avg       0.99      0.99      0.99       760
weighted avg       0.99      0.99      0.99       760

	Experiment 7 - 'RandomForest':
              precision    recall  f1-score   support

           0       0.93      0.98      0.95       365
           1       0.98      0.93      0.95       395

    accuracy                           0.95       760
   macro avg       0.95      0.95      0.95       760
weighted avg       0.95      0.95      0.95       760

	Experiment 6 - 'LogisticRegression':
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       365
           1       1.00      0.99      0.99       395

    accuracy                           0.99       760
   macro avg       0.99   
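
The per-class figures in a classification report can be derived from a confusion matrix. The sketch below does so with NumPy, following scikit-learn's convention that rows are true classes and columns are predicted classes. The counts in `cm` are made up and are not results from these experiments.

```python
import numpy as np

# Made-up counts, rows = true class, columns = predicted class (0 = human, 1 = AI).
cm = np.array([[90, 10],
               [5, 95]])

tp = np.diag(cm).astype(float)
precision = tp / cm.sum(axis=0)   # column sums: everything predicted as the class
recall = tp / cm.sum(axis=1)      # row sums: everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()
print(np.round(precision, 3), np.round(recall, 3), np.round(f1, 3), round(accuracy, 3))
```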

### Experiment 6 - Confusion matrices

In [None]:
fig = make_subplots(rows=1, cols=3, subplot_titles=('Random Forest', 'Logistic Regression', 'Support Vector'),
                    horizontal_spacing=0.2)

for model in models:
  match model:
    case 'RandomForest':
      pos = [1, 1]
    case 'LogisticRegression':
      pos = [1, 2]
    case 'SVC':
      pos = [1, 3]

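  # confusion_matrix orders classes as 0 (human), 1 (AI); rows (y) are true labels, columns (x) are predictions
  fig.add_trace(go.Heatmap(z=cm_6[model], x=['Human', 'AI'], y=['Human', 'AI'], coloraxis='coloraxis', text=cm_6[model], texttemplate="%{text}"),
                row=pos[0], col=pos[1])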

fig.update_layout(title="Experiment 6 - word2vec", coloraxis=dict(colorscale='Burgyl'), showlegend=False, plot_bgcolor="rgba(0,0,0,0)", paper_bgcolor="rgba(0,0,0,0)", font_color="white", width=900, height=325)
fig.show()

### Experiment 7 - Confusion matrices

In [None]:
fig = make_subplots(rows=1, cols=3, subplot_titles=('Random Forest', 'Logistic Regression', 'Support Vector'),
                    horizontal_spacing=0.2)

for model in models:
  match model:
    case 'RandomForest':
      pos = [1, 1]
    case 'LogisticRegression':
      pos = [1, 2]
    case 'SVC':
      pos = [1, 3]

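  # confusion_matrix orders classes as 0 (human), 1 (AI); rows (y) are true labels, columns (x) are predictions
  fig.add_trace(go.Heatmap(z=cm_7[model], x=['Human', 'AI'], y=['Human', 'AI'], coloraxis='coloraxis', text=cm_7[model], texttemplate="%{text}"),
                row=pos[0], col=pos[1])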

fig.update_layout(title="Experiment 7 - GloVe", coloraxis=dict(colorscale='Burgyl'), showlegend=False, plot_bgcolor="rgba(0,0,0,0)", paper_bgcolor="rgba(0,0,0,0)", font_color="white", width=900, height=325)
fig.show()

## Custom input test

In [None]:
test_abstract = "This study investigates the efficacy of transformer-based models in generating coherent and contextually relevant research abstracts. We analyze a corpus of AI-generated abstracts against a benchmark dataset of human-authored abstracts, focusing on metrics such as lexical diversity, semantic coherence, and adherence to established academic writing conventions. Preliminary findings indicate that while AI models can produce syntactically sound abstracts, challenges remain in capturing the nuanced argumentation and critical insights characteristic of human scholarship. We explore potential avenues for refining these models to bridge this gap, including the integration of domain-specific knowledge and improved contextual understanding."

test_abstract = word_tokenize(test_abstract.lower())
test_abstract_w2v = get_document_vector_w2v(test_abstract, w2v_model, w2v_model.vector_size)
test_abstract_w2v = np.array(test_abstract_w2v).reshape(1, -1)

test_abstract_glove = get_document_vector_glove(test_abstract, glove_model)
test_abstract_glove = np.array(test_abstract_glove).reshape(1, -1)

results = []

for model in models:
  prediction_6 = experiment_6[model].predict(test_abstract_w2v)[0]
  prediction_7 = experiment_7[model].predict(test_abstract_glove)[0]
  results.append([model, prediction_6, prediction_7])

df = pd.DataFrame(results, columns=['Model', 'Experiment 6', 'Experiment 7'])
df.replace({0: 'Human', 1: 'AI'}, inplace=True)
display(df)


| | Model | Experiment 6 | Experiment 7 |
|---|---|---|---|
| 0 | RandomForest | AI | AI |
| 1 | LogisticRegression | AI | AI |
| 2 | SVC | AI | AI |


In [None]:
import joblib

# Save the pipelines for experiment 6
for model in models:
  filename = f'{model}_pipeline_experiment_6.joblib'
  joblib.dump(experiment_6[model].best_estimator_, filename)
  print(experiment_6[model].best_estimator_)

# Save the pipelines for experiment 7
for model in models:
  filename = f'{model}_pipeline_experiment_7.joblib'
  joblib.dump(experiment_7[model].best_estimator_, filename)

print("Pipelines saved successfully!")

# Save the word2vec model
w2v_model.save("word2vec_model.model")

# Save the GloVe model
glove_model.save("glove_model.model")

print("Word2Vec and GloVe models saved successfully!")


Pipeline(steps=[('classifier', RandomForestClassifier(max_depth=10))])
Pipeline(steps=[('classifier', LogisticRegression(C=2.0))])
Pipeline(steps=[('classifier', SVC(C=2.0, kernel='linear'))])
Pipelines saved successfully!
Word2Vec and GloVe models saved successfully!
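
A saved pipeline can be restored with `joblib.load` and used for prediction without retraining. The sketch below shows the round trip on a toy pipeline. The data, the file name `toy_pipeline.joblib` and the temporary directory are assumptions for illustration, not the files written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for a fitted pipeline; the data and file name are hypothetical.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 5))
y_toy = (X_toy[:, 0] > 0).astype(int)
pipe = Pipeline([("classifier", LogisticRegression())]).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(pipe.predict(X_toy), restored.predict(X_toy))
print(same)
```

The embedding models would presumably be reloaded in the same spirit with gensim's `Word2Vec.load` and `KeyedVectors.load`, since new abstracts must be vectorised with the same embeddings before calling `predict`.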
