# CheckMaite
## Intro

After 2021, with the global deployment of the LLM-based tool ChatGPT, an observation has been made: about 80% of newly published abstracts are likely to have been generated by AI.

The goal of this report is not to judge whether this is good or bad news for research. Quality may deteriorate and abstracts may become standardized, but ChatGPT has also opened the field to countries where English is not widely spoken.

The goal of this project is to train a model that could be used to detect whether an abstract is written by a real person, or by an AI tool.

## Data preprocessing

The data used in this project is the AI-GA dataset (https://paperswithcode.com/dataset/ai-ga-ai-generated-abstracts-dataset), for which no benchmark has been created yet.

It is stored in CSV format with three columns (title, abstract, label) and contains 28663 rows.

Because the classifiers used in the following sections take vectors as input, each row must first be converted into a vector. We use an open-source **text embedding** model for that (https://huggingface.co/thenlper/gte-base). Because this embedding model only supports texts of up to 512 tokens, how to build the vectors is a first point of discussion. As a first step, we keep only the first 512 tokens of each text to build its embedding. As a second step, we will create embeddings covering all tokens and train the model on that full set. When testing that method, we will use the sliding window technique and consider a text to be AI-generated if at least one of its constituent parts is detected as AI-generated.

The inputs considered for the model are the title embedding and the abstract text embedding. In addition, if time allows, we will consider adding the lab associated with the publication, the research journal or conference, and the publication date. These will require additional processing.
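
The sliding window rule described above can be sketched in plain Python. This is a minimal illustration and not the project's implementation: the names `sliding_windows`, `is_ai_generated` and `classify_window`, as well as the stride of 256, are assumptions made for the example. In practice `classify_window` would embed the window and call a trained classifier, and the window may need to be slightly smaller than 512 if the tokenizer adds special tokens.

```python
from typing import Callable, List, Sequence


def sliding_windows(token_ids: Sequence[int],
                    window: int = 512,
                    stride: int = 256) -> List[List[int]]:
    """Split a token sequence into overlapping windows of at most `window` tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) == 0:
        return []
    if len(token_ids) <= window:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        # Stop once the current window reaches the end of the text
        if start + window >= len(token_ids):
            break
        start += stride
    return windows


def is_ai_generated(token_ids: Sequence[int],
                    classify_window: Callable[[List[int]], bool],
                    window: int = 512,
                    stride: int = 256) -> bool:
    """Flag the whole text as AI-generated if at least one window is flagged."""
    return any(classify_window(w) for w in sliding_windows(token_ids, window, stride))


# Toy usage: a stand-in "classifier" that flags any window containing token 7
tokens = list(range(10))
print(sliding_windows(tokens, window=4, stride=2))
print(is_ai_generated(tokens, lambda w: 7 in w, window=4, stride=2))
```

Note that the "at least one window" rule favors recall: a single false positive on any window flags the whole abstract, so the per-window classifier needs a low false-positive rate for this aggregation to remain useful.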

In [1]:
import torch.nn.functional as F
import torch
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
import pandas as pd
import numpy as np
import matplotlib

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


dataset_url = "dataset/ai-ga-dataset.csv"
input_texts = pd.read_csv(dataset_url, usecols=['abstract']).values.flatten().tolist()

print(input_texts[:5])

['OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35%) with upper respiratory tract infections, and 2 (5%) with bronchiolitis. Cough (82.5%), fever (75%), and malaise (58.8%) were
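
To make the role of `average_pool` concrete, here is a small worked example of masked mean pooling written with plain Python lists instead of tensors. It is only an illustration of the same arithmetic for a single text: the function name `masked_mean` and the toy values are assumptions made for the example.

```python
from typing import List


def masked_mean(hidden_states: List[List[float]],
                attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three token vectors of dimension 2; the last position is padding (mask = 0),
# so its large values must not influence the result.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean(hidden, mask))  # [2.0, 3.0]
```

The tensor version in the cell above does the same thing for a whole batch at once: padded positions are zeroed with `masked_fill`, and the sum is divided by the number of real tokens.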

In [None]:
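import os

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
model = AutoModel.from_pretrained("thenlper/gte-base")

os.makedirs('embeddings', exist_ok=True)

# Embed each abstract (truncated to its first 512 tokens) and save it
# to its own file, embeddings/embeddings{i}.pt
for i in range(len(input_texts)):
    batch_dict = tokenizer(input_texts[i], max_length=512, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():  # no gradients are needed to compute embeddings
        outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    # Write rather than append, so re-running the cell overwrites the file
    torch.save(embeddings, f'embeddings/embeddings{i}.pt')
    # Log the index of each processed row; the labels themselves are read from the CSV later
    with open('labels.txt', 'a') as f:
        f.write(f'{i}\n')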

In [None]:
input_titles = pd.read_csv(dataset_url, usecols=['title']).values.flatten().tolist()

# Embed each title and save it to embeddings/embeddings_titles{i}.pt
for i in range(len(input_titles)):
    batch_dict = tokenizer(input_titles[i], max_length=512, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():  # no gradients are needed to compute embeddings
        outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    torch.save(embeddings, f'embeddings/embeddings_titles{i}.pt')

In [None]:
# load embeddings from the file
input_titles = pd.read_csv(dataset_url, usecols=['title']).values.flatten().tolist()
embeddings = []
for i in range(0, len(input_texts)):
    embeddings.append(torch.load(f'embeddings/embeddings{i}.pt'))

embeddings_titles = []
for i in range(0, len(input_titles)):
    embeddings_titles.append(torch.load(f'embeddings/embeddings_titles{i}.pt'))

# labels

labels = pd.read_csv(dataset_url, usecols=['label']).values.flatten().tolist()



In [None]:
print(len(embeddings))
print(len(embeddings_titles))
print(len(labels))

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

input_embeddings = torch.cat(embeddings).detach().numpy()

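# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(input_embeddings, labels, test_size=0.2, random_state=42)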

In [None]:
# Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f'Random Forest accuracy: {accuracy_score(y_test, y_pred)}')

conf_mat = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 10))
sns.heatmap(conf_mat, annot=True, fmt='d', xticklabels=['H', 'AI'], yticklabels=['H', 'AI'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()


In [None]:
# SVM
svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print(f'SVM accuracy: {accuracy_score(y_test, y_pred)}')

conf_mat = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 10))
sns.heatmap(conf_mat, annot=True, fmt='d', xticklabels=['H', 'AI'], yticklabels=['H', 'AI'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

In [None]:
# XGBoost
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
print(f'XGBoost accuracy: {accuracy_score(y_test, y_pred)}')

# Display the confusion matrix

conf_mat = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 10))
sns.heatmap(conf_mat, annot=True, fmt='d', xticklabels=['H', 'AI'], yticklabels=['H', 'AI'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
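
Accuracy alone hides which kind of error a model makes, so it can help to read precision and recall for the AI class directly off a confusion matrix. The sketch below assumes the layout returned by `confusion_matrix` for binary labels (rows are actual classes, columns are predicted classes, with human first and AI second, as in the heatmaps above). The function name `binary_metrics` and the counts are hypothetical and are not results from this project.

```python
from typing import Dict, Sequence


def binary_metrics(conf_mat: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for the positive (AI) class.

    Expects a 2x2 matrix laid out as [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = conf_mat
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only:
# 90 human abstracts correctly kept, 10 wrongly flagged as AI,
# 5 AI abstracts missed, 95 AI abstracts correctly flagged.
example = [[90, 10], [5, 95]]
for name, value in binary_metrics(example).items():
    print(f"{name}: {value:.3f}")
```

For a detector like this one, precision on the AI class is the share of flagged abstracts that really were generated, while recall is the share of generated abstracts that were caught.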

In [None]:
# Train the XGBoost, random forest and SVM models on both abstract and title embeddings

# Concatenate each abstract embedding with its title embedding into a single feature vector
inputs = []
for i in range(len(embeddings)):
    inputs.append(torch.cat((embeddings[i], embeddings_titles[i]), dim=1).squeeze().detach().numpy())

# The fixed seed makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(inputs, labels, test_size=0.2, random_state=42)

# Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f'Random Forest accuracy: {accuracy_score(y_test, y_pred)}')

# SVM
svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print(f'SVM accuracy: {accuracy_score(y_test, y_pred)}')

# XGBoost
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
print(f'XGBoost accuracy: {accuracy_score(y_test, y_pred)}')
