## [**About Dataset - Kaggle**](https://www.kaggle.com/datasets/mahdimaktabdar/chatgpt-classification-dataset)

We have compiled a dataset of textual articles covering common terminology, concepts, and definitions in computer science, artificial intelligence, and cyber security. The dataset contains both human-generated text and text generated by OpenAI’s ChatGPT. Human-generated answers were collected from computer science dictionaries and encyclopedias, including "The Encyclopedia of Computer Science and Technology" and "Encyclopedia of Human-Computer Interaction".

AI-generated content in our dataset was produced by posing questions to OpenAI’s ChatGPT and manually recording the responses. A rigorous data-cleaning process was then applied to remove unwanted Unicode characters, styling, and formatting tags. To structure the dataset for binary classification, we combined the AI-generated and human-generated answers into a single column and assigned a label to each data point (Human-generated = 0 and AI-generated = 1).
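
The labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the two example sentences and the variable names are hypothetical, and only the column names `sentence` and `class` and the 0/1 label convention follow the dataset.

```python
import pandas as pd

# Hypothetical example sentences; the real dataset contains 7344 entries.
human_sentences = ["A compiler translates source code into machine code."]
ai_sentences = ["A compiler is a program that converts source code into an executable."]

# One text column plus one label column: Human-generated = 0, AI-generated = 1.
human_df = pd.DataFrame({"sentence": human_sentences, "class": 0})
ai_df = pd.DataFrame({"sentence": ai_sentences, "class": 1})

# Stack both sources into a single frame with a fresh 0..n-1 index.
combined = pd.concat([human_df, ai_df], ignore_index=True)
print(combined)
```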

The result is our sentence-level dataset (sentence_level_data.csv), which contains 7344 entries in total (4008 AI-generated and 3336 human-generated).

We would appreciate it if you cited the following article when using this dataset in a scientific publication:

Maktab Dar Oghaz, M., Dhame, K., Singaram, G., & Babu Saheer, L. (2023). Detection and Classification of ChatGPT Generated Contents Using Deep Transformer Models. IEEE Access.
https://www.techrxiv.org/articles/preprint/Detection_and_Classification_of_ChatGPT_Generated_Contents_Using_Deep_Transformer_Models/23895951

In [13]:
import pandas as pd
import random
import os
from sklearn.model_selection import train_test_split

In [14]:
# Convert a string to an integer
def nome_para_inteiro(nome):
    nome = nome.upper()  # Convert to uppercase so that case does not matter
    valor_inteiro = 0

    for letra in nome:
        # Only count letters of the alphabet (A to Z)
        if 'A' <= letra <= 'Z':
            valor_inteiro += ord(letra) - ord('A') + 1  # 'A' is worth 1, 'B' is 2, ..., 'Z' is 26

    return valor_inteiro
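
As a worked example of the mapping, the sketch below redefines the same letter-sum logic in compact form so that it runs on its own; `letter_sum` is a hypothetical name used only here. For the username `uai`, the letters contribute U = 21, A = 1, and I = 9, giving 31. Characters outside A to Z, such as digits or underscores, add nothing.

```python
def letter_sum(name):
    """Sum the alphabet positions (A=1, ..., Z=26) of the ASCII letters in name."""
    return sum(ord(ch) - ord("A") + 1 for ch in name.upper() if "A" <= ch <= "Z")

print(letter_sum("uai"))     # 21 + 1 + 9 = 31
print(letter_sum("UAI"))     # same value: case is ignored
print(letter_sum("uai_01"))  # same value: digits and underscores are skipped
```

Because the value is later used as `random_state`, usernames with the same letter sum (for example, anagrams of each other) would receive the same sample and split.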

In [15]:
os.listdir()

['Airline Passenger Reviews.csv',
 'Cria base de dados Treino e Teste - QUARTETO.ipynb',
 'Cria base de dados Treino e Teste - TRIO.ipynb',
 'dados_teste_TRIO_silasapb.csv',
 'dados_treino_TRIO_silasapb.csv',
 'Projeto1 Template_vf.ipynb',
 'sentence_level_data.csv']

In [16]:
# Read the data file and drop its first column
dados = pd.read_csv('sentence_level_data.csv', encoding='utf-8-sig').iloc[:,1:]

# Human-generated = 0 and AI-generated = 1
dados.loc[dados['class']==0,'target'] = 'HG'
dados.loc[dados['class']==1,'target'] = 'AI'
dados.drop(columns=['class'], inplace=True)
dados

Unnamed: 0,sentence,target
0,NLP is a multidisciplinary field that draws fr...,HG
1,"In terms of linguistics, a program must be abl...",HG
2,Of course each language has its own forms of a...,HG
3,Programs can use several strategies for dealin...,HG
4,As formidable as the task of extracting the co...,HG
...,...,...
7339,This involves minimizing the number of registe...,AI
7340,Instruction Scheduling: The compiler reorders ...,AI
7341,Code Emission: The compiler generates the fina...,AI
7342,Optimization: The compiler may perform additio...,AI


In [17]:
nome = input("Enter your Insper username: ")
valor_inteiro = nome_para_inteiro(nome)

In [18]:
valor_inteiro

31

In [19]:
# Shuffle the rows and keep a random sample of about 2100 of them
dados_embaralhados = dados.sample(frac=2100/dados.shape[0], random_state=valor_inteiro)
dados_embaralhados.columns

Index(['sentence', 'target'], dtype='object')
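
Because `frac` is set to `2100/dados.shape[0]`, the call above draws roughly 2100 rows without replacement and returns them in random order. The sketch below illustrates this on synthetic data; the frame `demo` and the seed value are assumptions for illustration. It also shows `n=2100`, a more direct way to request a fixed number of rows.

```python
import pandas as pd

# Synthetic stand-in with the same number of rows as the real dataset.
demo = pd.DataFrame({"sentence": [f"sentence {i}" for i in range(7344)]})

seed = 31  # hypothetical seed, e.g. the letter sum of a username
sample_a = demo.sample(frac=2100 / demo.shape[0], random_state=seed)
sample_b = demo.sample(frac=2100 / demo.shape[0], random_state=seed)
sample_n = demo.sample(n=2100, random_state=seed)

print(sample_a.equals(sample_b))     # same seed, same sample
print(len(sample_a), len(sample_n))  # sizes of the frac and n variants
print(sample_a.index.is_unique)      # sampling is without replacement
```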

In [20]:
# Split the data into training and test sets
X = dados_embaralhados.sentence
y = dados_embaralhados.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=valor_inteiro)

dados_treino = pd.concat([X_train, y_train],axis=1)
dados_treino.columns = ['sentence', 'target']
dados_teste = pd.concat([X_test, y_test],axis=1)
dados_teste.columns = ['sentence', 'target']
   

In [21]:
print(dados_treino.target.value_counts(True))
print(dados_teste.target.value_counts(True))

target
AI    0.536735
HG    0.463265
Name: proportion, dtype: float64
target
AI    0.544444
HG    0.455556
Name: proportion, dtype: float64
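
The class proportions above differ slightly between the two sets because the split is purely random. If matching proportions were wanted, `train_test_split` accepts a `stratify` argument. The sketch below is an alternative, not what this notebook does; it uses synthetic sentences, with class counts inferred from the proportions printed above (treat them as an assumption).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 2100 sampled rows; the counts are an assumption.
y = pd.Series(["AI"] * 1132 + ["HG"] * 968, name="target")
X = pd.Series([f"sentence {i}" for i in range(len(y))], name="sentence")

# stratify=y keeps the AI/HG ratio (nearly) identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=31, stratify=y
)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```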


In [22]:
# Save the training and test data to CSV files
nome_arquivo_treino = 'dados_treino_TRIO_'+nome+'.csv'
nome_arquivo_teste = 'dados_teste_TRIO_'+nome+'.csv'

dados_treino.to_csv(nome_arquivo_treino, index=False, header=True)
dados_teste.to_csv(nome_arquivo_teste, index=False)

print(f"Training and test data were saved to '{nome_arquivo_treino}' and '{nome_arquivo_teste}'.")

Training and test data were saved to 'dados_treino_TRIO_uai.csv' and 'dados_teste_TRIO_uai.csv'.


In [23]:
# COMMAND TO PLACE IN THE TEMPLATE Projeto1_Template.ipynb to READ the TRAINING dataset
# Adapt it by typing in the username that was used
pd.read_csv('dados_treino_TRIO_'+nome+'.csv')

Unnamed: 0,sentence,target
0,Memory allocation for binary trees is typicall...,AI
1,Seeing these potential pitfalls requires think...,HG
2,"They can be simple or complex, and can range f...",AI
3,"For example, web browsers use a cache to store...",AI
4,"Within the DMZ one can find:firewalls,choke an...",HG
...,...,...
1465,Small size and lightweight: Optical fibers are...,AI
1466,"It also has a vast standard library, which pro...",AI
1467,There are different types of affiliate marketi...,AI
1468,OGG (Ogg Vorbis): This is an open-source compr...,AI


In [24]:
# COMMAND TO PLACE IN THE TEMPLATE Projeto1_Template.ipynb to READ the TEST dataset
# Adapt it by typing in the username that was used
pd.read_csv('dados_teste_TRIO_'+nome+'.csv')

Unnamed: 0,sentence,target
0,"CMYK stands for cyan, magenta, yellow, and black.",HG
1,Optimization: Scientific computing is used to ...,AI
2,Scatter chart: A chart that plots two variable...,AI
3,Without the semicolons and braces found in C a...,HG
4,"For example, a print spooler demon looks for j...",HG
...,...,...
625,Knowledge bases: These are searchable database...,AI
626,"However, tape drives have continued to evolve,...",AI
627,"Bill Gates is an American entrepreneur, softwa...",AI
628,Software development generally follows a stand...,AI
