<a href="https://colab.research.google.com/github/Aleem246/plagiarism/blob/main/textclassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data download with the Kaggle API


In [1]:
# Upload the kaggle.json file (downloaded from the Kaggle site), which holds the API username and key
from google.colab import files

files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"dineshreddyd","key":"<redacted>"}'}

In [2]:
# Put the API credentials where the kaggle CLI looks for them and restrict access to the owner
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# download the dataset
!kaggle datasets download -d dillonwongso/ai-generated-vs-human-text-cleaned

Dataset URL: https://www.kaggle.com/datasets/dillonwongso/ai-generated-vs-human-text-cleaned
License(s): MIT
Downloading ai-generated-vs-human-text-cleaned.zip to /content
 99% 161M/162M [00:02<00:00, 80.3MB/s]
100% 162M/162M [00:02<00:00, 71.2MB/s]
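The credential setup in cell 2 can also be written in Python, which makes the step safe to re-run. The following is a minimal sketch rather than part of the original notebook; the helper name `install_kaggle_credentials` and its `dest_dir` parameter are illustrative assumptions, and the demonstration uses a placeholder key in a temporary directory.

```python
import os
import shutil
import tempfile
from pathlib import Path


def install_kaggle_credentials(src, dest_dir=None):
    """Copy a kaggle.json file into dest_dir and restrict it to the owner.

    Hypothetical helper mirroring `mkdir -p`, `cp`, and `chmod 600`.
    """
    dest_dir = Path(dest_dir) if dest_dir is not None else Path.home() / ".kaggle"
    dest_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    dest = dest_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    os.chmod(dest, 0o600)  # owner read/write only
    return dest


# Demonstration against a throwaway directory with a placeholder key.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "kaggle.json"
    src.write_text('{"username": "example", "key": "placeholder"}')
    dest = install_kaggle_credentials(src, Path(tmp) / ".kaggle")
    print(oct(dest.stat().st_mode & 0o777))
```

In the notebook itself the call would presumably be `install_kaggle_credentials("kaggle.json")`, which targets `~/.kaggle` by default.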


In [3]:
!unzip ai-generated-vs-human-text-cleaned.zip


Archive:  ai-generated-vs-human-text-cleaned.zip
  inflating: preprocessed-50k.csv    
  inflating: preprocessed.csv        
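Where the `unzip` binary is not available, the same extraction can be done with the standard library. This is a hedged sketch: `extract_archive` is a hypothetical helper, and the demonstration builds a tiny stand-in archive instead of using the Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, out_dir="."):
    """Extract every member of zip_path into out_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Demonstration on a tiny archive built on the fly (stand-in for the real download).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("sample.csv", "text,source\nhello,human\n")
    print(extract_archive(demo_zip, tmp))
```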


### Install the libraries

In [4]:
# Install the required library
!pip install transformers




## model development

### import of libraries


In [5]:
import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizerFast
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np

### load dataset

In [6]:
# Load the dataset
data = pd.read_csv("preprocessed-50k.csv")

print(data.head())

                                                text source
0  Ahh.... must take Ooraks bright that makes hur...  human
1  Overlay the default /r/Doom subreddit styles w...  human
2  Six Sigma in Pharmaceutical Business Operation...  human
3  So it's true. \n We are just an intergalactic ...  human
4  I stared blankly down at the multiple choice a...  human


In [7]:
# Encode the target as integers: human -> 1, ai -> 0
data['source'] = data['source'].map({'human': 1, 'ai': 0})
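`Series.map` silently turns any value missing from the dictionary into `NaN`, which would only surface later as a training error. The sketch below shows one way to fail early; `encode_source` is a hypothetical helper and the small `demo` frame is made-up data, not the Kaggle dataset.

```python
import pandas as pd


def encode_source(series, mapping=None):
    """Map source names to integer labels, failing loudly on unknown names."""
    mapping = mapping or {"human": 1, "ai": 0}
    encoded = series.map(mapping)
    unknown = series[encoded.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped source values: {list(unknown)}")
    return encoded.astype("int64")


# Made-up rows for illustration only.
demo = pd.DataFrame({"source": ["human", "ai", "ai"]})
demo["label"] = encode_source(demo["source"])
print(demo["label"].value_counts().to_dict())
```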

In [8]:
texts = data['text'].values
labels = data['source'].values

### initialize tokenizer

In [9]:
import os
# Let the fast (Rust-based) tokenizer process batches in parallel
os.environ["TOKENIZERS_PARALLELISM"] = "true"

In [10]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def tokenize(texts, tokenizer, batch_size=10000, max_length=256):
    """Tokenize texts in chunks of batch_size, padding/truncating each to max_length."""
    n = len(texts)
    print(f"Total texts: {n}")
    all_input_ids = []
    all_attention_masks = []

    for i in range(0, n, batch_size):
        print(f"Processing batch {i // batch_size + 1}")
        batch = texts[i:i + batch_size]
        batch_encoding = tokenizer(
            list(batch),
            max_length=max_length,
            truncation=True,
            padding='max_length',
            return_tensors="tf"
        )
        all_input_ids.append(batch_encoding['input_ids'])
        all_attention_masks.append(batch_encoding['attention_mask'])

    # Concatenate all batches into a single tensor
    return {
        "input_ids": tf.concat(all_input_ids, axis=0),
        "attention_mask": tf.concat(all_attention_masks, axis=0)
    }
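To make the effect of `truncation=True` and `padding='max_length'` concrete, here is a toy pure-Python illustration of how one sequence of token ids ends up with a fixed length and a matching attention mask. It is a simplification under stated assumptions, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the ids are made up, and a real BERT tokenizer also inserts and preserves its special tokens.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Toy illustration only: special tokens such as [CLS]/[SEP] are ignored.
    """
    kept = list(token_ids[:max_length])             # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad             # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token ids: one short sequence (padded) and one long sequence (truncated).
print(pad_or_truncate([5, 6, 7], max_length=5))
print(pad_or_truncate([5, 6, 7, 8, 9], max_length=3))
```

Because every row has the same length, the per-chunk tensors can be concatenated along axis 0, which is what `tokenize` relies on.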



### splitting the data

In [11]:
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
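The split above is random but not stratified, so it can be worth confirming that both classes are similarly represented in each part. A minimal sketch, assuming integer labels as produced earlier; `class_fractions` is a hypothetical helper and `demo_labels` is made-up data.

```python
import numpy as np


def class_fractions(labels):
    """Return {class_id: fraction of samples} for an integer label array."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): float(c) / len(labels) for v, c in zip(values, counts)}


# Made-up labels for illustration only.
demo_labels = np.array([0, 0, 1, 1, 1, 0, 1, 0])
print(class_fractions(demo_labels))
```

In the notebook this would presumably be called on `train_labels` and `test_labels`. If the fractions differ noticeably, passing `stratify=labels` to `train_test_split` is one option.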

In [12]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


### model development and compiling

In [13]:
train_encodings = tokenize(train_texts, tokenizer)
test_encodings = tokenize(test_texts, tokenizer)


Total texts: 40000
Processing batch 1
Processing batch 2
Processing batch 3
Processing batch 4
Total texts: 10000
Processing batch 1


In [14]:
train_encodings = {key: value.numpy() for key, value in train_encodings.items()}
print({key: len(value) for key, value in train_encodings.items()})


{'input_ids': 40000, 'attention_mask': 40000}


In [15]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).shuffle(len(train_labels)).batch(16)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(16)
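With a batch size of 16 and the default behaviour of keeping the last partial batch, the number of steps per epoch follows directly from the dataset sizes printed earlier (40000 training and 10000 test examples). The helper name `steps_per_epoch` below is an illustrative assumption.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches yielded when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


print(steps_per_epoch(40000, 16))  # training set -> 2500
print(steps_per_epoch(10000, 16))  # test set -> 625
```

Using `shuffle(len(train_labels))` sets the shuffle buffer to the full training set, so each epoch sees a complete reshuffle rather than a windowed one.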

In [16]:
model = TFBertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
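The loss is configured with `from_logits=True` because the classification head outputs raw scores rather than probabilities. The NumPy sketch below reproduces the underlying arithmetic (log-softmax followed by picking the true class and averaging over the batch) on made-up logits; it is an illustration of the formula, not the Keras implementation, and the function name is an assumption.

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean cross-entropy for integer labels, computed from raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class in each row, then average.
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# Made-up logits for two examples with two classes each.
demo_logits = [[2.0, -1.0], [0.0, 0.0]]
demo_labels = [0, 1]
print(sparse_categorical_crossentropy_from_logits(demo_logits, demo_labels))
```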


### model training

In [17]:
history = model.fit(
    train_dataset,
    validation_data=test_dataset,
    epochs=3
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


### evaluation and predictions

In [18]:
test_loss, test_accuracy = model.evaluate(test_dataset)
print(f"Test Accuracy: {test_accuracy}")

Test Accuracy: 0.8944000005722046


In [19]:
predictions = model.predict(test_dataset).logits
predicted_classes = np.argmax(predictions, axis=1)

print(classification_report(test_labels, predicted_classes))

              precision    recall  f1-score   support

           0       0.83      0.99      0.90      4978
           1       0.99      0.80      0.88      5022

    accuracy                           0.89     10000
   macro avg       0.91      0.89      0.89     10000
weighted avg       0.91      0.89      0.89     10000
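Reading the report with the label encoding in mind (0 = AI, 1 = human): recall of 0.99 on class 0 means almost all AI text is caught, while recall of 0.80 on class 1 means roughly one in five human texts is labelled as AI. The sketch below shows how these per-class numbers are derived from predictions, using made-up logits and labels; `precision_recall_f1` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class `positive`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up logits and labels for illustration only.
demo_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
demo_true = np.array([0, 1, 1, 1])
demo_pred = np.argmax(demo_logits, axis=1)  # highest-scoring class per row
print(precision_recall_f1(demo_true, demo_pred, positive=1))
```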



### saving the model

In [20]:
model.save_pretrained("ai_human_classifier")
tokenizer.save_pretrained("ai_human_classifier")

('ai_human_classifier/tokenizer_config.json',
 'ai_human_classifier/special_tokens_map.json',
 'ai_human_classifier/vocab.txt',
 'ai_human_classifier/added_tokens.json',
 'ai_human_classifier/tokenizer.json')
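Once the model is saved, its logits still need to be mapped back to the original class names. The following is a hedged sketch of that last step using made-up logits; `ID_TO_LABEL` and `interpret_logits` are illustrative names, with the mapping taken as the inverse of the encoding applied in cell 7.

```python
import numpy as np

# Inverse of the label encoding used earlier (human -> 1, ai -> 0).
ID_TO_LABEL = {0: "ai", 1: "human"}


def interpret_logits(logits):
    """Turn a (batch, 2) logits array into (label, probability) pairs."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    ids = probs.argmax(axis=1)
    return [(ID_TO_LABEL[int(i)], float(p[i])) for i, p in zip(ids, probs)]


# Made-up logits for two texts.
print(interpret_logits([[3.0, -1.0], [-0.5, 2.0]]))
```

In practice the logits would presumably come from reloading the saved directory with `TFBertForSequenceClassification.from_pretrained("ai_human_classifier")` and `BertTokenizerFast.from_pretrained("ai_human_classifier")`, tokenizing new text with the same `max_length=256`, and reading `.logits` from the model output.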