
In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

thedrcat_daigt_proper_train_dataset_path = kagglehub.dataset_download('thedrcat/daigt-proper-train-dataset')
tasneemmahmed_distilbert_model_pth_path = kagglehub.dataset_download('tasneemmahmed/distilbert-model-pth')

print('Data source import complete.')


In [None]:
!pip install transformers datasets torch


In [None]:
import os

import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm
from transformers import DistilBertModel, DistilBertTokenizer

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# load DAIGT dataset
data_dir = '/kaggle/input/daigt-proper-train-dataset'

# keep only CSV files so pd.read_csv is not handed unrelated files
all_files = sorted(f for f in os.listdir(data_dir) if f.endswith('.csv'))

# concatenate
df_list = []
for file in all_files:
    file_path = os.path.join(data_dir, file)
    df = pd.read_csv(file_path)
    df = df[['text', 'label']]
    df_list.append(df)

df = pd.concat(df_list, ignore_index=True)

print(df.head())

                                                text  label
0  There are alot reasons to keep our the despise...      0
1  Driving smart cars that drive by themself has ...      0
2  Dear Principal,\n\nI believe that students at ...      0
3  Dear Principal,\n\nCommunity service should no...      0
4  My argument for the development of the driverl...      0


In [None]:
len(df)

159456

In [None]:
# Sample 40,000 from each class
class_0 = df[df['label'] == 0]
class_1 = df[df['label'] == 1]

sample_size = 40000

sample_0 = class_0.sample(n=sample_size, random_state=42, replace=False)
sample_1 = class_1.sample(n=sample_size, random_state=42, replace=False)

df_sampled = pd.concat([sample_0, sample_1]).sample(frac=1, random_state=42).reset_index(drop=True)

print(df_sampled['label'].value_counts())

label
1    40000
0    40000
Name: count, dtype: int64


In [None]:
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df_sampled['text'].tolist(),
    df_sampled['label'].tolist(),
    test_size=0.2,
    stratify=df_sampled['label'],
    random_state=42
)

In [None]:
train_texts[0]

"To the Principal,\n\nI think you are trying to do the write thing by not allowing kids under a grade B participate in any after school sports even though most kids in this school have a C average. I feel that if you say you need to get a B average to play sports that would be the write thing to do. It's smart to make kids get better grades to participate in after school sports because if they really want to do the sports then there going to want to get better grades and to do well in school.\n\nIts great for the school and its great for the students to get great grades.\n\nThe students that do not agree with this policy will not earn the right to play sports, so its either get good grades and play or get bad grades and not play.\n\nYou could reduce the amount of kids that do not agree with this policy maybe if you don't make it sound like a challenge and you make it sound fun to get B' s in school You could even hold an after school sports activity for getting the highest grade in you

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
max_length = 256

In [None]:
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }
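
The effect of `truncation=True`, `padding='max_length'` and `max_length` can be illustrated without loading the tokenizer. The following is a toy sketch, not the Hugging Face implementation: `pad_or_truncate` is a hypothetical helper, and the ids 101, 102 and 0 are assumed here to be the `[CLS]`, `[SEP]` and `[PAD]` ids of `distilbert-base-uncased`.

```python
def pad_or_truncate(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Toy stand-in for tokenizer(..., truncation=True, padding='max_length')."""
    # Reserve two positions for the special tokens, then truncate the rest.
    body = token_ids[: max_len - 2]
    input_ids = [cls_id] + body + [sep_id]
    # Real tokens get attention mask 1, padding positions get 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_len - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask


# A short sequence is padded, a long one is truncated; both end up with length 8.
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_len=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every item returned by `TextDataset.__getitem__` therefore has the same length, which is what lets the default `DataLoader` collation stack items into a batch.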

In [None]:
train_dataset = TextDataset(train_texts, train_labels, tokenizer, max_len=max_length)
val_dataset = TextDataset(val_texts, val_labels, tokenizer, max_len=max_length)

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4)
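
As a quick sanity check, the split sizes and the number of batches per epoch follow from the sample size, `test_size=0.2` and `batch_size=4`. The sketch below only redoes that arithmetic in plain Python; the variable names are illustrative and are not used elsewhere in the notebook.

```python
import math

n_samples = 2 * 40_000   # 40,000 texts sampled from each class
batch_size = 4

n_val = n_samples // 5   # test_size=0.2 means one fifth goes to validation
n_train = n_samples - n_val

# Number of batches the DataLoaders yield per pass over each split.
train_steps = math.ceil(n_train / batch_size)
val_steps = math.ceil(n_val / batch_size)

print(n_train, n_val, train_steps, val_steps)
```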

### Load DistilBERT

In [None]:
class DistilBERT(nn.Module):
    def __init__(self):
        super().__init__()
        self.distilbert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.classifier = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(768, 2)  # classifier head: 768 hidden features -> 2 classes
        )
        # Freeze the base layers so that only the classifier head is trained
        for param in self.distilbert.parameters():
            param.requires_grad = False

    def forward(self, input_ids, attention_mask):
        outputs = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        # last_hidden_state has shape (batch_size, seq_length, hidden_size);
        # taking position 0 (the [CLS] token) gives (batch_size, hidden_size)
        cls_output = outputs.last_hidden_state[:, 0, :]
        return self.classifier(cls_output)
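
Because the encoder is frozen, the only weights that receive gradient updates are those of the `nn.Linear(768, 2)` head (dropout has no parameters). A minimal sketch of that count, using a hypothetical helper `linear_param_count` rather than inspecting the actual model:

```python
def linear_param_count(in_features, out_features, bias=True):
    """Number of parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


hidden_size = 768   # input width of the classifier head
num_classes = 2     # human-written vs. AI-generated

trainable = linear_param_count(hidden_size, num_classes)
print(trainable)
```

On the real model, the same figure should be obtainable with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.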

## Training

In [None]:
model = DistilBERT().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
loss_fn = torch.nn.CrossEntropyLoss()

In [None]:
def train(model, train_loader, val_loader, optimizer, loss_fn, epochs=3):
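    for epoch in range(epochs):
        # evaluate() switches the model to eval mode at the end of each epoch,
        # so training mode (and dropout) must be re-enabled every epoch
        model.train()
        total_loss = 0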
        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}", leave=False)

        for batch in progress_bar:
            optimizer.zero_grad()

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            logits = model(input_ids, attention_mask)
            loss = loss_fn(logits, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            progress_bar.set_postfix(loss=loss.item())

        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1} completed. Average Loss: {avg_loss:.4f}")

        val_acc = evaluate(model, val_loader)
        print(f"Validation Accuracy: {val_acc:.4f}\n")

In [None]:
def evaluate(model, dataloader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        progress_bar = tqdm(dataloader, desc="Evaluating", leave=False)
        for batch in progress_bar:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            logits = model(input_ids, attention_mask)
            predictions = torch.argmax(logits, dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)

    accuracy = correct / total
    return accuracy
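
`evaluate` pools the `correct` and `total` counts over all batches instead of averaging per-batch accuracies. The two agree when every batch has the same size, but they can differ when the last batch is smaller. A small sketch with made-up batch results (the numbers are hypothetical, not taken from this run):

```python
# Hypothetical per-batch results: (number correct, batch size).
batches = [(4, 4), (3, 4), (0, 1)]

# Pooled accuracy, as computed in evaluate(): total correct / total examples.
pooled = sum(c for c, _ in batches) / sum(n for _, n in batches)

# Mean of per-batch accuracies, which over-weights the small last batch.
mean_of_batches = sum(c / n for c, n in batches) / len(batches)

print(f"pooled={pooled:.4f} mean_of_batches={mean_of_batches:.4f}")
```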

In [None]:
train(model, train_loader, val_loader, optimizer, loss_fn, epochs=3)

Epoch 1 completed. Average Loss: 0.1395
Validation Accuracy: 0.9625

Epoch 2 completed. Average Loss: 0.0723
Validation Accuracy: 0.9809

Epoch 3 completed. Average Loss: 0.0599
Validation Accuracy: 0.9844

In [None]:
model_save_path = "/kaggle/working/distilbert_model.pth"
torch.save(model.state_dict(), model_save_path)
print(f"Model saved to {model_save_path}")

Model saved to /kaggle/working/distilbert_model.pth


In [None]:
tokenizer_save_path = "/kaggle/working/distilbert_tokenizer/"
tokenizer.save_pretrained(tokenizer_save_path)
print(f"Tokenizer saved to {tokenizer_save_path}")

Tokenizer saved to /kaggle/working/distilbert_tokenizer/


In [None]:
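checkpoint = {
    'epoch': 3,  # hard-coded to match epochs=3 in the train() call above
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss_fn  # stores the loss function object, not a loss value
}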

torch.save(checkpoint, "/kaggle/working/distilbert_checkpoint1.pth")
print("Training checkpoint saved.")

Training checkpoint saved.


In [None]:
def predict(text, model, tokenizer, device, max_length=256):
    model.eval()
    encoding = tokenizer(
        text,
        truncation=True,
        padding='max_length',
        max_length=max_length,
        return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        input_ids = encoding['input_ids']
        attention_mask = encoding['attention_mask']
        outputs = model(input_ids, attention_mask)
        probs = torch.softmax(outputs, dim=1)
        prediction = torch.argmax(probs, dim=1).item()

    return "AI-generated" if prediction == 1 else "Human written", probs.cpu().numpy()
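
The confidence values returned by `predict` come from applying softmax to the two logits: index 0 is the human-written class and index 1 the AI-generated class. A plain-Python sketch of that step, using made-up logits (they are not outputs of the trained model):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-2.0, 3.0]   # hypothetical [human, AI] scores for one text
probs = softmax(logits)

# argmax over the probabilities picks the predicted class index.
prediction = max(range(len(probs)), key=probs.__getitem__)
label = "AI-generated" if prediction == 1 else "Human written"

print(label, f"{probs[prediction] * 100:.4f}%")
```

Since softmax is monotonic, taking the argmax of the raw logits would select the same class; the softmax is only needed to report a probability.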

In [None]:
# No separate test split was held out; this re-evaluates the validation set.
val_acc = evaluate(model, val_loader)
print(f"Validation Accuracy: {val_acc:.4f}")

Validation Accuracy: 0.9844


In [None]:
human= """dear  Dr. mike,
i hope this mail finds you well, i wanted to ask you regarding my grades in the midterm, also i wanted to mention some concerns from the TA named Osama,
he is behaving in a very bad way, making non-sense in assigments and tasks and evaluating us in wrong criteria,
also he don't consider the time at all, so please i want you to discuss with him the validty of all of these actions,
also the midterm grade, i think that i solved well!! , how can i lose all of this grades in one time"""

ai = """Dear Dr. Mike,

I hope this email finds you well.

I recently received my grades, and I wanted to reach out for some clarification regarding my performance. I would really appreciate any insights you could share on why I received these specific marks, as understanding my mistakes will help me improve in the future.

Additionally, I have some concerns regarding the assistance provided by TA Osama. I feel that certain aspects of the grading or support might not have been as clear or fair as expected. If there’s a chance to discuss this, I would be grateful for your perspective on how this might have impacted my performance.

Thank you for your time and guidance. I look forward to your response. """

ai_pred, ai_probs = predict(ai, model, tokenizer, device)
human_pred, human_probs = predict(human, model, tokenizer, device)

print("AI Text Prediction:", ai_pred)
print(f"AI Text Confidence: {ai_probs[0][1] * 100:.4f}%")

print("\nHuman Text Prediction:", human_pred)
print(f"Human Text Confidence: {human_probs[0][0] * 100:.4f}%")

AI Text Prediction: AI-generated
AI Text Confidence: 99.2939%

Human Text Prediction: Human written
Human Text Confidence: 99.9834%


In [None]:
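# To resume training from a checkpoint:
# (weights_only=False is needed on recent PyTorch versions because the
#  checkpoint also contains a pickled loss function object)
# checkpoint = torch.load("/kaggle/working/distilbert_checkpoint1.pth",
#                         map_location=device, weights_only=False)

# model.load_state_dict(checkpoint['model_state_dict'])
# optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
# start_epoch = checkpoint['epoch']
# loss_fn = checkpoint['loss']  # this key holds the loss function, not a loss value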

In [None]:
from IPython.display import FileLink, display

display(FileLink('distilbert_model.pth'))