# SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection
- Task repository: https://github.com/mbzuai-nlp/SemEval2024-task8
- Subtask A. Binary Human-Written vs. Machine-Generated Text Classification
- Subtask B. Multi-Way Machine-Generated Text Classification: Given a full text, determine who generated it. It can be human-written or generated by a specific language model.
- SemEval workshop: https://semeval.github.io/
- Hugging Face Datasets: https://huggingface.co/datasets


In [1]:
!pip install gdown
!gdown --folder https://drive.google.com/drive/folders/1CAbb3DjrOPBNm0ozVBfhvrEh9P9rAppc

Retrieving folder contents
Processing file 1e_G-9a66AryHxBOwGWhriePYCCa4_29e subtaskA_dev_monolingual.jsonl
Processing file 123UQ92LxtHaVTbNYlmjnG1CWwD-x7wDL subtaskA_dev_multilingual.jsonl
Processing file 1HeCgnLuDoUHhP-2OsTSSC3FXRLVoI6OG subtaskA_train_monolingual.jsonl
Processing file 13-9-DakCeLFbPgCiVIU0v6_BCQx0ppz6 subtaskA_train_multilingual.jsonl
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1e_G-9a66AryHxBOwGWhriePYCCa4_29e
To: /content/SubtaskA/subtaskA_dev_monolingual.jsonl
100% 10.8M/10.8M [00:00<00:00, 23.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=123UQ92LxtHaVTbNYlmjnG1CWwD-x7wDL
To: /content/SubtaskA/subtaskA_dev_multilingual.jsonl
100% 21.2M/21.2M [00:00<00:00, 172MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1HeCgnLuDoUHhP-2OsTSSC3FXRLVoI6OG
From (redirected): https://drive.google.com/uc?id=1HeCgnLuDoUHhP-2OsTSSC3FXRLVoI6

In [6]:
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset
import pandas as pd

In [3]:
# Initialize the tokenizer and a pretrained DistilBERT with a new, untrained classification head
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# Custom Dataset class
class CustomDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

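    def __getitem__(self, idx):
        # texts and labels are pandas Series, so use positional access
        text = self.texts.iloc[idx]
        label = self.labels.iloc[idx]
        # Tokenize on access with the module-level `tokenizer`; every example is
        # truncated or padded to exactly 512 tokens
        encoding = tokenizer(text, truncation=True, padding='max_length', max_length=512, return_tensors='pt')
        # return_tensors='pt' adds a batch dimension of 1, so flatten back to 1-D
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long),
        }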
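
The `truncation=True, padding='max_length', max_length=512` arguments mean every example reaches the model as a fixed-length sequence. The following is a simplified, standard-library sketch of that idea, not the tokenizer's actual implementation: the helper `pad_or_truncate` is hypothetical, the token ids are illustrative, and the sketch ignores the special-token handling that the real tokenizer performs when it truncates.

```python
PAD_ID = 0  # assumption: the padding token has id 0, as in BERT-style uncased vocabularies


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # number of padding positions to append
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short sequence is padded; a long one is cut off at max_length.
short_ids, short_mask = pad_or_truncate([101, 7592, 2088, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(range(1, 21), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every example to 512 tokens keeps the batches uniform but spends compute on padding for short texts; padding per batch is a common alternative when training time matters.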

In [15]:
# Load the monolingual Subtask A training set and sample 1000 rows from it
df_A = pd.read_json('/content/SubtaskA/subtaskA_train_monolingual.jsonl', lines=True)
df_A = df_A.set_index('id')
df_sampled = df_A.sample(n=1000, random_state=42)  # Fixing seed for reproducibility
df_sampled

id,text,label,model,source
32026,"Babies are incredibly special, with an incredi...",1,davinci,reddit
60190,A child doesn’t change anything about their n...,0,human,wikihow
58257,Use two different alarms; one with a loud bee...,0,human,wikihow
27604,\n\nA DJ mix set is a seamless mix of songs th...,1,cohere,wikihow
98887,"Generally speaking, a computer will slow down ...",0,human,reddit
...,...,...,...,...
6848,Lucien Villa was a French comics writer. He wa...,1,dolly,wikipedia
100401,Not really.\n\nETA: Forgot to link Korean prov...,0,human,reddit
35487,\n\nCollimating a Newtonian telescope is an im...,1,davinci,wikihow
8682,"""Obsession"" is the second single by the Army ...",1,cohere,wikipedia
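
The training file is in JSON Lines format: one JSON object per line, which is why `pd.read_json` is called with `lines=True`. As a rough illustration, the sketch below parses a few records with the standard library and counts labels and generators. The three records are made up for the example; only the field names (`id`, `text`, `label`, `model`, `source`) are taken from the table above.

```python
import json
from collections import Counter

# Hypothetical records that mimic the dataset's fields; the texts are invented.
sample_jsonl = "\n".join([
    json.dumps({"id": 0, "text": "An example written by a person.",
                "label": 0, "model": "human", "source": "reddit"}),
    json.dumps({"id": 1, "text": "An example produced by a generator.",
                "label": 1, "model": "davinci", "source": "wikihow"}),
    json.dumps({"id": 2, "text": "Another generated example.",
                "label": 1, "model": "cohere", "source": "wikipedia"}),
])

# Each non-empty line is parsed independently into a dict.
records = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

# Class balance and generator distribution are worth checking before sampling.
label_counts = Counter(r["label"] for r in records)
model_counts = Counter(r["model"] for r in records)

print(label_counts)
print(model_counts)
```

On the real DataFrame, `df_sampled['label'].value_counts()` gives the same kind of summary.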


In [16]:
# Build a dataset from the full 1000-row sample
texts = df_sampled['text']
labels = df_sampled['label']
full_dataset = CustomDataset(texts, labels)
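
The cross-validation cell further down scores each fold with accuracy plus precision, recall, and F1 computed with `average='weighted'`. The sketch below reimplements that weighting in plain Python to show what it means; `weighted_prf` is a hypothetical helper written for this illustration, not the scikit-learn function, and the labels are made up. It also shows why the Recall column always matches the Accuracy column in the logs: support-weighted recall is algebraically the same quantity as accuracy.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 over the classes in y_true."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / count
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = count / n               # each class counts in proportion to its support
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return precision, recall, f1


# Made-up labels: 0 = human-written, 1 = machine-generated
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = weighted_prf(y_true, y_pred)

print(accuracy, precision, recall, f1)
```

Because weighted recall carries no information beyond accuracy, per-class or macro-averaged scores can be more informative when the classes are imbalanced.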


In [17]:
full_dataset

<__main__.CustomDataset at 0x79995152e530>

In [25]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder for storing metrics of each fold
metrics_list = []

# KFold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold = 0

for train_index, val_index in kf.split(df_sampled):
    fold += 1
    print(f"Training fold {fold}...")

    # Split into train and validation sets for the current fold
    train_texts, val_texts = texts.iloc[train_index], texts.iloc[val_index]
    train_labels, val_labels = labels.iloc[train_index], labels.iloc[val_index]

    # Create datasets
    train_dataset = CustomDataset(train_texts, train_labels)
    val_dataset = CustomDataset(val_texts, val_labels)

    # Define training arguments
    training_args = TrainingArguments(
        output_dir=f'./results/fold_{fold}',
        num_train_epochs=3,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
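        warmup_steps=500,  # exceeds the 300 optimizer steps per fold (800 rows / batch of 8 * 3 epochs), so warmup never completes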
        weight_decay=0.01,
        logging_dir=f'./logs/fold_{fold}',
        logging_steps=10,
        eval_strategy="epoch",
        report_to="none"  # Disable wandb and other reporting integrations
    )

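    # Initialize Trainer
    # NOTE: `model` is the single instance created earlier and is not reset between
    # folds, so its weights carry over. From fold 2 onward, the validation split
    # contains rows the model already trained on in earlier folds, which inflates
    # the scores of later folds.
    trainer = Trainer(
        model=model,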
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=lambda eval_pred: {
            'accuracy': accuracy_score(eval_pred.label_ids, eval_pred.predictions.argmax(-1)),
            **dict(zip(['precision', 'recall', 'f1'], precision_recall_fscore_support(
                eval_pred.label_ids, eval_pred.predictions.argmax(-1), average='weighted')[:3]))
        }
    )

    # Train the model
    trainer.train()
    print(f"Completed training for fold {fold}")

    # Evaluate and collect metrics
    eval_results = trainer.evaluate()
    metrics_list.append({
        'fold': fold,
        'accuracy': eval_results['eval_accuracy'],
        'precision': eval_results['eval_precision'],
        'recall': eval_results['eval_recall'],
        'f1_score': eval_results['eval_f1']
    })

# Convert metrics list to a DataFrame and display it
metrics_df = pd.DataFrame(metrics_list)
print(metrics_df)


Training fold 1...




Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.4649,0.446152,0.82,0.828403,0.82,0.812667
2,0.2457,0.274689,0.885,0.892533,0.885,0.886073
3,0.1461,0.734154,0.81,0.8507,0.81,0.812005


Completed training for fold 1


Training fold 2...




Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.2612,0.301696,0.91,0.925882,0.91,0.91057
2,0.0885,0.272339,0.935,0.943711,0.935,0.935405
3,0.0066,0.095099,0.985,0.985517,0.985,0.985033


Completed training for fold 2


Training fold 3...




Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.179,0.025087,0.99,0.99,0.99,0.99
2,0.0377,0.016415,0.995,0.995052,0.995,0.995001
3,0.4567,0.519773,0.915,0.926942,0.915,0.914083


Completed training for fold 3


Training fold 4...




Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.0001,0.000169,1.0,1.0,1.0,1.0
2,0.0825,0.000117,1.0,1.0,1.0,1.0
3,0.0867,0.00011,1.0,1.0,1.0,1.0


Completed training for fold 4


Training fold 5...




Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.0001,0.087495,0.98,0.98086,0.98,0.980036
2,0.0,0.004761,0.995,0.995056,0.995,0.995003
3,0.0012,0.045055,0.99,0.99022,0.99,0.99001


Completed training for fold 5


   fold  accuracy  precision  recall  f1_score
0     1     0.810   0.850700   0.810  0.812005
1     2     0.985   0.985517   0.985  0.985033
2     3     0.915   0.926942   0.915  0.914083
3     4     1.000   1.000000   1.000  1.000000
4     5     0.990   0.990220   0.990  0.990010
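
The scores climb from fold to fold and reach 1.0 in fold 4. One likely contributor is that the loop passes the same `model` object to every `Trainer`, so weights carry over between folds. The sketch below uses a plain-Python stand-in for `KFold` (contiguous folds, no shuffling; `kfold_indices` is a hypothetical helper) to show how much of each validation split a shared model has already trained on. The result holds for any partition into folds, shuffled or not.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) for contiguous, equally sized folds."""
    fold_size = n_samples // n_splits  # assumes n_samples is divisible by n_splits
    for k in range(n_splits):
        val = list(range(k * fold_size, (k + 1) * fold_size))
        val_set = set(val)
        train = [i for i in range(n_samples) if i not in val_set]
        yield train, val


seen_in_training = set()   # rows a model shared across folds has been trained on so far
leaked_fraction = []

for train_idx, val_idx in kfold_indices(n_samples=1000, n_splits=5):
    # Fraction of this fold's validation rows already used for training earlier
    leaked = sum(i in seen_in_training for i in val_idx) / len(val_idx)
    leaked_fraction.append(leaked)
    seen_in_training.update(train_idx)

print(leaked_fraction)  # → [0.0, 1.0, 1.0, 1.0, 1.0]
```

Only the first fold is a clean estimate under these conditions. Creating a fresh model for each fold, for example by calling `from_pretrained` inside the loop or by passing a factory function through the `Trainer`'s `model_init` argument, keeps the folds independent.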


In [28]:

# Define training arguments for training on the full 1000-row sample
training_args = TrainingArguments(
    output_dir='./final_model',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,  # exceeds the 375 optimizer steps of this run, so warmup never completes
    weight_decay=0.01,
    logging_dir='./logs/final',
    logging_steps=50,
    report_to="none"  # Disable wandb and other reporting integrations
)

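# Initialize Trainer for the full training
# NOTE: `model` still holds the weights fine-tuned during cross-validation, so this
# run continues training rather than starting from the pretrained checkpoint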
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=full_dataset
)

# Train the model on the entire 1000-row sample
print("Training on the full dataset...")
trainer.train()


Training on the full dataset...


Step,Training Loss
50,0.0244
100,0.0033
150,0.0719
200,0.0316
250,0.0485
300,0.0539
350,0.059


TrainOutput(global_step=375, training_loss=0.043377020438512166, metrics={'train_runtime': 96.9993, 'train_samples_per_second': 30.928, 'train_steps_per_second': 3.866, 'total_flos': 397402195968000.0, 'train_loss': 0.043377020438512166, 'epoch': 3.0})
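
The run stops at `global_step=375`, while `warmup_steps` is set to 500. The sketch below works through that arithmetic under a few assumptions: a single device, no gradient accumulation, and a linear warmup followed by linear decay (which, as far as I know, is the default schedule for `TrainingArguments`). Both helpers are hypothetical and written only for this illustration.

```python
import math


def total_steps(n_samples, batch_size, epochs):
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    return math.ceil(n_samples / batch_size) * epochs


def linear_warmup_multiplier(step, warmup_steps, n_total):
    """Learning-rate multiplier: linear warmup from 0 to 1, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (n_total - step) / max(1, n_total - warmup_steps))


final_steps = total_steps(n_samples=1000, batch_size=8, epochs=3)  # full sample
fold_steps = total_steps(n_samples=800, batch_size=8, epochs=3)    # 4/5 of the sample per fold

# Highest multiplier reached before training ends
final_peak = linear_warmup_multiplier(final_steps, 500, final_steps)
fold_peak = linear_warmup_multiplier(fold_steps, 500, fold_steps)

print(final_steps, final_peak)  # → 375 0.75
print(fold_steps, fold_peak)    # → 300 0.6
```

Under these assumptions the learning rate is still ramping up when training ends and never reaches its configured value. A smaller `warmup_steps`, or a warmup expressed as a fraction of the total steps, would let the schedule finish its warmup phase.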

# Save the model to Google Drive for future use

In [29]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [30]:
# Save the final model
model.save_pretrained("/content/drive/MyDrive/machine_detector/cross_binary")
tokenizer.save_pretrained("/content/drive/MyDrive/machine_detector/cross_binary")
print("Final model saved.")

Final model saved.


In [31]:
# Load and use the final model for prediction
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

final_model = DistilBertForSequenceClassification.from_pretrained("/content/drive/MyDrive/machine_detector/cross_binary")
final_tokenizer = DistilBertTokenizer.from_pretrained("/content/drive/MyDrive/machine_detector/cross_binary")

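# Prediction example on new data
new_texts = ["Sample text for prediction."]
encodings = final_tokenizer(new_texts, truncation=True, padding=True, return_tensors="pt")
final_model.eval()  # make sure dropout is disabled for inference
with torch.no_grad():  # gradients are not needed for prediction
    outputs = final_model(**encodings)
predictions = torch.argmax(outputs.logits, dim=1)
print(predictions)  # Class ids: 0 = human-written, 1 = machine-generated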


tensor([1])
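
The printed tensor holds only the class id. To report a confidence and a readable label as well, the logits can be passed through a softmax. The sketch below does this in plain Python on made-up logits; `LABEL_NAMES` is an assumed mapping inferred from the `label` and `model` columns of the sampled data (0 for `human`, 1 for a generator), and the logit values are hypothetical.

```python
import math

# Assumed mapping, inferred from the label/model columns of the training sample
LABEL_NAMES = {0: "human", 1: "machine"}


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-1.2, 2.3]  # hypothetical logits for one text
probs = softmax(example_logits)

# Index of the largest probability, equivalent to argmax over the logits
pred = max(range(len(probs)), key=probs.__getitem__)

print(LABEL_NAMES[pred], round(probs[pred], 3))
```

With the PyTorch outputs from the cell above, `torch.softmax(outputs.logits, dim=1)` yields the same probabilities for a whole batch.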


In [None]:
# Disconnect and release the Colab runtime once everything has been saved
from google.colab import runtime
runtime.unassign()