# Annotated Notebook for SciBERT Fine-Tuning for Text Classification

This notebook demonstrates the complete workflow to fine-tune SciBERT for detecting automatically generated research abstracts using spaCy. Each block of code is thoroughly explained in Markdown to provide clarity on its purpose and functionality.

## 1. Importing Libraries and Setting Up the Environment

In this section, we import the necessary libraries for data manipulation, model training, and evaluation. We also verify the spaCy version to ensure compatibility.

In [1]:
!pip install spacy-transformers

Collecting spacy-transformers
  Downloading spacy_transformers-1.3.8-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy-transformers)
  Downloading spacy_alignments-0.9.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.8.0->spacy-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.8.0->spacy-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.8.0->spacy-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.8.0->spacy-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py

In [2]:
# Import key libraries
import spacy
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

# Print spaCy version to verify compatibility
print('spaCy version:', spacy.__version__)

spaCy version: 3.7.5


## 2. Dataset Preparation

### 2.1 Loading and Inspecting Data

Here we download the dataset, which stores research abstracts as paired text files, and assemble it into a CSV file in which each abstract is labeled as either "human_written" or "machine_generated". We then print the shape and the first few rows of the dataframe to confirm that the data has been loaded correctly.

In [3]:
!wget https://github.com/vijini/GeneratedTextDetection/archive/refs/heads/main.zip
!unzip main

--2025-02-27 22:21:40--  https://github.com/vijini/GeneratedTextDetection/archive/refs/heads/main.zip
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/vijini/GeneratedTextDetection/zip/refs/heads/main [following]
--2025-02-27 22:21:41--  https://codeload.github.com/vijini/GeneratedTextDetection/zip/refs/heads/main
Resolving codeload.github.com (codeload.github.com)... 20.205.243.165
Connecting to codeload.github.com (codeload.github.com)|20.205.243.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘main.zip’

main.zip                [      <=>           ] 800.25K   712KB/s    in 1.1s    

2025-02-27 22:21:43 (712 KB/s) - ‘main.zip’ saved [819461]

Archive:  main.zip
ab034465f857a93212a894fe598edb749345b6ff
   creating: GeneratedTextDetection-main/
  inflatin

In [4]:
# Use the fully generated subset of the dataset
dataset_path = Path("GeneratedTextDetection-main/Dataset/FullyGenerated")

texts, labels = [], []

for original_file in dataset_path.glob("*_original.txt"):
    generated_file = original_file.with_name(original_file.stem.replace("original", "generated") + ".txt")

    texts.append(original_file.read_text(encoding="utf-8"))
    labels.append("human_written")

    texts.append(generated_file.read_text(encoding="utf-8"))
    labels.append("machine_generated")

df = pd.DataFrame({"text": texts, "label": labels})
df.to_csv("dataset.csv", index=False)

In [5]:
# Load the dataset
df = pd.read_csv("dataset.csv")
print("Dataset shape:", df.shape)
print("First few entries:\n", df.head())

Dataset shape: (200, 2)
First few entries:
                                                 text              label
0  ﻿Abstract We present the task of Automated Pun...      human_written
1  Abstract We present the task of Automated Puni...  machine_generated
2  ﻿Abstract Pre-trained language models (PLM) ha...      human_written
3  Abstract Pre-trained language models (PLM) hav...  machine_generated
4  ﻿Abstract Improving user experience of a dialo...      human_written
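
The preview above appears to show an invisible byte-order mark (U+FEFF) at the start of the human-written rows but not the machine-generated ones. If that is the case, a classifier could pick up on this artifact rather than on the writing itself. A minimal sketch of normalizing the text before saving, where `strip_bom` is a hypothetical helper name and the sample strings are illustrative:

```python
def strip_bom(text: str) -> str:
    """Remove a single leading byte-order mark (U+FEFF), if present."""
    return text[1:] if text.startswith("\ufeff") else text

# Illustrative strings, one with a leading BOM and one without
samples = ["\ufeffAbstract We present ...", "Abstract We present ..."]
cleaned = [strip_bom(s) for s in samples]
print([s.startswith("Abstract") for s in cleaned])  # [True, True]
```

Reading the files with `encoding="utf-8-sig"` instead of `"utf-8"` would have the same effect at load time.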


### 2.2 Splitting the Dataset

We split the data into training (80%) and testing (20%) sets, using stratification on the labels to maintain class balance. The resulting dataframes are saved as CSV files for later conversion into spaCy's binary format.

In [6]:
# Split the dataset into training and testing sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Create DataFrames for train and test sets
train_df = pd.DataFrame({"text": train_texts, "label": train_labels})
test_df = pd.DataFrame({"text": test_texts, "label": test_labels})

# Save the DataFrames to CSV files
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
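
As a quick sanity check on the stratified split, the label proportions of both sets can be compared. The sketch below uses only the standard library; `label_proportions` is a hypothetical helper, and the demo lists are stand-ins for `train_df["label"]` and `test_df["label"]` (sized to match an 80/20 split of 200 balanced rows).

```python
from collections import Counter

def label_proportions(labels):
    """Return the share of each label in an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Stand-in label lists for illustration
train_labels_demo = ["human_written"] * 80 + ["machine_generated"] * 80
test_labels_demo = ["human_written"] * 20 + ["machine_generated"] * 20

print(label_proportions(train_labels_demo))
print(label_proportions(test_labels_demo))
```

In the notebook, the same check would be `label_proportions(train_df["label"])` and `label_proportions(test_df["label"])`; with stratification, both should show roughly the same proportions.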

### 2.3 Converting Data to spaCy Format

We now convert the CSV data into spaCy `Doc` objects. Using the `DocBin` utility, we serialize these documents into binary files (`train.spacy` and `test.spacy`) which are used later for training.

In [7]:
import spacy
from spacy.tokens import DocBin

# Create a blank English model
nlp = spacy.blank("en")

def create_docbin(df):
    doc_bin = DocBin()
    for _, row in df.iterrows():
        doc = nlp.make_doc(row["text"])
        # Initialize document categories
        doc.cats = {"human_written": 0.0, "machine_generated": 0.0}
        doc.cats[row["label"]] = 1.0
        doc_bin.add(doc)
    return doc_bin

# Convert the CSV data to spaCy binary format
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
create_docbin(train_df).to_disk("train.spacy")
create_docbin(test_df).to_disk("test.spacy")
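
One thing to note about `create_docbin`: the assignment `doc.cats[row["label"]] = 1.0` would silently add a third category if a label were misspelled in the CSV. A small sketch of a stricter one-hot helper follows; `make_cats` and `LABELS` are hypothetical names, not part of spaCy.

```python
LABELS = ("human_written", "machine_generated")

def make_cats(label, labels=LABELS):
    """Build a one-hot category dict; raise on labels outside the known set."""
    if label not in labels:
        raise ValueError(f"Unknown label: {label!r}")
    return {name: float(name == label) for name in labels}

print(make_cats("machine_generated"))
# {'human_written': 0.0, 'machine_generated': 1.0}
```

Inside `create_docbin`, `doc.cats = make_cats(row["label"])` could replace the two assignment lines.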

## 3. Model Configuration and Training

### 3.1 Configuring the Pipeline

Our pipeline comprises two components:
- **Transformer Component:** Loads the pretrained SciBERT model (`allenai/scibert_scivocab_uncased`) to generate contextual embeddings.
- **Text Classification Component:** Uses spaCy’s `TextCatEnsemble.v2` architecture (with a bag-of-words submodel and Transformer listener) to classify texts as either human-written or machine-generated.

The configuration for these components is stored in a separate file (`config.cfg`). Make sure `config.cfg` is set up as described below before training. A starting configuration can be generated with the following command:

In [8]:
!python -m spacy init config config.cfg --lang en --pipeline transformer,textcat --optimize accuracy --gpu


✘ The provided output file already exists. To force overwriting the
config file, set the --force or -F flag.



Next, replace the transformer's base model name in `config.cfg` with `"allenai/scibert_scivocab_uncased"` and set the paths to the training and test data (`train.spacy` and `test.spacy`).
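
These edits can also be scripted. The following is a standard-library sketch, assuming the generated file contains a `[paths]` section with `train` and `dev` keys and a `[components.transformer.model]` section with a `name` key. `set_config_values` is a hypothetical helper, and `sample_cfg` is a trimmed stand-in for a real `config.cfg`. Values are written JSON-style, so strings keep their double quotes.

```python
import configparser
import io

def set_config_values(cfg_text, updates):
    """Return cfg_text with the given {(section, key): value} entries replaced."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep key case exactly as written
    parser.read_string(cfg_text)
    for (section, key), value in updates.items():
        if not parser.has_option(section, key):
            raise KeyError(f"{section}.{key} not found in config")
        parser.set(section, key, value)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()

# Trimmed stand-in for a generated config.cfg (assumed layout)
sample_cfg = """
[paths]
train = null
dev = null

[components.transformer.model]
name = "roberta-base"
"""

updated = set_config_values(sample_cfg, {
    ("paths", "train"): '"train.spacy"',
    ("paths", "dev"): '"test.spacy"',
    ("components.transformer.model", "name"): '"allenai/scibert_scivocab_uncased"',
})
print(updated)
```

Comments in the file are not preserved by `configparser`. As an alternative for the data paths, spaCy's `train` command accepts overrides such as `--paths.train train.spacy --paths.dev test.spacy`, which avoids editing the file.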

### 3.2 Training the Model

To train the model, we run spaCy's training command. This command loads the configuration file, initializes the pipeline (including the SciBERT and text classification components), and starts the training process on the GPU.

In [9]:
# Execute the training command (this cell is for documentation; run in terminal or as a shell cell)
!python -m spacy train config.cfg --output ./output --gpu-id 0

✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using GPU: 0
config.json: 100% 385/385 [00:00<00:00, 2.47MB/s]
vocab.txt: 100% 228k/228k [00:00<00:00, 29.1MB/s]
2025-02-27 22:22:16.217251: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1740694936.503654    1069 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740694936.582086    1069 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-27 22:22:17.195113: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the follo

## 4. Model Evaluation

### 4.1 Evaluating on the Test Set

After training, we evaluate the model on the test set using spaCy's evaluation command. This reports text classification metrics such as precision, recall, and F1-score.

In [10]:
# Evaluate the trained model on the test set
!python -m spacy evaluate ./output/model-best test.spacy

ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
2025-02-27 22:38:47.565097: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1740695927.586168    5185 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740695927.592345    5185 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-27 22:38:47.615325: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  self._model.load_state_dict(torch.lo
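
Because the command output above is cut off, it may help to recall how the per-class metrics are defined. The following is a minimal pure-Python sketch; the helper name and the toy labels are illustrative, and this is not spaCy's own scorer.

```python
def precision_recall_f1(true_labels, pred_labels, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only; these are not results from the notebook
y_true = ["human_written", "human_written", "machine_generated", "machine_generated"]
y_pred = ["human_written", "human_written", "human_written", "machine_generated"]
print(precision_recall_f1(y_true, y_pred, positive="machine_generated"))
```

In this toy case one of the two machine-generated texts is missed, so precision for `machine_generated` is 1.0 while recall is 0.5.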

### 4.2 Inference on a Sample Text

We load the trained model and perform inference on a sample abstract to see the predicted class probabilities.

In [11]:
# Load the best model
nlp = spacy.load("./output/model-best")

# Sample text for inference
sample_text = "Insert a sample research abstract here to test model predictions."
doc = nlp(sample_text)
print("Predicted class probabilities:", doc.cats)

# Determine predicted label based on a threshold (e.g., 0.5)
predicted_label = "human_written" if doc.cats.get("human_written", 0) > 0.5 else "machine_generated"
print("Predicted label:", predicted_label)

Predicted class probabilities: {'human_written': 0.9976346492767334, 'machine_generated': 0.002365301363170147}
Predicted label: human_written
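
The 0.5 threshold works here because there are exactly two labels whose scores sum to roughly 1. A sketch of a label-agnostic alternative that simply takes the highest-scoring category (`predict_label` is a hypothetical helper):

```python
def predict_label(cats):
    """Return the category with the highest score from a doc.cats-style dict."""
    return max(cats, key=cats.get)

# Scores copied from the output above
cats = {"human_written": 0.9976346492767334, "machine_generated": 0.002365301363170147}
print(predict_label(cats))  # human_written
```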



In [12]:
# Check predictions on two examples from the test set
human_text = test_df["text"][0]
doc = nlp(human_text)
print(doc.cats)  # scored as human_written, the expected label for this example

machine_text = test_df["text"][1]
doc = nlp(machine_text)
print(doc.cats)  # scored as machine_generated, the expected label for this example

{'human_written': 0.9976348876953125, 'machine_generated': 0.0023650594521313906}
{'human_written': 1.1979941518802661e-05, 'machine_generated': 0.9999880790710449}


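### 4.3 A Deeper Look at the Evaluation

The next cell performs an error analysis by identifying misclassified examples in the test set. It runs the model on each test document, derives a predicted label from the class scores, and records every example whose prediction differs from the true label, together with its scores.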

This analysis helps us pinpoint specific cases where the model's predictions differ from the ground truth, which can be useful for understanding errors and guiding further improvements.


In [13]:
# Initialize a list to collect misclassified examples
misclassified = []

# Iterate over each test document
for idx, row in test_df.iterrows():
    text = row["text"]
    true_label = row["label"]

    # Process the text with the model
    doc = nlp(text)

    # Determine predicted label based on a threshold (0.5 here)
    pred_label = "human_written" if doc.cats.get("human_written", 0) > 0.5 else "machine_generated"

    # If the prediction doesn't match the true label, store it
    if pred_label != true_label:
        misclassified.append({
            "index": idx,
            "text": text,
            "true_label": true_label,
            "predicted_label": pred_label,
            "probabilities": doc.cats  # Optionally include the probability scores
        })

# Report the results
print(f"Found {len(misclassified)} misclassified examples.\n")
for example in misclassified:
    print(f"Index: {example['index']}")
    print(f"True Label: {example['true_label']} | Predicted Label: {example['predicted_label']}")
    print(f"Probabilities: {example['probabilities']}")
    print(f"Text: {example['text']}")
    print("-" * 80)

Found 4 misclassified examples.

Index: 8
True Label: machine_generated | Predicted Label: human_written
Probabilities: {'human_written': 0.9976348876953125, 'machine_generated': 0.002365087391808629}
Text: Abstract Machine translation (MT) system aims to translate source language into target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One fac- tor that significantly affects the performance of NMT is the availability of high-quality paral- lel corpora. However, high-quality parallel cor- pora concerning Korean are relatively scarce compared to those associated with other high- resource languages, such as German or Italian. To address this problem, AI Hub recently re- leased seven types of parallel corpora for Ko- rean. In this study, we conduct an in-depth ver- ification of the quality of corresponding par- allel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant ex- periments. LIWC is a word-counting software prog

* The model misclassified only machine-generated texts, which suggests that machine-generated abstracts are harder to detect than human-written ones.

* The model is also highly confident in these wrong predictions, which is worth investigating in future work.
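
Both observations can be checked numerically from the `misclassified` list. The following sketch counts errors per (true, predicted) pair and averages the winning score; `summarize_errors` is a hypothetical helper, and the demo records are made up to match the shape of the list built above.

```python
from collections import Counter

def summarize_errors(misclassified):
    """Count errors per (true, predicted) pair and average the winning score."""
    pair_counts = Counter(
        (m["true_label"], m["predicted_label"]) for m in misclassified
    )
    confidences = [max(m["probabilities"].values()) for m in misclassified]
    mean_conf = sum(confidences) / len(confidences) if confidences else 0.0
    return pair_counts, mean_conf

# Made-up records in the same shape as the `misclassified` list
demo = [
    {"true_label": "machine_generated", "predicted_label": "human_written",
     "probabilities": {"human_written": 0.99, "machine_generated": 0.01}},
    {"true_label": "machine_generated", "predicted_label": "human_written",
     "probabilities": {"human_written": 0.97, "machine_generated": 0.03}},
]
counts, mean_conf = summarize_errors(demo)
print(counts)
print(round(mean_conf, 2))
```

In the notebook, `summarize_errors(misclassified)` would give the same summary for the actual errors.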


### 4.4 Dependency Tree Depth Analysis

This block of code calculates and visualizes the **average dependency tree depth** for sentences within documents. The dependency tree depth is a measure of the syntactic complexity of a sentence. In a dependency tree, each word is connected to its dependents (i.e., the words it governs). The depth of this tree represents the number of layers of dependencies from the root word to the most distant leaf node. A higher tree depth can indicate a more complex sentence structure, which might be a useful feature for distinguishing between human-written and machine-generated text.

#### Code Breakdown:

1. **Recursive Function `tree_depth`:**
   - **Purpose:** Computes the depth of the dependency tree starting from a given token.
   - **Mechanism:**  
     - It recursively examines the children of the token.  
     - If a token has no children, its depth is `1`.  
     - Otherwise, it returns `1` plus the maximum depth among all its children.

2. **Function `average_tree_depth`:**
   - **Purpose:** Calculates the average dependency tree depth for all sentences in a document.
   - **Mechanism:**  
     - It iterates over each sentence in the document (using `doc.sents`).  
     - For each sentence, it identifies the root token (the token whose head is itself) and computes its tree depth using the `tree_depth` function.  
     - The function returns the average depth across all sentences in the document.

3. **Sentence Segmentation:**
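   - The `sentencizer` component is added to the pipeline (`nlp.add_pipe("sentencizer")`) so that each document is segmented into sentences. Note that the dependency arcs themselves come from a parser component; in a pipeline without one, such as the transformer and text classification pipeline trained above, tokens carry no dependency arcs and the computed depths are not meaningful.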

4. **Data Collection and Plotting:**
   - For each test document, the code computes the average dependency tree depth and stores these values in a dictionary (`depths_data`) categorized by the document label (either `"human_written"` or `"machine_generated"`).
   - The collected depths are then transformed into a DataFrame (`depths_df`) and visualized using a boxplot. This plot compares the distribution of average tree depths across the two categories.
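
To make the depth definition concrete, here is a toy version that works on plain head indices instead of spaCy tokens. It follows the same convention as the description above, where the root is the token whose head is itself; `depth_from_heads` and the hand-written head list are illustrative only.

```python
def depth_from_heads(heads):
    """Depth of a dependency tree given each token's head index.

    The root is the token whose head index equals its own index.
    """
    children = {i: [] for i in range(len(heads))}
    root = None
    for i, head in enumerate(heads):
        if head == i:
            root = i
        else:
            children[head].append(i)

    def depth(node):
        # A leaf has depth 1; otherwise 1 plus the deepest child
        return 1 + max((depth(child) for child in children[node]), default=0)

    return depth(root) if root is not None else 0

# "The cat sat": "The" attaches to "cat", "cat" attaches to "sat" (the root)
print(depth_from_heads([1, 2, 2]))  # 3
```

A flat structure in which every other token attaches directly to the root, such as `[0, 0, 0]`, has depth 2.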

#### Utility:

By comparing the average dependency tree depths of human-written versus machine-generated texts, we can explore whether one class tends to have more syntactically complex sentences than the other. Such insights can inform further refinements in our classification approach or serve as additional features for downstream tasks.

The full code block for this analysis is as follows:


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import spacy

def tree_depth(token):
    """Recursively compute the depth of the dependency tree starting from the given token."""
    children = list(token.children)
    if not children:
        return 1
    return 1 + max(tree_depth(child) for child in children)

def average_tree_depth(doc):
    """Compute average dependency tree depth for all sentences in a document."""
    depths = []
    for sent in doc.sents:
        # Identify the root of the sentence
        roots = [token for token in sent if token.head == token]
        if roots:
            depths.append(tree_depth(roots[0]))
    return sum(depths) / len(depths) if depths else 0

# Dependency arcs come from a parser, which the transformer + textcat pipeline
# trained above does not contain. Assumption: the small English pipeline is
# installed (python -m spacy download en_core_web_sm); any pipeline with a
# parser would do.
dep_nlp = spacy.load("en_core_web_sm")
# Set sentence boundaries with the sentencizer before the parser runs
dep_nlp.add_pipe("sentencizer", before="parser")

# Collect average dependency tree depth for each text by label
depths_data = {"human_written": [], "machine_generated": []}
for idx, row in test_df.iterrows():
    text = row["text"]
    label = row["label"]
    doc = dep_nlp(text)
    avg_depth = average_tree_depth(doc)
    depths_data[label].append(avg_depth)

# Create a DataFrame for plotting
data = []
for label in depths_data:
    for d in depths_data[label]:
        data.append({"Label": label, "Average_Depth": d})
depths_df = pd.DataFrame(data)

# Plot a boxplot comparing average dependency tree depth by label
plt.figure(figsize=(8, 6))
sns.boxplot(x="Label", y="Average_Depth", data=depths_df)
plt.title("Average Dependency Tree Depth by Text Label")
plt.xlabel("Text Label")
plt.ylabel("Average Tree Depth")
plt.show()

The two box plots look fairly similar, which suggests that average tree depth alone does not separate the two classes and helps explain why the classification task is hard.

## 5. Packaging and Deployment

### 5.1 Packaging the Trained Model

Once training is complete, we package the trained model into an installable distribution. The following command uses spaCy's packaging tool to generate a wheel file that can be uploaded to GitHub Releases or another hosting service. (A tar.gz archive containing our model is also included in the project's zip file, so you can use the model directly without training it first.)

```bash
python -m spacy package ./output/model-best ./my_model_package --build wheel --create-meta
```

After packaging, locate the generated wheel file (inside the `dist/` folder of the created package directory) for distribution.

In [1]:
!python -m spacy package ./output/model-best ./my_model_package --build wheel --create-meta

⚠ Generating packages without the 'build' package is deprecated and
will not be supported in the future. To install 'build': pip install build
ℹ Building package artifacts: wheel
2025-02-27 22:56:17.144210: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1740696977.164471    9627 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740696977.170539    9627 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-27 22:56:17.191226: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FM
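
To locate the built files without browsing the folders by hand, the package directory can be searched for `dist/` folders. This is a standard-library sketch; `find_artifacts` is a hypothetical helper, and the exact name of the generated subfolder depends on the model's metadata.

```python
from pathlib import Path

def find_artifacts(package_dir, patterns=("*.whl", "*.tar.gz")):
    """List built distribution files found in any dist/ folder under package_dir."""
    root = Path(package_dir)
    if not root.is_dir():
        return []
    found = []
    for dist_dir in root.rglob("dist"):
        if dist_dir.is_dir():
            for pattern in patterns:
                found.extend(sorted(dist_dir.glob(pattern)))
    return found

# Prints nothing if the directory does not exist or holds no built files
for path in find_artifacts("./my_model_package"):
    print(path)
```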

### 5.2 Installing the Packaged Model on a Remote Machine

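To install the packaged model on a remote machine, upload the package archive to GitHub Releases, then run `pip install` with the release URL on the remote machine, as in the following cell. The example uses a tar.gz archive; note that `--build wheel` produces a `.whl` file, while a tar.gz source archive is produced with `--build sdist`.

This allows users to install the model directly via pip without access to the local package folder.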

In [None]:
!pip install https://github.com/HermesNdjeng/nlp_project_ngn/releases/download/model/en_model_textcat_ngn-0.1.0.tar.gz

Collecting https://github.com/HermesNdjeng/nlp_project_ngn/releases/download/model/en_model_textcat_ngn-0.1.0.tar.gz
  Downloading https://github.com/HermesNdjeng/nlp_project_ngn/releases/download/model/en_model_textcat_ngn-0.1.0.tar.gz (415.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 415.4/415.4 MB 3.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: en_model_textcat_ngn



In [None]:
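import spacy

# Load the installed package by name
nlp = spacy.load("en_model_textcat_ngn")

# The model can now be used like any other spaCy pipeline, e.g. nlp(text).cats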

## Conclusion

This notebook provided a detailed walkthrough for fine-tuning a SciBERT model to detect automatically generated research abstracts using spaCy. We covered:

- Data loading and preparation
- Splitting the dataset and converting it to spaCy's binary format
- Configuring the pipeline with a Transformer and text classification component
- Training, evaluating, and performing inference with the model
- Packaging and deploying the trained model

Each step has been annotated to ensure transparency and reproducibility of the methodology. Feel free to adapt and expand these notes for further experiments.