# **Notebook D.** Classification Using a Transformer Model
----

One of the biggest recent developments in Natural Language Processing has come from the introduction of Transformer models (e.g. *BERT, EARNIE, RoBERTa*, etc.). The idea is that a model is trained on a very large corpus and used to create an embedding represention of words. This "raw" model can be downloaded and then fine-tuned (retrained) on our own data.

There are several ways to implement these models. Researchers who are most comfortable with Python may start with the **Transformers** library by **HuggingFace** (https://huggingface.co/transformers/). This is the most flexible approach, but it also requires effort for researchers to implement.

An alterantive is the **SimpleTransformers** library which is a wrapper for this functionality. This library contains an easy-to-use version of this transformer technique (https://simpletransformers.ai) that is similar to the sklearn commands we have used thus far.

This package is not pre-installed with colab. To do this, we need perform the following:
 - 1) run *!pip install simpletransformers* in the notebook below
 - 2) Comment out the code by putting a # in front of the line (e.g. *#!pip install simpletransformers*)
 - 3) Rerun all of the code from the top menu (or hit Ctrl+F9)




Models: https://simpletransformers.ai/docs/classification-specifics/#supported-model-types

Model Options: https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model


In [1]:
!pip install transformers==4.31.0 simpletransformers==0.64.3 --upgrade

Collecting transformers==4.31.0
  Downloading transformers-4.31.0-py3-none-any.whl.metadata (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.9/116.9 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting simpletransformers==0.64.3
  Downloading simpletransformers-0.64.3-py3-none-any.whl.metadata (42 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.3/42.3 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.31.0)
  Downloading tokenizers-0.13.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting seqeval (from simpletransformers==0.64.3)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.6/43.6 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting streamlit (from simpletransformers==0.64.3)
  Downloading streamlit-1.48

In [2]:
# Load the Simple Transformers Package for Text Classification

from simpletransformers.classification import ClassificationModel

In [3]:
# Turn of warnings, just to avoid pesky messages that might cause confusion here
# Remove when testing your own code #
import warnings
warnings.filterwarnings("ignore")

In [4]:
import logging

logging.basicConfig(level=logging.ERROR)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.ERROR)

# D.1. Preamble: Load Packages
---

In [5]:
# General Packages #
import os
import pandas as pd
import numpy as np

# TQDM to Show Progress Bars #
from tqdm import tqdm
from tqdm.notebook import tqdm as tqdm_notebook

# SKLearn libraries for splitting sample and validation
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Additional Libraries that we are using only in this notebook
import torch
import gc

In [6]:
# Turn of warnings, just to avoid pesky messages that might cause confusion here
# Remove when testing your own code #
import warnings
warnings.filterwarnings("ignore")

In [7]:
# Mount Personal Google Drive on own Machine -- You have to follow the link to log in #
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# D.2. Load Training Data ##
----------------

We are going to use the data on the Google drive. This is in a csv file, and so we are going to load the data as a dataframe, and then convert the main data (Patent Ids, Indicator for AI / Non-AI, Patent Abstract) from a Pandas DataFrame to a list (which is more easily used in later sections).

In [8]:
# Change to Working Directory with Training Data #
os.chdir("/content/drive/MyDrive/USPTO_data")

# Load Training Data #
TrainingData = pd.read_csv("./Training_Data/4K Patents - AI 20p.csv")

# Store Data in Lists for Text Classification #
IDs = np.array(TrainingData['app number'].values.tolist())
Abstract_Text = TrainingData['abstract'].values.tolist()
Classes = TrainingData['actual'].values.tolist()

# D.3. Perform Classification with Transformer Model
---

As before, we are going to go through different models and compare their performance. Recall that transformer models are pre-trained by an external entity and we are simply downloading them (pre-trained) from the web and fine tuning them our particular application.

We download these models from hugging face. We are using the simpletransformers library which allows us to automatically download and train these models, using the same basic command for different models. We simply need to specify the model architecture (e.g. BERT) and then specific model, which usually refers to the type of data it was trained on (e.g. bert-base-uncased).

In the following link you can see the possible models that can be used by simple-transformers.

* https://simpletransformers.ai/docs/classification-specifics/#supported-model-types

This refers to the model type or architecture. There might be various types of models trained for different purposes that use the same architecture (e.g. SciBERT). You can downlaod the most common models directly from huggingface:

* https://huggingface.co/transformers/pretrained_models.html

You can also download community models here:

* https://huggingface.co/models

Below we define a list of the different transformer models we are going to use. These are listed in the following order: Name (e.g. BERT), Architecture (e.g. bert), Specific Model (e.g. bert-base-uncased)


In [11]:
# ===== Stable, no-freeze rewrite =====
import os, gc, torch, numpy as np, pandas as pd
from tqdm.auto import tqdm
os.environ["TOKENIZERS_PARALLELISM"] = "false"   # 避免分词并行卡住

from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight

In [9]:
# ✅ 严格按 simpletransformers 支持的 model_type（classification-specifics）
CLASSIFIERS = [
    ["BERT",        "bert",        "bert-base-uncased"],
    ["RoBERTa",     "roberta",     "roberta-base"],
    ["DeBERTa",     "deberta",     "microsoft/deberta-base"],          # v1 架构，对应 model_type='deberta'
    ["Longformer",  "longformer",  "allenai/longformer-base-4096"],    # 长序列
    ["BigBird",     "bigbird",     "google/bigbird-roberta-base"],     # 长序列
    ["DistilBERT",  "distilbert",  "distilbert-base-uncased"],
    ["ALBERT",      "albert",      "albert-base-v2"],
    ["SciBERT",     "bert",        "allenai/scibert_scivocab_uncased"],# 属于 BERT 架构
    ["PatentBERT",  "bert",        "anferico/bert-for-patents"],       # 属于 BERT 架构
    ["BioBERT",     "bert",        "dmis-lab/biobert-v1.1"],           # 属于 BERT 架构
    ["XLNet",       "xlnet",       "xlnet-base-cased"],
    ["ELECTRA",     "electra",     "google/electra-base-discriminator"],
]


In [15]:
# ===== 配置 =====
NUM_OF_SPLITS = 2
Reweight = True

RESULTS, Classified_Values = [], []

BASE_ARGS = {
    "num_train_epochs": 1,
    "train_batch_size": 32,
    "eval_batch_size": 64,
    "max_seq_length": 512,
    "fp16": True,
    "overwrite_output_dir": True,
    "use_early_stopping": True,
    "early_stopping_patience": 1,
    "early_stopping_metric": "eval_loss",
    "reprocess_input_data": True,
    "no_save": True,
    "no_cache": True,
    "silent": False,
    "logging_steps": 20,
    "process_count": 1,
    "use_multiprocessing": False,
    "use_multiprocessing_for_evaluation": False,
    "dataloader_num_workers": 0,
}

LONG_ARGS = {
    "max_seq_length": 4096,
    "sliding_window": True,
    "stride": 0.8,
    "train_batch_size": 4,
    "eval_batch_size": 8,
}

def build_args_for(model_type, model_name):
    args = BASE_ARGS.copy()
    args["do_lower_case"] = ("uncased" in model_name.lower())
    if model_type in {"longformer", "bigbird"}:
        args.update(LONG_ARGS)
    return args

# DeBERTa 等不支持 class weights 的兼容表
try:
    from simpletransformers.classification.classification_model import MODELS_WITHOUT_CLASS_WEIGHTS_SUPPORT as _NOCW
    UNSUPPORTED_CW = set(_NOCW)
except Exception:
    UNSUPPORTED_CW = {"deberta"}  # 兜底：已知 'deberta' 不支持

def logits_to_prob(raw_outputs):
    """将 raw logits 转为正类概率，兼容 [N,2] 和 [N,1] 两种形状。"""
    import numpy as np
    raw = np.array(raw_outputs)
    if raw.ndim == 2 and raw.shape[1] == 2:
        raw = raw - raw.max(axis=1, keepdims=True)
        exp = np.exp(raw)
        probs = exp / exp.sum(axis=1, keepdims=True)
        return probs[:, 1]
    else:
        # 单logit情形或兜底
        z = raw.squeeze()
        return 1.0 / (1.0 + np.exp(-z))

# ===== 主循环：每个模型 × K 折 =====
use_cuda = torch.cuda.is_available()
for name, model_type, model_name in tqdm(CLASSIFIERS, desc="Evaluating Classifiers", leave=True):
    y_actual, y_predicted, id_s = [], [], []
    prob_pos_all = []

    kf = StratifiedKFold(n_splits=NUM_OF_SPLITS, shuffle=True, random_state=1)
    for train_i, test_i in tqdm(kf.split(Abstract_Text, Classes),
                                desc=f"{name} | Cross-Validating",
                                leave=False, total=NUM_OF_SPLITS):

        # 原来这行：
        Y = np.array(Classes)

        # >>> 在它后面立刻加上：把二维标签（one-hot/概率）统一成一维 0/1
        if Y.ndim == 2:
            if Y.shape[1] == 2:
                # 如果是 one-hot（全是0/1），用 argmax；如果是概率分数，用第二列阈值0.5
                if np.array_equal(Y, Y.astype(int)):
                    Y = Y.argmax(axis=1).astype(int)
                else:
                    Y = (Y[:, 1] >= 0.5).astype(int)
            else:
                raise ValueError(f"期望二分类标签为形如 (N,2) 的 one-hot/概率，当前形状 {Y.shape} 无法自动处理。")

        X = np.array(Abstract_Text)
        train_X, test_X = X[train_i], X[test_i]
        train_y, test_y = Y[train_i], Y[test_i]
        Train_IDs, Test_IDs = IDs[train_i], IDs[test_i]

        # simpletransformers 训练输入
        train_df = pd.DataFrame({"text": list(train_X), "labels": list(train_y)})

        # ===== 类别不平衡处理：支持权重的模型用 class weights；不支持的做过采样 =====
        weight_vec = None
        if Reweight:
            from sklearn.utils.class_weight import compute_class_weight
            cls_w = compute_class_weight(class_weight="balanced", classes=np.array([0,1]), y=train_y)
            weight_vec = cls_w.tolist()

        if Reweight and (model_type in UNSUPPORTED_CW):
            # 过采样少数类至与多数类等量
            c0 = (train_df["labels"] == 0).sum()
            c1 = (train_df["labels"] == 1).sum()
            if c0 and c1 and c0 != c1:
                if c0 < c1:
                    need = c1 - c0
                    add = train_df[train_df["labels"] == 0].sample(need, replace=True, random_state=1)
                else:
                    need = c0 - c1
                    add = train_df[train_df["labels"] == 1].sample(need, replace=True, random_state=1)
                train_df = pd.concat([train_df, add], ignore_index=True)
            weight_vec = None  # 不再传 class weights，避免报错

        # 创建并训练模型（训练与预测在每一折内部完成）
        args = build_args_for(model_type, model_name)
        model = ClassificationModel(
            model_type, model_name,
            weight=weight_vec,
            args=args,
            use_cuda=use_cuda
        )
        model.train_model(train_df)

        # 预测：拿 logits → 概率
        preds, raw_outputs = model.predict(list(test_X))
        prob_pos = logits_to_prob(raw_outputs)

        # 累计
        id_s.extend(list(Test_IDs))
        y_actual.extend(list(test_y))
        y_predicted.extend(list(preds))
        prob_pos_all.extend(list(prob_pos))

        gc.collect(); torch.cuda.empty_cache()

    # ===== 五折汇总指标 =====
    Share = np.round(np.mean(y_predicted), 3)
    Accuracy = accuracy_score(y_actual, y_predicted)
    ROC = roc_auc_score(y_actual, prob_pos_all)  # 用概率算 AUC
    Precision = precision_score(y_actual, y_predicted, zero_division=0)
    Recall = recall_score(y_actual, y_predicted, zero_division=0)
    F1 = f1_score(y_actual, y_predicted, zero_division=0)

    tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
    TPR = round(tp/(tp+fn), 3)
    FNR = round(fn/(tp+fn), 3)
    FPR = round(fp/(fp+tn), 3)
    TNR = round(tn/(tn+fp), 3)

    RESULTS.append([name, Share, TPR, FNR, FPR, TNR,
                    round(Accuracy,3), round(ROC,3),
                    round(Precision,3), round(Recall,3), round(F1,3)])

    Classified_Values.append(list(zip(len(id_s)*[name], id_s, y_actual, y_predicted)))


Evaluating Classifiers:   0%|          | 0/12 [00:00<?, ?it/s]

BERT | Cross-Validating:   0%|          | 0/2 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/63 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/63 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

RoBERTa | Cross-Validating:   0%|          | 0/2 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/63 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/63 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

DeBERTa | Cross-Validating:   0%|          | 0/2 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Longformer | Cross-Validating:   0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/2000 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/500 [00:00<?, ?it/s]

  0%|          | 0/2000 [00:00<?, ?it/s]

  0%|          | 0/250 [00:00<?, ?it/s]

  0%|          | 0/2000 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/500 [00:00<?, ?it/s]

  0%|          | 0/2000 [00:00<?, ?it/s]

  0%|          | 0/250 [00:00<?, ?it/s]

ValueError: y should be a 1d array, got an array of shape (4000, 2) instead.

# D.4. Output Classification Results #
----


In [None]:
# Convert List to Dataframe #
RESULTS_TABLE = pd.DataFrame(RESULTS, columns = ["Name", "Share", "True-Positives",
                                                 "False-Negatives", "False-Positives",
                                                 "True-Negatives","Accuracy", "AUC",
                                                 "Precision", "Recall", "F1"] )

RESULTS_TABLE["Type"] = "Transformer"
RESULTS_TABLE = RESULTS_TABLE[["Name", "Type", "Share", "True-Positives",
                               "False-Negatives", "False-Positives",
                               "True-Negatives","Accuracy", "AUC",
                               "Precision", "Recall", "F1"]]



# Output Results #
RESULTS_TABLE.sort_values("Accuracy", ascending = False ).to_csv("./Output/Model Performance/Transformer Classification Model Performance.csv")

# Display Results -- Out of Sample (Holdout) prediction -- Sorted by Accuracy #
RESULTS_TABLE.sort_values("Accuracy", ascending = False )


In [None]:
# Output Classification Results for Training Dataset -- PREDICTED VALUES -- Out Of Sample (Holdout) Prediction #

for i in range(0,len(Classified_Values), 1):

  Temp = pd.DataFrame(  Classified_Values[i],
                        columns = ['Model', 'id', 'Actual', 'Predicted'] )

  if i == 0:
    name = Temp.head(1)['Model'][0]
    Temp = Temp[['id', 'Actual', 'Predicted']]
    Temp.columns = ['id', 'Actual', name]
    Final = Temp

  else:

    name = Temp.head(1)['Model'][0]
    Temp = Temp[['id', 'Predicted']]
    Temp.columns = ['id', name]

    Final = Final.merge(Temp, on = ['id'])

# Save Data Frame #
Final.to_csv("./Output/Classification Output/Transformer Classification Results.csv")

Delete files that were created behind the scenes by the transformer model.

In [None]:
rm -rf "./outputs"

In [None]:
rm -rf "./runs"