# **Notebook D.** Classification Using a Transformer Model
----

One of the biggest recent developments in Natural Language Processing has been the introduction of Transformer models (e.g. *BERT, ERNIE, RoBERTa*). The idea is that a model is pre-trained on a very large corpus, from which it learns embedding representations of words. This "raw" pre-trained model can be downloaded and then fine-tuned (further trained) on our own data.

There are several ways to implement these models. Researchers who are most comfortable with Python may start with the **Transformers** library by **HuggingFace** (https://huggingface.co/transformers/). This is the most flexible approach, but it also requires more implementation effort.

An alternative is the **SimpleTransformers** library, a wrapper around the Transformers library. It provides an easy-to-use interface to these models (https://simpletransformers.ai) that is similar to the sklearn commands we have used thus far.

This package is not pre-installed in Colab. To install it, we need to do the following:
 - 1) Run *!pip install simpletransformers* in the notebook below
 - 2) Comment out the code by putting a # in front of the line (e.g. *#!pip install simpletransformers*)
 - 3) Rerun all of the code from the top menu (or hit Ctrl+F9)
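
The install-then-comment-out routine above can also be guarded in code so that the cell is safe to rerun. A minimal sketch, assuming the import name matches the pip name (as it does for simpletransformers); `is_installed` is a hypothetical helper, not part of Colab or pip:

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be imported in this runtime."""
    return importlib.util.find_spec(module_name) is not None


# If this prints False, run the pip install cell once and then rerun the notebook.
print(is_installed("simpletransformers"))
```

This only checks that the package can be found; it does not check which version is installed.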




Models: https://simpletransformers.ai/docs/classification-specifics/#supported-model-types

Model Options: https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model


In [1]:
!pip install Transformers --upgrade

Collecting Transformers
  Downloading transformers-4.55.4-py3-none-any.whl.metadata (41 kB)
Downloading transformers-4.55.4-py3-none-any.whl (11.3 MB)
Installing collected packages: Transformers
  Attempting uninstall: Transformers
    Found existing installation: transformers 4.55.2
    Uninstalling transformers-4.55.2:
      Successfully uninstalled transformers-4.55.2
Successfully installed Transformers-4.55.4


In [2]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.70.5-py3-none-any.whl.metadata (43 kB)
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorboardx (from simpletransformers)
  Downloading tensorboardx-2.6.4-py3-none-any.whl.metadata (6.2 kB)
Collecting streamlit (from simpletransformers)
  Downloading streamlit-1.49.0-py3-none-any.whl.metadata (9.5 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit->sim

In [3]:
# Load the Simple Transformers Package for Text Classification

from simpletransformers.classification import ClassificationModel

In [4]:
# Turn off warnings, just to avoid pesky messages that might cause confusion here
# Remove when testing your own code #
import warnings
warnings.filterwarnings("ignore")

In [5]:
import logging

logging.basicConfig(level=logging.ERROR)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.ERROR)

# D.1. Preamble: Load Packages
---

In [6]:
# General Packages #
import os
import pandas as pd
import numpy as np

# TQDM to Show Progress Bars #
from tqdm import tqdm
from tqdm.notebook import tqdm as tqdm_notebook

# SKLearn libraries for splitting sample and validation
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Additional Libraries that we are using only in this notebook
import torch
import gc

In [7]:
# Turn off warnings, just to avoid pesky messages that might cause confusion here
# Remove when testing your own code #
import warnings
warnings.filterwarnings("ignore")

In [8]:
# Mount Personal Google Drive on own Machine -- You have to follow the link to log in #
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# D.2. Load Training Data ##
----------------

We are going to use the data on the Google Drive. The data is stored in a CSV file, so we load it as a Pandas DataFrame and then convert the main columns (patent IDs, the AI / non-AI indicator, and the patent abstract) into arrays and lists, which are easier to use in later sections.

In [9]:
# Change to Working Directory with Training Data #
os.chdir("/content/drive/MyDrive/USPTO_data")

# Load Training Data #
TrainingData = pd.read_csv("./Training_Data/4K Patents - AI 20p.csv")

# Store Data in Lists for Text Classification #
IDs = np.array(TrainingData['app number'].values.tolist())
Abstract_Text = TrainingData['abstract'].values.tolist()
Classes = TrainingData['actual'].values.tolist()
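
Before training, it can help to confirm that the three containers line up and to see how imbalanced the classes are. A minimal sketch on toy stand-in values (the real `IDs`, `Abstract_Text`, and `Classes` come from the CSV above); `summarize_labels` is a hypothetical helper:

```python
from collections import Counter


def summarize_labels(ids, texts, labels):
    """Check that the parallel lists align and return the share of each class."""
    if not (len(ids) == len(texts) == len(labels)):
        raise ValueError("IDs, abstracts, and labels must have the same length")
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


# Toy stand-ins for IDs, Abstract_Text, and Classes
toy_ids = ["A1", "A2", "A3", "A4", "A5"]
toy_texts = ["neural network", "gear assembly", "valve", "hinge", "pump"]
toy_labels = [1, 0, 0, 0, 0]

print(summarize_labels(toy_ids, toy_texts, toy_labels))   # {0: 0.8, 1: 0.2}
```

A strongly skewed share is one reason to consider class weights when fine-tuning.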

# D.3. Perform Classification with Transformer Model
---

As before, we are going to go through different models and compare their performance. Recall that transformer models are pre-trained by an external entity; we simply download them (pre-trained) from the web and fine-tune them for our particular application.

We download these models from Hugging Face. We are using the simpletransformers library, which allows us to automatically download and train these models using the same basic command for each one. We simply need to specify the model architecture (e.g. BERT) and then the specific model, which usually reflects the type of data it was trained on (e.g. bert-base-uncased).

The following link lists the model types that can be used with Simple Transformers.

* https://simpletransformers.ai/docs/classification-specifics/#supported-model-types

This refers to the model type or architecture. Several models trained for different purposes may share the same architecture (e.g. SciBERT uses the BERT architecture). You can download the most common models directly from Hugging Face:

* https://huggingface.co/transformers/pretrained_models.html

You can also download community models here:

* https://huggingface.co/models

Below we define a list of the different transformer models we are going to use. These are listed in the following order: Name (e.g. BERT), Architecture (e.g. bert), Specific Model (e.g. bert-base-uncased)
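
Because each entry packs three values in a fixed order, a named structure can make the positions harder to mix up. A small sketch, assuming only the three fields described above; `ModelSpec` is a hypothetical name, and the example entries are ones used in this notebook:

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    name: str          # display name used in the results table, e.g. "BERT"
    model_type: str    # architecture key, e.g. "bert"
    model_name: str    # checkpoint identifier, e.g. "bert-base-uncased"


specs = [
    ModelSpec("BERT", "bert", "bert-base-uncased"),
    ModelSpec("SciBERT", "bert", "allenai/scibert_scivocab_uncased"),
    ModelSpec("RoBERTa", "roberta", "roberta-base"),
]

# NamedTuples still unpack like plain lists, so a loop over
# (name, model_type, model_name) keeps working unchanged.
for name, model_type, model_name in specs:
    print(f"{name}: architecture={model_type}, checkpoint={model_name}")
```

Note that SciBERT has its own display name but reuses the `bert` architecture key.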


In [10]:
# ===== Stable, no-freeze rewrite =====
import os, gc, torch, numpy as np, pandas as pd
from tqdm.auto import tqdm
os.environ["TOKENIZERS_PARALLELISM"] = "false"   # avoid hangs from parallel tokenization

from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight

In [11]:
# Model types strictly follow those supported by simpletransformers (see classification-specifics)
CLASSIFIERS = [
    ["BERT",        "bert",        "bert-base-uncased"],
    ["RoBERTa",     "roberta",     "roberta-base"],
    ["DeBERTa",     "deberta",     "microsoft/deberta-base"],          # v1 architecture, matches model_type='deberta'
    # ["Longformer",  "longformer",  "allenai/longformer-base-4096"],    # long sequences
    ["BigBird",     "bigbird",     "google/bigbird-roberta-base"],     # long sequences
    ["DistilBERT",  "distilbert",  "distilbert-base-uncased"],
    ["ALBERT",      "albert",      "albert-base-v2"],
    ["SciBERT",     "bert",        "allenai/scibert_scivocab_uncased"],# BERT architecture
    ["PatentBERT",  "bert",        "anferico/bert-for-patents"],       # BERT architecture
    ["BioBERT",     "bert",        "dmis-lab/biobert-v1.1"],           # BERT architecture
    ["XLNet",       "xlnet",       "xlnet-base-cased"],
    ["ELECTRA",     "electra",     "google/electra-base-discriminator"],
]


In [None]:
# ========= Configuration =========
NUM_OF_SPLITS = 5
Reweight = True

RESULTS, Classified_Values = [], []

BASE_ARGS = {
    "num_train_epochs": 10,          # set high enough that early stopping has room to trigger
    "train_batch_size": 32,
    "eval_batch_size": 64,
    "max_seq_length": 512,           # same 512-token limit for all models
    "fp16": False,                   # mixed precision disabled to avoid a RuntimeError
    "overwrite_output_dir": True,

    # -- Early stopping --
    "use_early_stopping": True,
    "early_stopping_patience": 1,
    "early_stopping_metric": "eval_loss",
    "evaluate_during_training": True,        # evaluate during training (requires eval_df)
    "early_stopping_consider_epochs": True,  # check at the end of each epoch
    "early_stopping_verbose": True,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,

    # -- Stability: avoid hangs --
    "reprocess_input_data": True,
    "no_save": True,                 # only metrics are needed; do not write the model to disk
    "no_cache": True,
    "silent": False,
    "logging_steps": 20,
    "process_count": 1,
    "use_multiprocessing": False,
    "use_multiprocessing_for_evaluation": False,
    "dataloader_num_workers": 0,
}

def build_args_for(model_type, model_name):
    args = BASE_ARGS.copy()
    # Enable lowercasing only for *uncased* checkpoints
    args["do_lower_case"] = ("uncased" in model_name.lower())
    return args

def logits_to_prob(raw_outputs):
    """
    Convert raw model logits into positive-class probabilities (used for AUC).
    Handles both (N, 2) two-logit output and (N,) single-logit output.
    """
    raw = np.array(raw_outputs)
    raw = np.squeeze(raw)  # handles shapes such as (N, 1, 2)
    if raw.ndim == 2 and raw.shape[1] == 2:
        raw = raw - raw.max(axis=1, keepdims=True)  # numerically stable softmax
        exp = np.exp(raw)
        probs = exp / exp.sum(axis=1, keepdims=True)
        return probs[:, 1]
    elif raw.ndim == 1:
        z = raw.squeeze()
        return 1.0 / (1.0 + np.exp(-z))            # sigmoid
    else:
        raise ValueError(f"Unexpected raw_outputs shape: {raw.shape}")

# -- Ensure labels are a 1D array of 0/1 (if they already are, this only asserts) --
Y0 = np.asarray(Classes)
assert Y0.ndim == 1, f"Labels must be 1D; got shape {Y0.shape}"
u = np.unique(Y0)
assert set(u).issubset({0, 1}), f"Labels must be 0/1; got values {u}"
LABELS_1D = Y0.astype(int)

# ========= Main loop: each model × K folds =========
use_cuda = torch.cuda.is_available()

for name, model_type, model_name in tqdm(CLASSIFIERS, desc="Evaluating Classifiers", leave=True):
    y_actual, y_predicted, id_s = [], [], []
    prob_pos_all = []

    kf = StratifiedKFold(n_splits=NUM_OF_SPLITS, shuffle=True, random_state=1)
    for train_i, test_i in tqdm(kf.split(Abstract_Text, LABELS_1D),
                                desc=f"{name} | Cross-Validating",
                                leave=False, total=NUM_OF_SPLITS):

        # -- Split the data for this fold --
        X = np.array(Abstract_Text); Y = LABELS_1D
        train_X, test_X = X[train_i], X[test_i]
        train_y, test_y = Y[train_i], Y[test_i]
        Train_IDs, Test_IDs = IDs[train_i], IDs[test_i]

        # -- DataFrame in the format Simple Transformers expects --
        train_df = pd.DataFrame({"text": list(train_X), "labels": list(train_y)})

        # -- Hold out 10% of the training fold as an internal validation set for early stopping
        #    (the outer test fold stays untouched) --
        es_train_df, es_val_df = train_test_split(
            train_df, test_size=0.10, stratify=train_df["labels"], random_state=42
        )

        # -- Class imbalance: compute class weights (used by models that support them) --
        weight_vec = None
        if Reweight:
            cls_w = compute_class_weight(class_weight="balanced",
                                         classes=np.array([0,1]),
                                         y=es_train_df["labels"])
            weight_vec = cls_w.tolist()

        # -- Create the model (if class weights are not supported, retry without them) --
        args = build_args_for(model_type, model_name)
        try:
            model = ClassificationModel(
                model_type, model_name,
                weight=weight_vec,
                args=args,
                use_cuda=use_cuda
            )
        except ValueError as e:
            if "does not currently support class weights" in str(e).lower():
                model = ClassificationModel(
                    model_type, model_name,
                    weight=None,
                    args=args,
                    use_cuda=use_cuda
                )
            else:
                raise

        # -- Train (passing eval_df enables early stopping) --
        model.train_model(es_train_df, eval_df=es_val_df)

        # -- Predict: logits -> probabilities for AUC; preds stay 0/1 for the confusion matrix etc. --
        preds, raw_outputs = model.predict(list(test_X))
        prob_pos = logits_to_prob(raw_outputs)

        # -- Accumulate this fold's results --
        id_s.extend(list(Test_IDs))
        y_actual.extend(list(test_y))
        y_predicted.extend(list(preds))
        prob_pos_all.extend(list(prob_pos))

        gc.collect(); torch.cuda.empty_cache()

    # ========= Metrics pooled across all folds (same output columns as before) =========
    y_arr = np.asarray(y_actual); p_arr = np.asarray(prob_pos_all)

    Share = np.round(np.mean(y_predicted), 3)
    Accuracy = accuracy_score(y_arr, y_predicted)
    ROC = roc_auc_score(y_arr, p_arr)                   # AUC computed from probabilities, not hard 0/1 predictions
    Precision = precision_score(y_arr, y_predicted, zero_division=0)
    Recall = recall_score(y_arr, y_predicted, zero_division=0)
    F1 = f1_score(y_arr, y_predicted, zero_division=0)

    tn, fp, fn, tp = confusion_matrix(y_arr, y_predicted).ravel()
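    # Four rates reported in the results table, each as a share of the predicted class
    # (predicted positives for TP/FP, predicted negatives for TN/FN)
    FN = np.round(fn/(tn+fn), 3)   # share of predicted negatives that are actually positive
    FP = np.round(fp/(fp+tp), 3)   # share of predicted positives that are actually negative
    TN = np.round(tn/(tn+fn), 3)   # share of predicted negatives that are actually negative
    TP = np.round(tp/(tp+fp), 3)   # share of predicted positives that are actually positive

    # For the standard rates (TPR/FNR/FPR/TNR), use instead:
    # TPR = round(tp/(tp+fn), 3)
    # FNR = round(fn/(tp+fn), 3)
    # FPR = round(fp/(tn+fp), 3)
    # TNR = round(tn/(tn+fp), 3)
    # and replace the four columns in RESULTS below with [TPR, FNR, FPR, TNR]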


    RESULTS.append([name, Share, TP, FN, FP, TN,
                    np.round(Accuracy, 3),
                    np.round(ROC, 3),
                    np.round(Precision, 3),
                    np.round(Recall, 3),
                    np.round(F1, 3)])

    Classified_Values.append(list(zip(len(id_s)*[name], id_s, y_actual, y_predicted)))
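
The AUC in the loop above is computed from positive-class probabilities rather than hard 0/1 predictions. For a model that returns two raw logits per example, that conversion is a softmax. A dependency-free sketch of the arithmetic for a single example; `positive_probability` is a hypothetical name that mirrors what `logits_to_prob` does with NumPy:

```python
import math


def positive_probability(logit_neg: float, logit_pos: float) -> float:
    """Softmax over two logits, returning the probability of the positive class."""
    m = max(logit_neg, logit_pos)          # subtract the max for numerical stability
    e_neg = math.exp(logit_neg - m)
    e_pos = math.exp(logit_pos - m)
    return e_pos / (e_neg + e_pos)


print(positive_probability(0.0, 0.0))          # equal logits give 0.5
print(positive_probability(-2.0, 3.0) > 0.5)   # True: the positive logit is larger
```

With two classes, this equals a sigmoid applied to the difference `logit_pos - logit_neg`, which is why the single-logit case can use a sigmoid directly.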


# D.4. Output Classification Results #
----


In [None]:
# Convert List to Dataframe #
RESULTS_TABLE = pd.DataFrame(RESULTS, columns = ["Name", "Share", "True-Positives",
                                                 "False-Negatives", "False-Positives",
                                                 "True-Negatives","Accuracy", "AUC",
                                                 "Precision", "Recall", "F1"] )

RESULTS_TABLE["Type"] = "Transformer"
RESULTS_TABLE = RESULTS_TABLE[["Name", "Type", "Share", "True-Positives",
                               "False-Negatives", "False-Positives",
                               "True-Negatives","Accuracy", "AUC",
                               "Precision", "Recall", "F1"]]



# Output Results #
RESULTS_TABLE.sort_values("Accuracy", ascending = False ).to_csv("./Output/Model Performance/Transformer Classification Model Performance.csv")

# Display Results -- Out of Sample (Holdout) prediction -- Sorted by Accuracy #
RESULTS_TABLE.sort_values("Accuracy", ascending = False )
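
How to read the four rate columns depends on the denominator. The training loop divides by the number of predicted positives or predicted negatives; the more common convention divides by the number of actual positives or negatives (TPR, FNR, FPR, TNR). A dependency-free sketch of the latter on toy labels, with `standard_rates` as a hypothetical helper:

```python
def standard_rates(y_true, y_pred):
    """Return TPR, FNR, FPR, and TNR for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pos, neg = tp + fn, fp + tn            # actual positives / actual negatives
    return {
        "TPR": tp / pos if pos else float("nan"),
        "FNR": fn / pos if pos else float("nan"),
        "FPR": fp / neg if neg else float("nan"),
        "TNR": tn / neg if neg else float("nan"),
    }


# Toy example: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
rates = standard_rates(y_true, y_pred)
print(rates["TPR"], rates["FNR"])   # 0.75 0.25
```

Under this convention TPR equals recall, and each pair (TPR, FNR) and (FPR, TNR) sums to 1.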


In [None]:
# Output Classification Results for Training Dataset -- PREDICTED VALUES -- Out Of Sample (Holdout) Prediction #

for i in range(0,len(Classified_Values), 1):

  Temp = pd.DataFrame(  Classified_Values[i],
                        columns = ['Model', 'id', 'Actual', 'Predicted'] )

  if i == 0:
    name = Temp.head(1)['Model'][0]
    Temp = Temp[['id', 'Actual', 'Predicted']]
    Temp.columns = ['id', 'Actual', name]
    Final = Temp

  else:

    name = Temp.head(1)['Model'][0]
    Temp = Temp[['id', 'Predicted']]
    Temp.columns = ['id', name]

    Final = Final.merge(Temp, on = ['id'])

# Save Data Frame #
Final.to_csv("./Output/Classification Output/Transformer Classification Results.csv")

Delete files that were created behind the scenes by the transformer model.

In [None]:
rm -rf "./outputs"

In [None]:
rm -rf "./runs"
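
Outside a notebook, or on a system without `rm`, the same cleanup can be done from Python. A minimal sketch using only the standard library; the `remove_dirs` helper is hypothetical, and the directory names are the two used above:

```python
import shutil
from pathlib import Path


def remove_dirs(*paths: str) -> None:
    """Delete each directory tree; directories that do not exist are skipped."""
    for p in paths:
        shutil.rmtree(Path(p), ignore_errors=True)


remove_dirs("./outputs", "./runs")
```

Note that `ignore_errors=True` also silences permission errors, so a directory that could not be deleted will not raise.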