# Patent Binary Classification: *NLP*
## By Jon Templeton

In this notebook I build a binary classifier for patents using Natural Language Processing. I fine-tune `distilbert-base-uncased`, a pretrained DistilBERT model (a smaller, distilled version of BERT) from Hugging Face. The model input is the title and abstract of each patent, concatenated into a single text column.

Steps:
1. Data Loading & Cleaning
2. Downsample Data
3. Data Preparation & Tokenization
4. Evaluation Metrics Setup
5. Model Initialization & Training Configuration
6. Model Training & Evaluation

## Data Loading & Cleaning

Load the dataset and clean it: drop duplicate rows (same title and abstract), combine the title and abstract into a single `text` column, drop the columns that are no longer needed, and remove rows with missing values.

In [1]:
import pandas as pd

# Read data from parquet file
df = pd.read_parquet("ml_dataset.parquet")

# Remove the duplicates so that we have only one row per patent
df = df.drop_duplicates(subset=['title', 'abstract'], keep='first')

# Modify data columns
df['text'] = df['title'] + ' ' + df['abstract']
df = df.drop(columns=['title', 'abstract', 'ucid', 'code', "cpc_first_4"])

# Remove the rows with missing values
df = df.dropna()
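
One detail of the cell above is worth checking: string concatenation with `+` propagates missing values, so a row that lacks either a title or an abstract ends up with a missing `text` and is then removed by `dropna()`. The sketch below illustrates this on a hypothetical three-row frame. The toy values are invented for illustration and are not taken from `ml_dataset.parquet`.

```python
import pandas as pd

# Hypothetical toy rows; the real data comes from ml_dataset.parquet.
toy = pd.DataFrame({
    "title": ["Solar panel mount", None, "Battery cooling system"],
    "abstract": ["A bracket for roof installation.", "An orphan abstract.", None],
    "labels": [0, 1, 0],
})

# Concatenating with "+" yields a missing value if either side is missing.
toy["text"] = toy["title"] + " " + toy["abstract"]
toy = toy.drop(columns=["title", "abstract"])

# dropna() therefore keeps only rows where both title and abstract were present.
cleaned = toy.dropna()
print(cleaned)
```

Only the first row survives here, since it is the only one with both fields present.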

## Downsample Data

The dataset is heavily imbalanced, with a class ratio greater than 1:700. To keep the model from being biased towards the majority class, I downsample the majority class (label 0) to the size of the minority class (label 1), which gives a balanced dataset.

In [2]:
from sklearn.utils import resample

# There is an imbalance in the dataset
# Separate majority and minority classes
df_majority = df[df['labels'] == 0]
df_minority = df[df['labels'] == 1]

# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                   replace=False,
                                   n_samples=len(df_minority),  # match minority class
                                   random_state=42)

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
df_downsampled = df_downsampled.reset_index(drop=True)
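
As a quick sanity check on the balancing step, the sketch below applies the same idea to a hypothetical toy frame with a 1:700 class ratio and confirms that both classes end up the same size. It uses `DataFrame.sample` instead of `sklearn.utils.resample`. For sampling without replacement the two serve the same purpose here, although they will not pick identical rows.

```python
import pandas as pd

# Hypothetical toy data: 2 positive rows and 1,400 negative rows (a 1:700 ratio).
toy = pd.DataFrame({
    "text": [f"patent {i}" for i in range(1402)],
    "labels": [1, 1] + [0] * 1400,
})

minority = toy[toy["labels"] == 1]
majority = toy[toy["labels"] == 0]

# Draw as many majority rows as there are minority rows, without replacement.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)

balanced = pd.concat([majority_down, minority]).reset_index(drop=True)
counts = balanced["labels"].value_counts().to_dict()
print(counts)
```

In the real run the balanced frame has 332 rows (265 train plus 67 test), which corresponds to 166 patents per class.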

## Data Preparation & Tokenization

Prepare the dataset for classification: split it 80/20 into train and test sets, then tokenize the text with `AutoTokenizer` from the Hugging Face `transformers` library. Each example is truncated or padded to exactly 64 tokens.

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

# Determine the train and test datasets
dataset = Dataset.from_pandas(df_downsampled)
train_test_dataset = dataset.train_test_split(test_size=0.2)

# Tokenization
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=64)

tokenized_dataset = train_test_dataset.map(preprocess_function, batched=True)

Output: the `map` call tokenized 265 training examples and 67 test examples.
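
To make the effect of `truncation=True`, `padding="max_length"`, and `max_length` concrete, here is a minimal sketch of fixed-length encoding. It is not the DistilBERT tokenizer: `toy_encode` is a hypothetical helper that splits on whitespace and uses made-up integer ids, and it ignores the special `[CLS]` and `[SEP]` tokens that the real tokenizer adds. It only illustrates how every example ends up with the same length and how the attention mask marks padding.

```python
def toy_encode(text, max_length=8, pad_id=0):
    """Whitespace stand-in for a tokenizer: truncate, then pad to max_length."""
    # Hypothetical ids: each new word gets the next free integer, starting at 1.
    vocab = {}
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.lower().split()]

    ids = ids[:max_length]            # truncation: drop tokens past the limit
    mask = [1] * len(ids)             # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding         # padding: fill up to the fixed length
    mask += [0] * padding             # 0 marks padding
    return {"input_ids": ids, "attention_mask": mask}

short = toy_encode("Battery cooling system")
long = toy_encode("A method and apparatus for cooling a battery pack in an electric vehicle")

print(short)
print(long)
```

The short text is padded and the long one is cut off, so both come out with 8 ids. In the notebook the limit is 64 tokens, which means any part of a title and abstract beyond that is not seen by the model.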

## Evaluation Metrics Setup

Load evaluation metrics like accuracy, precision, recall, and F1 score using the Hugging Face `evaluate` library, and define a function `compute_metrics()` for computing these metrics on the model's predictions.

In [4]:
import evaluate
import numpy as np

# Load Evaluation metrics
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    labels = p.label_ids
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "precision": precision.compute(predictions=preds, references=labels, average="binary")["precision"],
        "recall": recall.compute(predictions=preds, references=labels, average="binary")["recall"],
        "f1": f1.compute(predictions=preds, references=labels, average="binary")["f1"],
    }
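
The following sketch shows what `compute_metrics()` does on a small, made-up batch, using plain NumPy instead of the `evaluate` library. The logits and labels are invented for illustration. The formulas are the standard definitions for the positive class (label 1), which is what `average="binary"` reports by default.

```python
import numpy as np

# Made-up logits for 6 examples (columns: class 0, class 1) and their true labels.
logits = np.array([[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5],
                   [1.2, 0.3], [0.2, 0.4], [-1.0, 2.0]])
labels = np.array([0, 1, 1, 1, 0, 1])

# The predicted class is the index of the largest logit in each row.
preds = np.argmax(logits, axis=1)

tp = int(np.sum((preds == 1) & (labels == 1)))  # true positives
fp = int(np.sum((preds == 1) & (labels == 0)))  # false positives
fn = int(np.sum((preds == 0) & (labels == 1)))  # false negatives

accuracy = float(np.mean(preds == labels))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(preds, accuracy, precision, recall, f1)
```

Precision answers "of the patents predicted positive, how many really are?", while recall answers "of the truly positive patents, how many were found?". F1 is the harmonic mean of the two.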

## Model Initialization & Training Configuration

Load the pre-trained DistilBERT model `distilbert-base-uncased` with a two-label classification head, and set up the training parameters. Then instantiate the `Trainer` with the model, training arguments, datasets, and the custom metrics function.

In [5]:
# Load a pre-trained model
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    compute_metrics=compute_metrics
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
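
As a rough cross-check of this configuration, the number of training steps can be worked out from the split sizes reported by the tokenization step (265 training and 67 test examples). The sketch assumes a single device and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

train_examples, test_examples = 265, 67   # split sizes from the tokenization output
batch_size, epochs = 16, 3                # values from TrainingArguments above

# One step per batch; the last batch of each epoch may be smaller than 16.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_train_steps = steps_per_epoch * epochs
eval_batches = math.ceil(test_examples / batch_size)

print(steps_per_epoch, total_train_steps, eval_batches)
```

Under those assumptions the run takes 51 training steps and 5 evaluation batches. This agrees with the training log further down: 2.033 steps per second over 25.084 seconds is about 51 steps.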


## Model Training & Evaluation

Time to actually train the model and evaluate performance.

In [6]:
# Train the Model
trainer.train()

# Evaluate the model
results = trainer.evaluate()

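# Extract and print accuracy, precision, recall, and F1 score.
# Use new names so the metric objects loaded earlier (accuracy, precision,
# recall, f1) are not overwritten; compute_metrics() still needs them.
eval_accuracy = results.get("eval_accuracy")
eval_precision = results.get("eval_precision")
eval_recall = results.get("eval_recall")
eval_f1 = results.get("eval_f1")

print(f"\n\nAccuracy: {eval_accuracy}\nPrecision: {eval_precision}\nRecall: {eval_recall}\nF1 Score: {eval_f1}")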

# Save the model
trainer.save_model("./results/model02")

{'train_runtime': 25.084, 'train_samples_per_second': 31.693, 'train_steps_per_second': 2.033, 'train_loss': 0.3816037271537033, 'epoch': 3.0}


Accuracy: 0.9253731343283582
Precision: 0.9166666666666666
Recall: 0.9428571428571428
F1 Score: 0.9295774647887323
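
Because the test set has only 67 examples, the reported scores pin down the underlying confusion matrix. The sketch below searches all combinations of true positives, false positives, false negatives, and true negatives that sum to 67 and keeps those that reproduce the printed accuracy, precision, and recall. This is a back-calculation from the printed numbers, not output from the original run.

```python
TOTAL = 67  # size of the test split
reported = {"accuracy": 0.9253731343283582,
            "precision": 0.9166666666666666,
            "recall": 0.9428571428571428}

matches = []
for tp in range(1, TOTAL + 1):
    for fp in range(TOTAL - tp + 1):
        for fn in range(TOTAL - tp - fp + 1):
            tn = TOTAL - tp - fp - fn
            scores = {"accuracy": (tp + tn) / TOTAL,
                      "precision": tp / (tp + fp),
                      "recall": tp / (tp + fn)}
            # Compare with a small tolerance to absorb floating-point rounding.
            if all(abs(scores[k] - reported[k]) < 1e-9 for k in reported):
                matches.append({"tp": tp, "fp": fp, "fn": fn, "tn": tn})

print(matches)
```

The only match is 33 true positives, 3 false positives, 2 false negatives, and 29 true negatives, so the model misclassified 5 of the 67 test patents.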


## Conclusion

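This notebook fine-tunes DistilBERT for binary patent classification. Training is quick (about 25 seconds for 3 epochs), and on the 67-example test split the model reaches roughly 0.93 accuracy, 0.92 precision, 0.94 recall, and 0.93 F1. Because the test split is drawn from the downsampled, balanced data, these scores do not reflect performance on the original class distribution of more than 1:700.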

In the other notebook, `patent_linear_regression.ipynb`, I built a binary classifier using a linear regression model trained on just the CPC codes.