# Patent Binary Classification: *Logistic Regression*
## By Jon Templeton

This notebook builds a logistic regression model to classify patents. The dataset `ml_dataset.parquet` includes patent information such as title, abstract, and CPC codes. The goal is to predict a binary label for each patent based on its features.

I train the model on only two columns, `cpc_first_4` and `labels`. While exploring the data, I found that a patent is classified *positive* only if `cpc_first_4 == "H01L"`.
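
One way to check an observation like this is to cross-tabulate the CPC prefix against the label. The sketch below uses a small made-up frame in place of `ml_dataset.parquet`; the rows are hypothetical, and only the column names come from this notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset
toy = pd.DataFrame({
    "cpc_first_4": ["H01L", "H01L", "G06F", "A61K", "H01L", "G06F"],
    "labels":      [1,      0,      0,      0,      1,      0],
})

# Count of each label value per CPC prefix
table = pd.crosstab(toy["cpc_first_4"], toy["labels"])
print(table)

# Set of CPC prefixes that ever carry a positive label
positive_prefixes = set(toy.loc[toy["labels"] == 1, "cpc_first_4"])
print(positive_prefixes)
```

If the set contains a single prefix, every positive row shares it. The reverse does not have to hold: in the toy data, one `H01L` row is still negative.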

Steps:
1. Data Loading and Cleaning
2. Data Preprocessing
3. Model Training
4. Model Evaluation

## Data Loading and Cleaning

Load the dataset and perform initial cleaning: remove duplicate rows, drop unnecessary columns, and remove rows with missing values. The processed subsets can optionally be saved to CSV for easier reading.

In [1]:
import pandas as pd

# Read data from parquet file
df = pd.read_parquet("ml_dataset.parquet")

# Remove duplicates so that there is only one row per patent and CPC prefix
# The "code" column is ignored because only its first 4 characters (cpc_first_4) are needed
df = df.drop_duplicates(subset=['title', 'abstract', 'ucid', "cpc_first_4"], 
                        keep='first')

# Drop all columns except "cpc_first_4" and "labels"
df = df.drop(columns=['title', 'abstract', 'ucid', 'code'])

# Remove the rows with missing values
df = df.dropna()

# Save as a csv for easy reading
#df.to_csv('ml_dataset_no_dup.csv')
# make a csv of all the rows with labels = 1
#df[df['labels'] == 1].to_csv('labels_1_no_dup.csv')

## Data Preprocessing

The dataset is imbalanced, with a class ratio greater than 1:700. To reduce bias toward the majority class, I downsample the majority class so that both classes are the same size before training.
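
A ratio like this can be read off with `value_counts`. The sketch below is a minimal illustration on hypothetical labels (3 positives among 2,103 rows); the counts are made up and are not the real dataset's.

```python
import pandas as pd

# Hypothetical labels: 3 positives and 2,100 negatives (about 1:700)
labels = pd.Series([1] * 3 + [0] * 2100, name="labels")

# Number of rows per class
counts = labels.value_counts()
print(counts)

# Negatives per positive
ratio = counts.loc[0] / counts.loc[1]
print(f"negatives per positive: {ratio:.0f}")
```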


In [2]:
from sklearn.utils import resample
from sklearn.preprocessing import LabelEncoder

# There is an imbalance in the dataset
# Separate majority and minority classes
df_majority = df[df['labels'] == 0]
df_minority = df[df['labels'] == 1]

# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                   replace=False,
                                   n_samples=len(df_minority),  # match minority class
                                   random_state=42)

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

# Encode the "cpc_first_4" column as integers in a new "cpc_4_encoded" column
encoder = LabelEncoder()
df_downsampled['cpc_4_encoded'] = encoder.fit_transform(df_downsampled['cpc_first_4'])
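
`LabelEncoder` assigns integer codes in sorted order, so a single numeric column implies an ordering among CPC prefixes that logistic regression treats as a magnitude. One common alternative is one-hot encoding. The sketch below compares the two on hypothetical prefixes; it is not what this notebook's model uses, and only the column name comes from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical CPC prefixes; only the column name comes from the notebook
toy = pd.DataFrame({"cpc_first_4": ["H01L", "G06F", "A61K", "H01L"]})

# LabelEncoder maps categories to integers in sorted order:
# A61K -> 0, G06F -> 1, H01L -> 2
codes = LabelEncoder().fit_transform(toy["cpc_first_4"])
print(codes)

# One-hot encoding gives each prefix its own 0/1 column instead
one_hot = pd.get_dummies(toy["cpc_first_4"], prefix="cpc", dtype=int)
print(one_hot.columns.tolist())
```

With one-hot columns, the model can learn a separate weight for `H01L` rather than a single slope across arbitrary integer codes.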

## Model Training

Split the preprocessed data into training and testing sets, and train a logistic regression model.

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_downsampled['cpc_4_encoded'], 
                                                    df_downsampled['labels'], 
                                                    test_size=0.2, 
                                                    random_state=42)

# Model Training
model = LogisticRegression()
model.fit(X_train.values.reshape(-1, 1), y_train)

# Make predictions on the test set
predictions = model.predict(X_test.values.reshape(-1, 1))
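
The split above is a plain random split. A possible variation, sketched here on a hypothetical balanced frame, passes `stratify` to `train_test_split` so that both sets keep the same class proportions. Selecting the feature with double brackets also keeps it two-dimensional, which avoids the `reshape(-1, 1)` step. The data below is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced frame: 10 negatives and 10 positives
toy = pd.DataFrame({
    "cpc_4_encoded": list(range(20)),
    "labels": [0] * 10 + [1] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["cpc_4_encoded"]],   # double brackets keep X two-dimensional
    toy["labels"],
    test_size=0.2,
    random_state=42,
    stratify=toy["labels"],   # preserve the 50/50 class split in both sets
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```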

## Model Evaluation

Finally, evaluate the model on the held-out test set using accuracy, precision, recall, and F1 score. Together, these metrics show both how often the model is right overall and how it trades false positives against false negatives.

In [4]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Model Evaluation
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

print(f"Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1 Score: {f1}")

Accuracy: 0.8955223880597015, Precision: 0.8205128205128205, Recall: 1.0, F1 Score: 0.9014084507042254
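
To see how these four numbers relate, the sketch below recomputes them from confusion-matrix counts. The counts are hypothetical: I chose them because they are consistent with the printed metrics, but they are not read from this notebook's actual predictions, and other counts (for example, multiples of these) would give the same values.

```python
# Hypothetical counts chosen to be consistent with the printed metrics;
# they are not taken from the notebook's actual predictions.
tp, fp, fn, tn = 32, 7, 0, 28

accuracy = (tp + tn) / (tp + fp + fn + tn)          # share of correct predictions
precision = tp / (tp + fp)                          # share of predicted positives that are right
recall = tp / (tp + fn)                             # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Under these assumed counts, a recall of 1.0 means no positive patent is missed, while the lower precision comes entirely from negatives predicted as positive.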


## Conclusion

This notebook built and evaluated a logistic regression model for patent classification using only the CPC prefix. On the balanced, downsampled test set, the model reaches an accuracy of about 0.90, a precision of about 0.82, a recall of 1.0, and an F1 score of about 0.90.

In the other notebook, `patent_nlp_classification.ipynb`, I built an NLP binary classifier that uses the columns `["title", "abstract", "labels"]` instead.