# Assignment 2: Green Patent Detection (PatentSBERTa)

### Part A: Baseline Model (Frozen Embeddings), balanced dataset creation only

In this notebook I only create the balanced dataset of 50k claims: 25k green and 25k not green.
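As a minimal sketch of the labeling rule used in this notebook, the toy frame below marks a claim as green when its `Y02*` columns sum to more than zero. The column names `Y02E` and `Y02T` and the example texts are illustrative assumptions, not necessarily what the real dataset contains.

```python
import pandas as pd

# Toy frame: the Y02E / Y02T column names and the texts are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["solar cell", "gear box", "wind turbine", "bottle cap"],
    "Y02E": [1, 0, 1, 0],
    "Y02T": [0, 0, 1, 0],
})

# Collect every column whose name starts with "Y02".
y02_cols = [c for c in toy.columns if c.startswith("Y02")]

# A row is labeled green (1) when the row-wise sum over the Y02 columns is above 0.
toy["is_green_silver"] = (toy[y02_cols].sum(axis=1) > 0).astype(int)

print(toy["is_green_silver"].tolist())  # [1, 0, 1, 0]
```

This rule assumes the `Y02*` columns hold non-negative indicator values, so a positive sum means at least one of them is set.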

In [None]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# 1 Load the dataset (train split)
dataset = load_dataset("AI-Growth-Lab/patents_claims_1.5m_traim_test", split="train")




In [None]:
# 2 Only keep the columns we need for green detection
y02_columns = [col for col in dataset.column_names if col.startswith("Y02")]
columns_to_keep = ['id', 'date', 'text'] + y02_columns


In [None]:
# 3 Drop every column we do not need (via remove_columns) so we avoid loading all columns into pandas
small_dataset = dataset.remove_columns([col for col in dataset.column_names if col not in columns_to_keep])

In [None]:
# 4 Convert only the kept columns to pandas, which is much smaller than the full dataset
df = small_dataset.to_pandas()  # faster than converting every column

In [18]:
# 5 Create the silver label: 1 if the Y02 columns sum to more than 0, else 0
df['is_green_silver'] = (df[y02_columns].sum(axis=1) > 0).astype(int)

# 6 Sample 25k green + 25k non-green (fixed seed for reproducibility)
green_sample = df[df['is_green_silver'] == 1].sample(n=25000, random_state=42)
non_green_sample = df[df['is_green_silver'] == 0].sample(n=25000, random_state=42)

# Concatenate both classes, then shuffle the rows so they are not grouped by class
balanced_df = pd.concat([green_sample, non_green_sample]).sample(frac=1, random_state=42).reset_index(drop=True)

# 7 Save as parquet
table = pa.Table.from_pandas(balanced_df)
pq.write_table(table, "patents_50k_green.parquet")

print("✅ Balanced 50k dataset created quickly!")

✅ Balanced 50k dataset created quickly!
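The sampling step can also be wrapped in a small helper that fails early when a class has too few rows, and the result can be checked for balance. This is only a sketch on toy data: `balanced_sample` is a hypothetical helper name, not part of pandas or of the original notebook.

```python
import pandas as pd

def balanced_sample(df, label_col, n_per_class, seed=42):
    """Draw n_per_class rows from each label value, then shuffle the result.

    Hypothetical helper for illustration only.
    """
    parts = []
    for value, group in df.groupby(label_col):
        # Fail with a clear message instead of letting sample() raise later.
        if len(group) < n_per_class:
            raise ValueError(
                f"class {value} has only {len(group)} rows, need {n_per_class}"
            )
        parts.append(group.sample(n=n_per_class, random_state=seed))
    # Concatenate the per-class samples and shuffle so classes are interleaved.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data: 3 green rows and 7 non-green rows.
toy = pd.DataFrame({
    "text": [f"claim {i}" for i in range(10)],
    "is_green_silver": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

balanced = balanced_sample(toy, "is_green_silver", n_per_class=3)

# Sanity check: both classes should appear exactly n_per_class times.
print(balanced["is_green_silver"].value_counts().to_dict())
```

On the real data the same check would be `balanced_df['is_green_silver'].value_counts()`, which should report 25000 rows for each class.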
