
**Contract Understanding Atticus Dataset (CUAD) v1** — Expert-annotated NLP dataset for legal contract review (NeurIPS 2021).

This notebook walks through the CUAD dataset step by step, covering schema, clause categories, contract types, answers, context length, and data quality. The analysis is aligned with the two research tasks in the project README:
- **Task 1 (T1):** Risk Clause Recognition — identify specific risk clauses (e.g., Termination for Convenience, Uncapped Liability)
- **Task 2 (T2):** Structured Entity Extraction — extract entities like Agreement Date, Parties, Renewal Term as valid JSON

**Dataset Summary:** 13,000+ expert annotations across 510 commercial contracts, 41 clause categories, 25 contract types.
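
To make the T2 goal concrete, the sketch below checks whether a model output is valid JSON with the expected entity fields. The key names (`agreement_date`, `parties`, `renewal_term`) and the example values are hypothetical placeholders chosen for illustration; neither CUAD nor the project README defines this schema.

```python
import json

# Hypothetical T2 target schema: key names are illustrative, not defined by CUAD.
REQUIRED_KEYS = {"agreement_date", "parties", "renewal_term"}


def is_valid_entity_json(raw):
    """Return True if `raw` parses as a JSON object containing every required key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS.issubset(obj)


# Made-up example values, serialized the way a model would be asked to emit them.
example_output = json.dumps({
    "agreement_date": "2019-09-09",
    "parties": ["Acme Corp.", "Example Distributors LLC"],
    "renewal_term": "1 year",
})

print(is_valid_entity_json(example_output))
print(is_valid_entity_json('{"parties": ["Acme Corp."]'))  # truncated JSON
```

A check like this can serve as a cheap pass/fail metric for "valid JSON" before comparing extracted values against the annotations.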

In [3]:
# Step 1: Setup and Load Dataset

from datasets import load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import re
import kagglehub


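# Download the Atticus contract files from Kaggle, then load CUAD for analysis
try:
    # Download latest version
    path = kagglehub.dataset_download("theatticusproject/atticus-open-contract-dataset-aok-beta")
    print("Path to dataset files:", path)
    print("Loading dataset...")
    # Assumption: `cuad` is the SQuAD-style CUAD build from the Hugging Face Hub
    # with `train` and `test` splits; the dataset ID may need adjusting for your setup.
    cuad = load_dataset("theatticusproject/cuad-qa")
except Exception as e:
    print(f"Error loading dataset: {e}")
    raise  # later cells depend on `cuad`, so do not continue silently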

Downloading to C:\Users\aetho\.cache\kagglehub\datasets\theatticusproject\atticus-open-contract-dataset-aok-beta\3.archive...


100%|██████████| 8.70k/8.70k [00:00<00:00, 7.97MB/s]

Extracting files...
Path to dataset files: C:\Users\aetho\.cache\kagglehub\datasets\theatticusproject\atticus-open-contract-dataset-aok-beta\versions\3
Loading dataset...





## Step 2: Dataset Structure & Schema

Explore the data schema, field types, and splits.

In [None]:
# Schema and splits
print("=== Dataset Schema ===\n")
print(cuad["train"].features)
print("\n=== Split Sizes ===")
for split in cuad.keys():
    print(f"  {split}: {len(cuad[split]):,} samples")
print(f"\n  Total: {sum(len(cuad[s]) for s in cuad):,} samples")

# Convert to DataFrame for easier analysis
train_df = pd.DataFrame(cuad["train"])
test_df = pd.DataFrame(cuad["test"])
print("\n=== Sample Record (Train) ===")
sample = train_df.iloc[0]
for col in train_df.columns:
    val = sample[col]
    if isinstance(val, str) and len(val) > 100:
        val = val[:100] + "..."
    print(f"  {col}: {val}")

## Step 3: Clause Categories (41 Categories)

Extract and analyze the 41 CUAD clause categories from the `question` field. Each category maps to one of the research tasks:
- **T1 (Clause Recognition):** Yes/No categories — Non-Compete, Exclusivity, Termination for Convenience, Uncapped Liability, etc.
- **T2 (Entity Extraction):** Entity categories — Document Name, Parties, Agreement Date, Effective Date, Renewal Term, etc.

In [None]:
# Extract clause category from question (format: ...related to "Category Name" that should be...)
def extract_category(q):
    m = re.search(r'related to "([^"]+)" that should be', str(q))
    return m.group(1) if m else "unknown"

for df in (train_df, test_df):
    df["category"] = df["question"].apply(extract_category)

# Combine for full dataset stats
all_categories_train = train_df["category"].value_counts()
all_categories_test = test_df["category"].value_counts()

print("=== 41 Clause Categories (Train) ===\n")
print(all_categories_train.to_string())
print(f"\nUnique categories (train): {train_df['category'].nunique()}")
print(f"Unique categories (test): {test_df['category'].nunique()}")

# Entity vs Yes/No categories (for T1 vs T2)
ENTITY_CATEGORIES = {"Document Name", "Parties", "Agreement Date", "Effective Date", "Expiration Date", 
                     "Renewal Term", "Notice Period to Terminate Renewal", "Governing Law", "Warranty Duration"}
train_df["task_type"] = train_df["category"].apply(lambda c: "Entity (T2)" if c in ENTITY_CATEGORIES else "Clause (T1)")
test_df["task_type"] = test_df["category"].apply(lambda c: "Entity (T2)" if c in ENTITY_CATEGORIES else "Clause (T1)")

print("\n=== Task Type Distribution (Train) ===")
print(train_df["task_type"].value_counts())
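
Because `extract_category` falls back to `"unknown"`, a question that does not follow the expected template would be miscounted silently. The sketch below runs the same regex on two sample strings and reports how many fail to match. The sample questions are assumptions modeled on the template the regex targets, not rows copied from the dataset.

```python
import re

CATEGORY_RE = re.compile(r'related to "([^"]+)" that should be')


def extract_category(question):
    """Return the quoted category name, or "unknown" if the template does not match."""
    match = CATEGORY_RE.search(str(question))
    return match.group(1) if match else "unknown"


# Assumed examples; the first follows the template the regex expects.
questions = [
    'Highlight the parts (if any) of this contract related to "Governing Law" '
    "that should be reviewed by a lawyer.",
    "Which law governs this agreement?",  # deliberately off-template
]

categories = [extract_category(q) for q in questions]
unmatched = [q for q, c in zip(questions, categories) if c == "unknown"]

print(categories)
print(f"{len(unmatched)} of {len(questions)} questions did not match the template")
```

On the real data, a nonzero unmatched count would indicate that the category totals above are incomplete.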

## Step 4: Contract Types (25 Types)

CUAD includes 25 contract types (e.g., Distributor Agreement, License Agreement, Service Agreement). The type is parsed from the `title` field.

In [None]:
# Extract contract type from title (format: COMPANY_DATE-EX-NUM-CONTRACT TYPE)
def extract_contract_type(title):
    parts = str(title).split("-")
    return parts[-1].strip() if len(parts) > 1 else "unknown"

train_df["contract_type"] = train_df["title"].apply(extract_contract_type)
test_df["contract_type"] = test_df["title"].apply(extract_contract_type)

print("=== Contract Types (Train) - Top 20 ===")
print(train_df["contract_type"].value_counts().head(20).to_string())
print(f"\nUnique contract types: {train_df['contract_type'].nunique()}")
print(f"Unique contracts (titles): {train_df['title'].nunique()}")
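
Splitting on the last hyphen truncates hyphenated types: a title ending in `CO-BRANDING AGREEMENT` would yield `BRANDING AGREEMENT`. The sketch below shows one possible alternative that anchors on the exhibit marker (`EX-<number>`) instead and normalizes case. The function name `extract_contract_type_robust` and the sample titles are hypothetical, and real titles may use separators this pattern does not cover.

```python
import re

# An exhibit marker such as "EX-10" or "EX-10.1", followed by a separator.
# The lookbehind avoids matching "EX" at the end of a longer word.
EXHIBIT_RE = re.compile(r"(?<![A-Za-z])EX-\d+(?:\.\d+)*[-_ ]+", re.IGNORECASE)


def extract_contract_type_robust(title):
    """Return the text after the last exhibit marker, title-cased, or "unknown"."""
    text = str(title)
    matches = list(EXHIBIT_RE.finditer(text))
    if matches:
        tail = text[matches[-1].end():]
    else:
        # Fall back to the last hyphen-separated part, as in the cell above.
        parts = text.split("-")
        tail = parts[-1] if len(parts) > 1 else ""
    tail = tail.strip()
    return tail.title() if tail else "unknown"


# Hypothetical titles following the COMPANY_DATE-EX-NUM-CONTRACT TYPE pattern.
sample_titles = [
    "ACMECORP_01_02_2003-EX-10-CO-BRANDING AGREEMENT",
    "ACMECORP_01_02_2003-EX-10.1-Distributor Agreement",
    "untitled",
]
for t in sample_titles:
    print(f"{t!r} -> {extract_contract_type_robust(t)!r}")
```

Title-casing merges variants such as `DISTRIBUTOR AGREEMENT` and `Distributor Agreement` into one bucket, at the cost of lowercasing acronyms (`IP` becomes `Ip`).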

## Step 5: Answer & Entity Analysis

Analyze the `answers` field: the extracted text, the `answer_start` offsets, and whether the answer is empty (no clause found).

In [None]:
# Answer analysis
def get_answer_text(answers):
    txt = answers.get("text", [])
    return txt[0] if txt else ""

def get_answer_start(answers):
    starts = answers.get("answer_start", [])
    return starts[0] if starts else -1

def is_empty_answer(answers):
    txt = answers.get("text", [])
    return len(txt) == 0 or (len(txt) == 1 and (txt[0] == "" or txt[0].strip() == ""))

train_df["answer_text"] = train_df["answers"].apply(get_answer_text)
train_df["answer_start"] = train_df["answers"].apply(get_answer_start)
train_df["empty_answer"] = train_df["answers"].apply(is_empty_answer)

test_df["answer_text"] = test_df["answers"].apply(get_answer_text)
test_df["answer_start"] = test_df["answers"].apply(get_answer_start)
test_df["empty_answer"] = test_df["answers"].apply(is_empty_answer)

print("=== Answer Statistics (Train) ===")
print(f"Empty answers (no clause/entity found): {train_df['empty_answer'].sum():,} ({100*train_df['empty_answer'].mean():.1f}%)")
print(f"Non-empty answers: {(~train_df['empty_answer']).sum():,}")

# Answer length distribution (non-empty)
non_empty = train_df[~train_df["empty_answer"]]
lens = non_empty["answer_text"].str.len()
print(f"\nAnswer text length (non-empty): min={lens.min()}, max={lens.max()}, median={lens.median():.0f}, mean={lens.mean():.1f}")

# Empty rate by category
empty_by_cat = train_df.groupby("category")["empty_answer"].agg(["sum", "count", "mean"])
empty_by_cat["pct_empty"] = (100 * empty_by_cat["mean"]).round(1)
print("\n=== Empty Answer Rate by Category (top 15 highest) ===")
print(empty_by_cat.nlargest(15, "mean")[["sum", "count", "pct_empty"]].to_string())
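
The helpers above read only the first entry of `text` and `answer_start`. If a row carries more than one annotated span, the remaining spans are ignored in these statistics. The sketch below iterates over every span in a SQuAD-style `answers` dict; the helper name `iter_spans` and the toy record are illustrative assumptions.

```python
def iter_spans(answers):
    """Yield (start, text) for every non-blank span in a SQuAD-style answers dict."""
    texts = answers.get("text", [])
    starts = answers.get("answer_start", [])
    for text, start in zip(texts, starts):
        if text.strip():
            yield int(start), text


# Toy records in the assumed {"text": [...], "answer_start": [...]} layout.
multi_span = {"text": ["Acme Corp.", "Example LLC"], "answer_start": [27, 42]}
no_answer = {"text": [], "answer_start": []}

print(list(iter_spans(multi_span)))
print(list(iter_spans(no_answer)))
```

Counting `len(list(iter_spans(a)))` per row would show how many rows have multiple spans before deciding whether first-span statistics are representative.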

## Step 6: Context (Contract Text) Analysis

Contract length affects fine-tuning: longer contexts require more memory and may need truncation.

In [None]:
# Context length (characters and tokens approximation: ~4 chars/token)
train_df["context_len_char"] = train_df["context"].str.len()
train_df["context_len_tokens_approx"] = (train_df["context_len_char"] / 4).astype(int)

print("=== Context Length Statistics (Train) ===")
print(train_df["context_len_char"].describe().to_string())
print(f"\nApprox tokens: min={train_df['context_len_tokens_approx'].min()}, max={train_df['context_len_tokens_approx'].max()}, median={train_df['context_len_tokens_approx'].median():.0f}")

# Percentiles
for p in [50, 90, 95, 99, 100]:
    v = train_df["context_len_char"].quantile(p/100)
    print(f"  {p}th percentile: {v:,.0f} chars (~{v/4:.0f} tokens)")

# Contracts exceeding common context limits
limits = [512, 1024, 2048, 4096, 8192]
for lim in limits:
    exceed = (train_df["context_len_tokens_approx"] > lim).sum()
    print(f"\nExceeding {lim} tokens: {exceed:,} samples ({100*exceed/len(train_df):.1f}%)")
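
When a contract exceeds the context limit, one common option is to split it into overlapping windows and keep track of which windows still contain the answer. The following is a minimal sketch of that idea using the same ~4 chars/token heuristic as above. The function names, the default sizes, and the toy contract are assumptions for illustration; a real pipeline would measure length with the model's tokenizer.

```python
def sliding_windows(context, max_tokens=512, overlap_tokens=128, chars_per_token=4):
    """Split `context` into overlapping character windows.

    Token counts are converted with a rough chars-per-token heuristic,
    so the windows only approximate a real tokenizer's limits.
    """
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    window = max_tokens * chars_per_token
    step = (max_tokens - overlap_tokens) * chars_per_token
    windows = []
    start = 0
    while True:
        end = min(start + window, len(context))
        windows.append((start, context[start:end]))
        if end >= len(context):
            break
        start += step
    return windows


def windows_containing_span(windows, answer_start, answer_text):
    """Return (window_index, offset_within_window) for windows holding the whole span."""
    answer_end = answer_start + len(answer_text)
    hits = []
    for i, (w_start, w_text) in enumerate(windows):
        if w_start <= answer_start and answer_end <= w_start + len(w_text):
            hits.append((i, answer_start - w_start))
    return hits


# Toy contract: 5,000 characters with the answer at offset 3,000.
context = "x" * 3000 + "ANSWER" + "y" * 1994
windows = sliding_windows(context)
hits = windows_containing_span(windows, 3000, "ANSWER")
print(len(windows), hits)
```

Windows without a hit can be treated as empty-answer examples, which adds to the share of empty answers already present in the dataset.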

## Step 7: Data Quality Checks

Verify schema consistency, nulls, and common pitfalls for fine-tuning.

In [None]:
# Data quality
print("=== Null Check (Train) ===")
print(train_df.isnull().sum().to_string())

print("\n=== Duplicate IDs ===")
dup_ids = train_df[train_df.duplicated(subset=["id"], keep=False)]
print(f"Duplicate IDs: {dup_ids['id'].nunique()} unique IDs with duplicates, {len(dup_ids)} rows")

print("\n=== Answer Format Consistency ===")
# Check answer_start aligns with context
def answer_in_context(row):
    if row["empty_answer"]:
        return True
    txt = row["answer_text"]
    start = row["answer_start"]
    ctx = row["context"]
    if start < 0 or start >= len(ctx):
        return False
    return ctx[start:start+len(txt)] == txt

align_check = train_df[~train_df["empty_answer"]].apply(answer_in_context, axis=1)
print(f"Answer spans correctly in context: {align_check.sum()}/{len(align_check)} ({100*align_check.mean():.2f}%)")
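
If some spans fail the alignment check, one possible repair is to search the context for the answer text and pick the occurrence closest to the recorded offset. The sketch below illustrates that approach on a made-up sentence; `realign_span` is a hypothetical helper, and whether such a repair is appropriate depends on why the offsets drifted.

```python
def realign_span(context, answer_text, answer_start):
    """Return a start offset where `answer_text` occurs, or -1 if it is absent.

    The recorded offset is kept when it already matches; otherwise the
    occurrence nearest to it is chosen.
    """
    if not answer_text:
        return -1
    end = answer_start + len(answer_text)
    if answer_start >= 0 and context[answer_start:end] == answer_text:
        return answer_start
    candidates = []
    pos = context.find(answer_text)
    while pos != -1:
        candidates.append(pos)
        pos = context.find(answer_text, pos + 1)
    if not candidates:
        return -1
    return min(candidates, key=lambda p: abs(p - answer_start))


# Made-up context in which the answer text appears twice.
context = (
    "This Agreement is governed by the laws of Delaware. "
    "The laws of Delaware apply."
)
answer = "laws of Delaware"
first = context.find(answer)
second = context.rfind(answer)

print(realign_span(context, answer, first + 3) == first)
print(realign_span(context, answer, second - 2) == second)
print(realign_span(context, "laws of Nevada", 0))
```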

## Step 8: Visualizations

Charts for clause distribution, contract types, answer lengths, and context lengths.

In [None]:
# Plot settings
sns.set_style("whitegrid")
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Top 20 clause categories
cat_counts = train_df["category"].value_counts().head(20)
axes[0,0].barh(range(len(cat_counts)), cat_counts.values)
axes[0,0].set_yticks(range(len(cat_counts)))
axes[0,0].set_yticklabels(cat_counts.index, fontsize=8)
axes[0,0].invert_yaxis()
axes[0,0].set_title("Top 20 Clause Categories (Train)")
axes[0,0].set_xlabel("Count")

# 2. Empty vs non-empty answers by task type
task_empty = train_df.groupby("task_type")["empty_answer"].value_counts().unstack(fill_value=0)
task_empty.plot(kind="bar", ax=axes[0,1], stacked=True, color=["#2ecc71", "#e74c3c"])
axes[0,1].set_title("Answer Presence by Task Type (T1 vs T2)")
axes[0,1].set_ylabel("Count")
axes[0,1].legend(["Has answer", "Empty"])
axes[0,1].tick_params(axis="x", rotation=0)

# 3. Context length distribution
axes[1,0].hist(train_df["context_len_char"] / 1000, bins=50, edgecolor="black", alpha=0.7)
axes[1,0].set_title("Context Length Distribution (chars, in thousands)")
axes[1,0].set_xlabel("Context length (k chars)")
axes[1,0].set_ylabel("Frequency")

# 4. Top 15 contract types
ct = train_df["contract_type"].value_counts().head(15)
axes[1,1].barh(range(len(ct)), ct.values)
axes[1,1].set_yticks(range(len(ct)))
axes[1,1].set_yticklabels(ct.index, fontsize=8)
axes[1,1].invert_yaxis()
axes[1,1].set_title("Top 15 Contract Types (Train)")
axes[1,1].set_xlabel("Count")

plt.tight_layout()
plt.show()

In [None]:
# Answer length distribution (non-empty only)
fig, ax = plt.subplots(figsize=(10, 4))
lens = train_df[~train_df["empty_answer"]]["answer_text"].str.len()
ax.hist(lens.clip(upper=500), bins=50, edgecolor="black", alpha=0.7)
ax.set_title("Answer Text Length (non-empty, capped at 500 chars)")
ax.set_xlabel("Character length")
ax.set_ylabel("Frequency")
plt.show()

## Step 9: Summary & Implications for Fine-Tuning

Key takeaways from the exploration for LoRA/QLoRA fine-tuning (as per project README):

In [None]:
# Summary
summary = {
    "Total train samples": len(train_df),
    "Total test samples": len(test_df),
    "Unique clause categories": train_df["category"].nunique(),
    "Unique contracts (titles)": train_df["title"].nunique(),
    "Empty answer rate (%)": round(100 * train_df["empty_answer"].mean(), 1),
    "Median context length (chars)": int(train_df["context_len_char"].median()),
    "Entity (T2) samples": (train_df["task_type"] == "Entity (T2)").sum(),
    "Clause (T1) samples": (train_df["task_type"] == "Clause (T1)").sum(),
}
print("=== CUAD Exploration Summary ===\n")
for k, v in summary.items():
    # pandas returns NumPy integers from .sum(), so check for both integer types
    print(f"  {k}: {v:,}" if isinstance(v, (int, np.integer)) else f"  {k}: {v}")
print("\nImplications:")
print("  • T1 (Clause Recognition): Yes/No + span extraction; many empty answers.")
print("  • T2 (Entity Extraction): Structured entities (dates, parties); output as JSON.")
print("  • Context truncation: Most contracts exceed 512 tokens; 4096+ recommended.")
print("  • Imbalance: Category distribution varies; consider sampling for training.")
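
As one way to act on the imbalance and empty-answer points above, the sketch below keeps every row that has an answer and samples a capped number of empty rows. This is an illustrative strategy under stated assumptions, not the project's training recipe: the helper name `downsample_empty`, the `max_empty_ratio` parameter, and the toy rows are all hypothetical.

```python
import random


def downsample_empty(rows, max_empty_ratio=1.0, seed=0):
    """Keep all rows with an answer and at most `max_empty_ratio` times as many empty rows."""
    positives = [r for r in rows if r["answer_text"]]
    empties = [r for r in rows if not r["answer_text"]]
    keep = min(len(empties), int(max_empty_ratio * len(positives)))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    mixed = positives + rng.sample(empties, keep)
    rng.shuffle(mixed)
    return mixed


# Toy rows: 3 with answers, 10 without.
rows = [{"id": i, "answer_text": "some clause"} for i in range(3)]
rows += [{"id": 100 + i, "answer_text": ""} for i in range(10)]

balanced = downsample_empty(rows, max_empty_ratio=1.0)
n_empty = sum(1 for r in balanced if not r["answer_text"])
print(len(balanced), n_empty)
```

Applying the same cap within each category, rather than globally, would also address the per-category imbalance.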