This notebook requires a GPU. In Colab, go to Runtime → Change runtime type → T4 GPU before starting.

# Cell 1: Setup & GPU Check

In [1]:

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Check GPU availability (CRITICAL for this notebook)
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device("cpu")
    print("✗ WARNING: No GPU detected. Fine-tuning will be very slow!")
    print("  Go to Runtime → Change runtime type → T4 GPU")

print(f"\nUsing device: {device}")

Mounted at /content/drive
✓ GPU available: Tesla T4
  Memory: 15.8 GB

Using device: cuda


# Cell 2: Define Paths & Load Data

In [2]:
import os
import pandas as pd

# Define paths (same as previous notebooks)
BASE_PATH = '/content/drive/MyDrive/same_words_different_worlds'

PATHS = {
    'raw': os.path.join(BASE_PATH, 'data/raw'),
    'processed': os.path.join(BASE_PATH, 'data/processed'),
    'outputs': os.path.join(BASE_PATH, 'data/outputs'),
    'models': os.path.join(BASE_PATH, 'models'),
    'figures': os.path.join(BASE_PATH, 'figures'),
}

# Load cleaned data
df = pd.read_csv(os.path.join(PATHS['processed'], '01_ai_tweets_clean.csv'))

# Ensure clean_text has no NaN values (critical for tokenizer)
df['clean_text'] = df['clean_text'].fillna("").astype(str)

# Drop empty and very short texts (10 characters or fewer)
df = df[df['clean_text'].str.len() > 10].copy()

print(f"Loaded {len(df):,} tweets for fine-tuning")
print(f"Sample text: {df['clean_text'].iloc[0][:100]}...")

Loaded 3,201 tweets for fine-tuning
Sample text: Instead of unfunded executive orders and gauzy principles, American leadership in AI demands a compr...


# Cell 3: Install Libraries & Load Tokenizer

In [3]:
# Install Hugging Face libraries
!pip install -q transformers datasets accelerate

# Import libraries
from transformers import AutoTokenizer, AutoModelForMaskedLM
from transformers import DataCollatorForLanguageModeling
from transformers import TrainingArguments, Trainer
from datasets import Dataset

# Load RoBERTa tokenizer
MODEL_NAME = "roberta-base"
print(f"Loading tokenizer: {MODEL_NAME}")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

print(f"✓ Tokenizer loaded")
print(f"  Vocabulary size: {tokenizer.vocab_size:,}")
print(f"  Max length: {tokenizer.model_max_length}")

# Test tokenization
sample = "AI safety regulation requires bipartisan cooperation."
tokens = tokenizer.tokenize(sample)
print(f"\nSample tokenization:")
print(f"  Input: '{sample}'")
print(f"  Tokens: {tokens}")

Loading tokenizer: roberta-base


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



✓ Tokenizer loaded
  Vocabulary size: 50,265
  Max length: 512

Sample tokenization:
  Input: 'AI safety regulation requires bipartisan cooperation.'
  Tokens: ['AI', 'Ġsafety', 'Ġregulation', 'Ġrequires', 'Ġbipartisan', 'Ġcooperation', '.']
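
The `Ġ` prefix in the tokens above is how RoBERTa's byte-level BPE marks a token that was preceded by a space. A minimal sketch of that convention, assuming plain ASCII input like the sample sentence (the helper name `detokenize` is hypothetical; `tokenizer.convert_tokens_to_string` is the real method and also handles non-ASCII bytes):

```python
def detokenize(tokens):
    """Rebuild text from byte-level BPE tokens (ASCII-only simplification)."""
    # 'Ġ' stands for a leading space; tokens without it attach to the previous token.
    return "".join(tokens).replace("Ġ", " ")

tokens = ['AI', 'Ġsafety', 'Ġregulation', 'Ġrequires', 'Ġbipartisan', 'Ġcooperation', '.']
print(detokenize(tokens))
```

This is why `'.'` carries no `Ġ`: it attaches directly to "cooperation" in the original sentence.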


# Cell 4: Create Dataset & Tokenize

In [4]:
# Convert DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df[['clean_text']])

print(f"Dataset created: {len(dataset)} samples")

# Define tokenization function
def tokenize_function(examples):
    """Tokenize texts with truncation and padding."""
    return tokenizer(
        examples["clean_text"],
        truncation=True,
        padding="max_length",
        max_length=128,  # Tweets are short, 128 tokens is sufficient
        return_special_tokens_mask=True  # Needed for MLM
    )

# Apply tokenization
print("Tokenizing dataset...")
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["clean_text"],  # Remove raw text, keep only tokens
    desc="Tokenizing"
)

print(f"\n✓ Tokenization complete")
print(f"  Dataset structure: {tokenized_dataset}")
print(f"  Sample input_ids (first 20): {tokenized_dataset[0]['input_ids'][:20]}")

Dataset created: 3201 samples
Tokenizing dataset...




✓ Tokenization complete
  Dataset structure: Dataset({
    features: ['input_ids', 'attention_mask', 'special_tokens_mask'],
    num_rows: 3201
})
  Sample input_ids (first 20): [0, 23271, 9, 9515, 3194, 196, 1031, 3365, 8, 25665, 5144, 7797, 6, 470, 1673, 11, 4687, 4501, 10, 5145]
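
With `padding="max_length"`, every tweet occupies 128 positions regardless of its real length. One rough way to check whether `max_length=128` fits the data is to measure how many tweets would be truncated and how much of the tensor is padding. The sketch below assumes per-tweet token counts have already been collected; the function name `padding_report` and the example lengths are hypothetical, not measurements from this dataset.

```python
def padding_report(lengths, max_length=128):
    """Summarize truncation and padding for fixed-length padding."""
    truncated = sum(1 for n in lengths if n > max_length)
    # Tokens actually kept per example after truncation
    kept = [min(n, max_length) for n in lengths]
    total_slots = max_length * len(lengths)
    return {
        "truncated_fraction": truncated / len(lengths),
        "padding_fraction": 1 - sum(kept) / total_slots,
    }

# Hypothetical token counts, for illustration only
example_lengths = [32, 64, 128, 160]
report = padding_report(example_lengths)
print(report)
```

Real lengths could be obtained with something like `[len(ids) for ids in tokenizer(texts)["input_ids"]]` before any padding is applied. If the padding fraction turns out to be high, a common alternative is to drop `padding` from the tokenization step and let the data collator pad each batch to its longest example.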


# Cell 5: Setup Data Collator for MLM

In [5]:
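# Data Collator for Masked Language Modeling (MLM)
# Selects 15% of non-special tokens in each batch as prediction targets.
# Of those, 80% become <mask>, 10% become a random token, 10% stay unchanged.
# The model learns to predict the original token at each selected position.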

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,              # Enable masked language modeling
    mlm_probability=0.15   # Standard BERT/RoBERTa masking rate
)

print("✓ Data Collator configured")
print("  Task: Masked Language Modeling (MLM)")
print("  Masking probability: 15%")
print("""
  How MLM works:
  - Input:  "AI <mask> regulation requires bipartisan cooperation"
  - Target: "AI safety regulation requires bipartisan cooperation"
  - The model learns contextual word meanings by predicting masked words
""")

✓ Data Collator configured
  Task: Masked Language Modeling (MLM)
  Masking probability: 15%

  How MLM works:
  - Input:  "AI <mask> regulation requires bipartisan cooperation"
  - Target: "AI safety regulation requires bipartisan cooperation"
  - The model learns contextual word meanings by predicting masked words
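
The collator's masking rule can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `mask_tokens` is hypothetical, the 80/10/10 split follows the standard BERT/RoBERTa recipe, and the `<mask>` id is taken to be 50264 for roberta-base (`tokenizer.mask_token_id` is the authoritative value).

```python
import random

def mask_tokens(input_ids, special_tokens_mask, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Simplified MLM masking: returns (masked_ids, labels)."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 = position ignored by the loss
    for i, (tok, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        if is_special or rng.random() >= mlm_probability:
            continue  # never select special tokens; skip ~85% of the rest
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels

# First tokens of the sample tweet from Cell 4, closed with </s> (id 2) for illustration
ids = [0, 23271, 9, 9515, 3194, 196, 1031, 3365, 2]
special = [1, 0, 0, 0, 0, 0, 0, 0, 1]
masked, labels = mask_tokens(ids, special, mask_id=50264, vocab_size=50265,
                             rng=random.Random(0))
print(masked)
print(labels)
```

Because the selection is redrawn every time a batch is built, the model sees different masked positions for the same tweet in each epoch. That is the reason the tokenization step returns `special_tokens_mask`: it tells the collator which positions must never be selected.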



# Cell 6: Load Model & Configure Training

In [6]:
# Load the pre-trained RoBERTa model for Masked LM
print(f"Loading model: {MODEL_NAME}")
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Move model to GPU
model = model.to(device)

print(f"✓ Model loaded")
print(f"  Parameters: {model.num_parameters():,}")

# Configure training arguments
training_args = TrainingArguments(
    output_dir="./roberta_finetuning_temp",  # Temporary local directory
    overwrite_output_dir=True,

    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,

    # Logging & saving
    logging_steps=50,
    save_strategy="epoch",

    # Performance
    fp16=True,  # Mixed precision for faster training on T4
    dataloader_num_workers=2,

    # Disable wandb/tensorboard logging
    report_to="none"
)

print(f"\n✓ Training configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Total optimization steps: ~{len(tokenized_dataset) // 16 * 3}")

Loading model: roberta-base



✓ Model loaded
  Parameters: 124,697,433

✓ Training configuration:
  Epochs: 3
  Batch size: 16
  Learning rate: 2e-05
  Total optimization steps: ~600
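
The `~600` estimate uses floor division, which drops the final partial batch of each epoch. A small sketch of the step count and learning-rate schedule, assuming a single GPU, no gradient accumulation, the default linear scheduler, and warmup steps rounded up from `warmup_ratio` (the function name `schedule` is hypothetical):

```python
import math

def schedule(num_examples, batch_size, epochs, warmup_ratio, base_lr):
    """Return (total_steps, warmup_steps, lr_at) for linear warmup + linear decay."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch counts
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)

    def lr_at(step):
        if step < warmup_steps:
            return base_lr * step / warmup_steps  # linear ramp from 0 to base_lr
        # linear decay from base_lr down to 0 at the final step
        return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    return total_steps, warmup_steps, lr_at

total, warmup, lr_at = schedule(3201, 16, 3, 0.1, 2e-5)
print(total, warmup)  # 603 61
for step in (0, warmup, 300, total):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the learning rate reaches its peak of 2e-5 at step 61 and decays linearly to zero at step 603.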


# Cell 7: Initialize Trainer & Train

In [7]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer
)

print("✓ Trainer initialized")
print("\nStarting fine-tuning... (This will take ~5-10 minutes)")
print("="*60)

# Train the model
trainer.train()

print("="*60)
print("✓ Fine-tuning complete!")



✓ Trainer initialized

Starting fine-tuning... (This will take ~5-10 minutes)


Step,Training Loss
50,1.8396
100,1.7686
150,1.7029
200,1.6038
250,1.5951
300,1.5914
350,1.6027
400,1.5994
450,1.4531
500,1.5082


✓ Fine-tuning complete!


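The key indicator: training loss fell from 1.84 at step 50 to roughly 1.50 at step 500, the last step shown in the log, which suggests the model adapted to patterns in congressional AI discourse. The curve is noisy (1.45 at step 450, then 1.51 at step 500) because the masked positions are redrawn for every batch. No held-out set was evaluated, so the drop shows fit to the training tweets rather than generalization. Training completed in about 2.5 minutes.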
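
MLM loss is a cross-entropy, so exponentiating it gives perplexity over the masked positions, which is often easier to interpret. A short sketch using the values logged in Cell 7 (each logged value is an average over the preceding logging window, so these are approximate):

```python
import math

# Training losses as logged in Cell 7 (step -> loss)
logged = {50: 1.8396, 100: 1.7686, 150: 1.7029, 200: 1.6038, 250: 1.5951,
          300: 1.5914, 350: 1.6027, 400: 1.5994, 450: 1.4531, 500: 1.5082}

# Perplexity over masked positions is exp(cross-entropy loss)
perplexity = {step: math.exp(loss) for step, loss in logged.items()}

first, last = min(logged), max(logged)
print(f"step {first}: ppl {perplexity[first]:.2f}")
print(f"step {last}: ppl {perplexity[last]:.2f}")
print(f"relative drop: {1 - perplexity[last] / perplexity[first]:.1%}")
```

In these terms, the model's uncertainty at a masked position moves from about 6.3 equally likely candidates to about 4.5. Measuring the same quantity on tweets held out from training would be needed to confirm the gain carries over to unseen text.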

# Cell 8: Save Fine-Tuned Model

In [8]:
# Save the fine-tuned model to Google Drive
save_path = os.path.join(PATHS['models'], 'fine_tuned_roberta')

print(f"Saving model to: {save_path}")

# Save model and tokenizer
trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)

# Verify save
saved_files = os.listdir(save_path)
print(f"\n✓ Model saved successfully!")
print(f"  Files saved: {saved_files}")

# Calculate size
total_size = sum(os.path.getsize(os.path.join(save_path, f)) for f in saved_files)
print(f"  Total size: {total_size / 1e6:.1f} MB")

Saving model to: /content/drive/MyDrive/same_words_different_worlds/models/fine_tuned_roberta

✓ Model saved successfully!
  Files saved: ['config.json', 'model.safetensors', 'tokenizer_config.json', 'special_tokens_map.json', 'vocab.json', 'merges.txt', 'tokenizer.json', 'training_args.bin']
  Total size: 503.6 MB
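
As a sanity check, the saved size can be compared with the parameter count from Cell 6. The arithmetic below assumes the weights are stored as 32-bit floats; `fp16=True` enables mixed-precision compute during training, and the assumption here is that it leaves the stored weights in float32.

```python
num_parameters = 124_697_433   # from Cell 6
bytes_per_param = 4            # float32 (assumed)

weights_mb = num_parameters * bytes_per_param / 1e6
reported_total_mb = 503.6      # from the cell output above
print(f"Expected weights size: {weights_mb:.1f} MB")
print(f"Remainder (tokenizer, config, etc.): {reported_total_mb - weights_mb:.1f} MB")
```

The expected weight size is close to the 499M `model.safetensors` download, and the few remaining megabytes are plausibly accounted for by the tokenizer and configuration files listed above.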


# Cell 9: Verify Model with Test Prediction

In [9]:
# Quick verification: Test the fine-tuned model's mask prediction
from transformers import pipeline

# Load the saved model for testing
fill_mask = pipeline("fill-mask", model=save_path, device=0)

# Test sentences relevant to our domain
test_sentences = [
    "AI <mask> is a critical concern for Congress.",
    "We need <mask> regulation of artificial intelligence.",
    "China poses a <mask> to American AI leadership.",
]

print("="*60)
print("MODEL VERIFICATION: Fill-Mask Predictions")
print("="*60)

for sentence in test_sentences:
    print(f"\nInput: {sentence}")
    predictions = fill_mask(sentence)
    top_3 = [f"{p['token_str'].strip()} ({p['score']:.2%})" for p in predictions[:3]]
    print(f"Top 3 predictions: {', '.join(top_3)}")

Device set to use cuda:0


MODEL VERIFICATION: Fill-Mask Predictions

Input: AI <mask> is a critical concern for Congress.
Top 3 predictions: privacy (30.71%), proliferation (9.67%), innovation (4.76%)

Input: We need <mask> regulation of artificial intelligence.
Top 3 predictions: better (16.81%), more (12.94%), stronger (7.62%)

Input: China poses a <mask> to American AI leadership.
Top 3 predictions: threat (92.78%), challenge (5.20%), risk (1.17%)


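The predictions fit the domain:
* "AI privacy" as a congressional concern ✓
* "better/more/stronger regulation": policy language ✓
* "China poses a threat" at 92.78%: consistent with a national security framing ✓

These completions may also be likely under the unmodified roberta-base, so attributing them to fine-tuning requires running the same prompts through the base model and comparing scores.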
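
A side-by-side comparison with the base model is one way to separate what fine-tuning changed from what RoBERTa already knew. The helper below is a sketch under assumptions: `compare_fill_mask` is a hypothetical name, and it expects two callables that return the same list-of-dicts shape as the fill-mask pipeline in Cell 9.

```python
def compare_fill_mask(sentences, base_fill, tuned_fill, k=3):
    """Collect top-k fill-mask predictions from two models side by side.

    base_fill / tuned_fill: callables that take a sentence and return a list of
    dicts with 'token_str' and 'score' keys, sorted by score.
    """
    rows = []
    for sentence in sentences:
        base = {p["token_str"].strip(): p["score"] for p in base_fill(sentence)[:k]}
        tuned = {p["token_str"].strip(): p["score"] for p in tuned_fill(sentence)[:k]}
        # Order tokens by the fine-tuned model's score, highest first
        for token in sorted(set(base) | set(tuned), key=lambda t: -tuned.get(t, 0.0)):
            rows.append({
                "sentence": sentence,
                "token": token,
                "base_score": base.get(token),    # None if not in base top-k
                "tuned_score": tuned.get(token),  # None if not in tuned top-k
            })
    return rows
```

In this notebook the two callables could be `pipeline("fill-mask", model="roberta-base", device=0)` and the `fill_mask` pipeline from Cell 9, called with `test_sentences`. A token whose score rises clearly after fine-tuning, or that appears only in the fine-tuned top-k, is better evidence of domain adaptation than a high score alone.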

# Wrap up

In [10]:
print("="*60)
print("NOTEBOOK 03 COMPLETE ✓")
print("="*60)
print("""
DOMAIN ADAPTATION RESULTS:

1. MODEL TRAINED:
   - Base: roberta-base (124M parameters)
   - Task: Masked Language Modeling (MLM)
   - Data: 3,201 congressional AI tweets

2. TRAINING METRICS:
   - Epochs: 3
   - Training time: ~2.5 minutes
   - Loss: 1.84 → 1.50 (improved)

3. VERIFICATION:
   - Model predicts domain-appropriate words
   - Learned congressional AI discourse patterns
   - "China poses a [threat]" - 92.78% confidence

4. MODEL SAVED:
   - Location: models/fine_tuned_roberta/
   - Size: 503.6 MB

WHY THIS MATTERS:
   - Pre-trained RoBERTa knows general English
   - Fine-tuned RoBERTa knows *congressional AI discourse*
   - Embeddings will capture domain-specific semantics

NEXT STEPS:
   → Notebook 04: Extract embeddings for all tweets
   → Notebook 05: Measure semantic distance between parties
""")

NOTEBOOK 03 COMPLETE ✓

DOMAIN ADAPTATION RESULTS:

1. MODEL TRAINED:
   - Base: roberta-base (124M parameters)
   - Task: Masked Language Modeling (MLM)
   - Data: 3,201 congressional AI tweets
   
2. TRAINING METRICS:
   - Epochs: 3
   - Training time: ~2.5 minutes
   - Loss: 1.84 → 1.50 (improved)
   
3. VERIFICATION:
   - Model predicts domain-appropriate words
   - Learned congressional AI discourse patterns
   - "China poses a [threat]" - 92.78% confidence
   
4. MODEL SAVED:
   - Location: models/fine_tuned_roberta/
   - Size: 503.6 MB

WHY THIS MATTERS:
   - Pre-trained RoBERTa knows general English
   - Fine-tuned RoBERTa knows *congressional AI discourse*
   - Embeddings will capture domain-specific semantics
   
NEXT STEPS:
   → Notebook 04: Extract embeddings for all tweets
   → Notebook 05: Measure semantic distance between parties

