
@jeremymanning (Member)

Summary

Fixes #28 - Resolves the ValueError raised when the t-test visualization computes axis limits from NaN/Inf values.

Changes

Core Fix (llm_stylometry/visualization/t_tests.py):

  1. Data Validation - Added three-tier check in calculate_t_statistics() (sketched after this list):

    • If n ≥ 2 for both groups: run the t-test
    • If 0 < n < 2 (a single sample): log insufficient data, append NaN
    • If n = 0: log no data, append NaN
  2. Robust Axis Limits - Multi-layer protection in both figure functions:

    • Filter out NaN/Inf before computing limits
    • Add padding for visualization
    • Ensure threshold line (p<0.001) always visible
    • Final validation for finite limits
    • Sensible defaults (-1.0 to 5.0) as fallback
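A minimal sketch of the validation tier (the helper name and log messages are illustrative; the real logic lives in calculate_t_statistics() in t_tests.py):

```python
import numpy as np
from scipy import stats

def safe_t_statistic(group_a, group_b, logger):
    """Three-tier check: only run the t-test when both groups have n >= 2."""
    if len(group_a) >= 2 and len(group_b) >= 2:
        t_stat, _ = stats.ttest_ind(group_a, group_b)
        return t_stat
    if len(group_a) == 0 or len(group_b) == 0:
        logger.warning("No data for at least one group; appending NaN")
    else:
        logger.warning("Insufficient data (n < 2) for t-test; appending NaN")
    return np.nan
```

The axis-limit layer is sketched further down, alongside the issue #25 commit that first made the limits dynamic.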

New Test Suite (tests/test_t_test_edge_cases.py):

  • 11 comprehensive tests with real data (no mocks)
  • Tests single samples, zero variance, outliers, empty data, etc.

Test Results

✅ All 11 edge case tests pass
✅ Existing t-test visualization tests pass
✅ Production data tests pass (baseline + function variant models)

Root Causes Addressed

  1. Insufficient sample sizes - Now validates n≥2 before t-test
  2. Zero variance groups - Filters Inf values from axis calculations
  3. Empty data groups - Handles missing epochs gracefully
  4. Mixed valid/invalid data - Filters invalid values while preserving valid data

The fix ensures t-test visualizations handle all edge cases gracefully without raising ValueError, while maintaining accurate axis limits for valid data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

jeremymanning and others added 22 commits September 30, 2025 23:03
Issue 1: Add model weight files to .gitignore
- Added model.pth and pytorch_model.bin to ignored files
- Prevents git pull conflicts on GPU cluster after training
- Ensures only config/logs are tracked, not large weight files

Issue 2: Fix variant flag passing in remote_train.sh
- Added VARIANT_ARG to SSH environment variables (line 94)
- Added VARIANT_ARG to training script template (line 177)
- Variant flags now properly passed to run_llm_stylometry.sh on remote server
- Tested with --content-only, --function-only, --part-of-speech flags

Both fixes tested and verified working.
Issue #19: Update outdated model counts
- Repository structure now shows 80 baseline + 240 variants = 320 total
- Updated model count references throughout (lines 34, 270, 328, 332)
- Added note in Model Configuration that same config applies to all variants
- Clarified training pipeline trains 80 models per condition

Issue #20: Document sync_models.sh variant flags
- Added comprehensive variant flag documentation (lines 438-444)
- Documented stackable flags with examples
- Explained verification process and behavior
- Updated "How it works" section to reflect variant support

Both issues verified and ready to close.
Issue #21: More efficient model syncing

**Problem:**
Previously the sync backed up the ENTIRE models/ directory even when syncing a
single variant, requiring users to restore the backup to get the other variants back.

**Solution:**
- Only backup all models when doing full sync (--all flag)
- For partial syncs, merge new models into existing directory
- Preserves models from other variants that aren't being synced
- Added clear message about merge behavior

**Example:**
User has baseline + content locally, runs: ./sync_models.sh --function-only
- Old behavior: Backup all 160 models, download 80, need to restore
- New behavior: Merge 80 function models into existing 160 models = 240 total

**Safety:**
- rsync naturally handles file replacement safely
- Existing models from other variants untouched
- Full backup still available with --all flag

Closes #21
Major improvements:
- Consolidated Quick Start section (merged CLI and Python API)
- Streamlined Testing section (moved details to CONTRIBUTING.md)
- Reorganized Analysis Variants section (more concise)
- Removed redundant sections (Available Figures, Statistical Analysis)
- Removed "(no mocks)" comments from user-facing docs

New files:
- CONTRIBUTING.md: Comprehensive contribution guide with testing philosophy
- .github/PULL_REQUEST_TEMPLATE.md: PR checklist with best practices

Closes #22

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Improvements:
- Removed training commands from Quick Start (now references full section)
- Consolidated all training instructions in "Training Models from Scratch"
- Removed redundant variant training examples from Analysis Variants section
- Added comprehensive variant training examples to both local and remote training sections
- Streamlined section headers and organization

Training instructions now appear in one place with complete documentation for:
- Baseline and variant training
- Local and remote training
- Resume and GPU limit options

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Issue: Script failed on macOS with "declare: -A: invalid option" error
Root cause: Associative arrays (declare -A) require Bash 4.0+, but macOS ships with Bash 3.2

Fix:
- Replaced associative arrays with simple named variables
- VARIANT_COUNTS[BASELINE] → BASELINE_COUNT
- VARIANT_MISSING[BASELINE] → BASELINE_MISSING
- VARIANT_STATUS[BASELINE] → BASELINE_STATUS
- Applied same pattern for CONTENT, FUNCTION, and POS variants

The script now works on macOS default Bash 3.2 and remains compatible with Bash 4+.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…solidation

Issues fixed:
1. Training new variants was deleting ALL existing models
2. Consolidation always saved to model_results.pkl regardless of variant

Changes to generate_figures.py:
- Only delete models for the specific variant being trained
- Baseline training: Only removes baseline models (no _variant= in name)
- Variant training: Only removes models with that specific variant
- Auto-detect data path based on variant (model_results_{variant}.pkl)
- Pass --variant flag to consolidation script

Changes to consolidate_model_results.py (filtering convention sketched after this list):
- Added --variant flag to filter models by variant
- Auto-determine output path: model_results_{variant}.pkl for variants
- Filter model directories based on variant before consolidation
- Baseline: models without _variant= in name
- Variants: models with _variant={variant}_ in name
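A minimal sketch of the variant-filtering convention shared by both scripts (helper names and the results-path helper are assumptions; only the _variant= naming rule and the model_results_{variant}.pkl output come from this commit):

```python
from pathlib import Path

def filter_model_dirs(models_root, variant=None):
    """Keep only the model directories belonging to one training condition.

    Baseline models have no '_variant=' in their directory name;
    variant models contain '_variant={variant}_'.
    """
    dirs = [d for d in Path(models_root).iterdir() if d.is_dir()]
    if variant is None:  # baseline: exclude anything variant-tagged
        return [d for d in dirs if "_variant=" not in d.name]
    return [d for d in dirs if f"_variant={variant}_" in d.name]

def results_path(variant=None):
    """Variant-specific consolidation target."""
    return "model_results.pkl" if variant is None else f"model_results_{variant}.pkl"
```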

Changes to sync_models.sh:
- Added ERROR_MSG parsing for better error reporting
- Report specific path when models directory not found

This allows training all 4 conditions (baseline + 3 variants) without
models overwriting each other, and keeps results in separate files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…fic results

Issues fixed:
1. rsync was not properly excluding non-matching files, potentially overwriting local models
2. Only model_results.pkl was downloaded, not variant-specific files

Changes:
1. Fixed rsync include/exclude patterns:
   - Added explicit --exclude="*" to exclude everything not matched by includes
   - Added include pattern for directory contents: */***
   - Now only syncs the specific variant requested, preserving others locally

2. Download variant-specific result files:
   - Baseline: model_results.pkl
   - Variants: model_results_{variant}.pkl
   - Loops through all synced variants and downloads corresponding files
   - Provides helpful error messages if files not found

3. Consolidated function variant results to model_results_function.pkl

This ensures partial syncs (e.g., --content-only) don't delete baseline or other
variant models stored locally, and all result files are properly downloaded.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
NumPy 2.x is required by newer versions of dependencies (scipy, pandas, etc.)
that use numpy._core.numeric. The numpy<2 constraint was causing import
errors when generating figures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Removed requirements.txt (redundant with pyproject.toml)
- Updated numpy requirement from <2 to >=2.0 in pyproject.toml
- pyproject.toml is now the single source of truth for dependencies

The numpy<2 constraint was causing 'No module named numpy._core.numeric'
errors when generating figures. NumPy 2.x is required by modern versions
of scipy, pandas, and other dependencies.

Installation: pip install -e .

Related issues:
- Fixes figure generation errors with function-only variant
- Eliminates duplicate/conflicting dependency specifications

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Resolves #24

## Implementation

Added fairness-based loss thresholding to ensure fair comparison of variant models that converge to different final losses:

1. **New fairness module** (`llm_stylometry/analysis/fairness.py`):
   - `compute_fairness_threshold()`: Computes max of all models' min losses within 500 epochs
   - `apply_fairness_threshold()`: Truncates data at first epoch crossing threshold
   - Full documentation with examples

2. **Updated visualization functions**:
   - Added `apply_fairness` parameter (default: True) to all affected figures
   - Automatically applies fairness when a variant is specified and apply_fairness=True
   - Affected: all_losses, stripplot, heatmaps, mds, oz_losses
   - NOT affected: t_test figures (require all 500 epochs)

3. **CLI support**:
   - Added `--no-fairness` flag to disable fairness thresholding
   - Passes `apply_fairness=not args.no_fairness` to visualization functions

4. **Documentation**:
   - Comprehensive README section on fairness-based loss thresholding
   - CLI and Python API examples
   - Example results showing threshold computation and data reduction

## Algorithm

1. Compute fairness threshold: max(all models' min losses within 500 epochs)
2. For each model, find first epoch where training loss ≤ threshold
3. Truncate all data (train + eval) at that epoch for each model
4. All models are now compared at the same training loss level (see the sketch below)
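A minimal sketch of this algorithm, assuming a long-format pandas DataFrame with model, epoch, and train_loss columns (the actual schema in fairness.py may differ):

```python
import pandas as pd

def compute_fairness_threshold(df, max_epoch=500):
    # Max over models of each model's minimum training loss within 500 epochs.
    window = df[df["epoch"] <= max_epoch]
    return window.groupby("model")["train_loss"].min().max()

def apply_fairness_threshold(df, threshold):
    # Truncate each model at the first epoch whose training loss crosses
    # the threshold; all later epochs (train + eval rows) are dropped.
    kept = []
    for _, group in df.groupby("model"):
        group = group.sort_values("epoch")
        crossed = group[group["train_loss"] <= threshold]
        cutoff = crossed["epoch"].iloc[0] if not crossed.empty else group["epoch"].max()
        kept.append(group[group["epoch"] <= cutoff])
    return pd.concat(kept, ignore_index=True)
```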

## Example Results (function variant)

- Fairness threshold: 1.2720 (Austen's minimum loss)
- Models truncated between epochs 88-500
- Data reduced: 360,640 rows → 170,659 rows (47.3%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem:
- Austen models showed negative t-statistics in function-only variant
- Figures had hardcoded y_min=0, clipping negative values
- Austen was completely invisible in Figure 2A/2B function variant

Solution:
- Replace hardcoded y-axis limits with dynamic calculation (sketched after this list)
- Calculate y_min and y_max from actual data range
- Add 5% padding for better visualization
- Ensure threshold line (t=3.291) always visible
- Allow negative t-statistics when present in data
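A minimal sketch of the dynamic limit computation, folding in the NaN/Inf guards added later for issue #28 (the helper name and fallback values are illustrative):

```python
import numpy as np

def dynamic_ylimits(t_stats, threshold=3.291, pad_frac=0.05, fallback=(-1.0, 5.0)):
    values = np.asarray(t_stats, dtype=float)
    finite = values[np.isfinite(values)]      # drop NaN/Inf before computing limits
    if finite.size == 0:
        return fallback                       # nothing valid: sensible defaults
    y_min = finite.min()                      # may be negative (e.g., Austen)
    y_max = max(finite.max(), threshold)      # keep the t = 3.291 line visible
    pad = pad_frac * (y_max - y_min) or 1.0   # 5% padding; avoid a zero-width range
    y_min, y_max = y_min - pad, y_max + pad
    if not (np.isfinite(y_min) and np.isfinite(y_max)):
        return fallback                       # final validation
    return y_min, y_max
```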

Changes:
- llm_stylometry/visualization/t_tests.py:
  - Updated generate_t_test_figure() with dynamic y-limits
  - Updated generate_t_test_avg_figure() with dynamic y-limits
  - Fixed linting errors (unused imports, f-strings)

- tests/test_visualization_edge_cases.py:
  - Added comprehensive test suite (8 test classes)
  - All tests use REAL data (no mocks)
  - Tests cover: negative t-stats, y-limit correctness, baseline regression,
    threshold visibility, average figure, edge cases, CLI integration

- Regenerated figures:
  - paper/figs/source/t_test_function.pdf (Austen now visible!)
  - paper/figs/source/t_test_avg_function.pdf (Austen now visible!)
  - paper/figs/source/t_test.pdf (baseline - verified correct)
  - paper/figs/source/t_test_avg.pdf (baseline - verified correct)

Results:
- Function variant: Y-axis range [-4.35, 12.35] (was [0, ~60])
- Austen negative t-statistics now fully visible
- Baseline figures unchanged (regression test passed)
- All linting checks passed

Fixes #25

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Figure 5 (Oz authorship analysis) is only needed for baseline models,
not for variant models (content, function, POS). This commit implements
the skip functionality with clear user messaging.

Changes:
- oz_losses.py: Add check to skip Figure 5 for any variant, return None (sketched after this list)
- generate_figures.py: Handle None return from oz_losses_figure
- generate_figures.py: Conditionally include Figure 5 only for baseline
- generate_figures.py: Update file verification to skip Figure 5 for variants
- run_llm_stylometry.sh: Update help text to indicate Figure 5 is baseline-only
- CLAUDE.md: Document that Figure 5 is baseline-only
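A minimal sketch of the early-return pattern (the signature and message text are illustrative; only the skip-and-return-None behavior comes from this commit):

```python
def oz_losses_figure(data, variant=None, **kwargs):
    """Figure 5 (Oz authorship analysis) is baseline-only: any variant
    skips with a clear message and returns None."""
    if variant is not None:
        print(f"Skipping Figure 5: baseline-only analysis (variant={variant!r})")
        return None
    # ... build and save the 6-panel baseline figure, then return its path ...

# Callers such as generate_figures.py then handle the None return:
if oz_losses_figure({}, variant="function") is None:
    print("Figure 5 skipped; excluding it from file verification")
```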

Testing:
- Verified baseline Figure 5 generates correctly (6-panel figure)
- Verified variant Figure 5 skips with clear warning message
- Verified CLI commands work for both baseline and variants

Fixes #26
- Remove unused Path import (was causing linter warning)
- Remove dead code for variant suffix handling (variants now return early)

The variant suffix code is no longer needed since oz_losses_figure
now returns None immediately for any variant, before reaching the
save logic.
Updated two tests that were expecting Figure 5 to generate for variants:

1. test_fairness.py::test_oz_losses_figure_with_fairness
   - Now expects None return for variants (baseline-only analysis)
   - Verifies skip behavior instead of generation

2. test_interpretation_examples.py::test_workflow_figure_generation
   - Fixed incorrect flag usage (--pos-only → -pos)
   - Now expects skip message for variants
   - Only baseline should generate Figure 5

Both tests now pass and correctly verify the new behavior where
Figure 5 (Oz analysis) is skipped for variant models.
Saving current state before implementing fix for issue #27
(remote_train.sh variant flag passing bug).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Root cause: Quoted heredoc delimiter (<< 'ENDSSH') prevented variable
expansion, causing $VARIANT_ARG to be treated as literal string.

Changes:
- Changed << 'ENDSSH' to << ENDSSH to allow variable expansion
- Escaped remote shell variables ($KILL_MODE, $pid, $LOG_FILE, etc.)
- Lines 176-177 now correctly expand $RESUME_MODE and $VARIANT_ARG
- Added debug logging to display variable values in training script

Technical details:
- Unquoted heredoc delimiter allows local variables to expand
- Remote-side variables escaped with a leading backslash (\$) to prevent premature local expansion
- $VARIANT_ARG now contains "-co", "-fo", or "-pos" (not empty)

Testing: Bash syntax validation passes

Related: #27

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added .ssh/credentials.json to .gitignore to prevent accidental
commit of sensitive server credentials.

The credentials file has been created with 600 permissions
(read/write for owner only) for additional security.

Related: #27

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Created test suite for issue #27 with 10 integration tests:
1. Content-only variant flag passing
2. All variant types (content, function, POS)
3. Baseline (no variant) training
4. Resume mode with variants
5. Kill mode functionality
6. Debug logging verification
7. Script file permissions check
8. Repository update path
9. Screen session naming
10. Variable expansion verification

Test infrastructure:
- conftest.py: SSH fixtures using real connections
- Loads credentials from .ssh/credentials.json
- Uses sshpass for password authentication
- Creates unique test workspaces on remote server
- Comprehensive cleanup after each test

NO MOCKS - All tests use real SSH connections to a real GPU server

Related: #27

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Documentation updates for issue #27:

1. test_documentation.py:
   - Added comprehensive test for variant flag documentation
   - Verifies all long-form and short-form flags are documented
   - Tests that usage information is displayed

2. README.md:
   - Added troubleshooting section for remote_train.sh
   - Includes verification commands for variant flags
   - Documents connection and screen session debugging

All documentation tested and verified.

Note: CLAUDE.md also updated locally with detailed implementation
guidance for remote_train.sh (not committed - in .gitignore).

Related: #27

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Root cause: torch.load() with map_location='cuda:N' moves tensors to CUDA,
but torch.set_rng_state() requires CPU tensors.

Changes:
- Move torch_random_state to CPU before calling set_rng_state() (see the sketch below)
- Move cuda_random_state elements to CPU (set_rng_state_all handles placement)
- Add try-except fallback: if RNG restoration fails, log warning and continue
- This allows training to resume without losing model/optimizer state
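A minimal sketch of the restore logic (the checkpoint keys and helper name are assumptions based on the variable names in this commit):

```python
import torch

def restore_rng_state(checkpoint):
    """torch.load(..., map_location='cuda:N') may leave the saved RNG
    states on the GPU, but torch.set_rng_state() needs CPU ByteTensors."""
    try:
        torch.set_rng_state(checkpoint["torch_random_state"].cpu())
        cuda_states = [s.cpu() for s in checkpoint["cuda_random_state"]]
        torch.cuda.set_rng_state_all(cuda_states)  # handles device placement
    except Exception as err:
        # Model/optimizer state is already restored; losing the RNG state
        # only sacrifices determinism, so warn and keep training.
        print(f"Warning: failed to restore RNG state ({err}); continuing.")
```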

Benefits:
- Resume mode now works correctly
- Graceful fallback if RNG restoration fails for any reason
- Preserves deterministic training when possible

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implemented comprehensive fix for t-test visualization ValueError when
computing axis limits with NaN/Inf t-statistics:

**Core Fix in t_tests.py:**
- Added data validation in calculate_t_statistics():
  * Check for n>=2 samples per group before running t-test
  * Log insufficient data cases (0<n<2) and missing data (n=0)
  * Append NaN for invalid cases instead of computing t-test

- Added robust axis limit calculation in both functions:
  * Filter out NaN/Inf values before computing limits
  * Add padding for visualization clarity
  * Ensure threshold line (p<0.001) is always visible
  * Final validation to ensure finite and valid limits
  * Fallback to sensible defaults (-1.0 to 5.0)

**Comprehensive Test Suite:**
- Created tests/test_t_test_edge_cases.py with 11 tests:
  1. All NaN t-statistics (single sample per group)
  2. Mixed NaN and valid data
  3. All Infinite t-statistics (zero variance)
  4. Mixed Infinite and valid data
  5. Empty data groups (missing epochs)
  6. Extreme outliers
  7. Very small sample sizes (n=2, minimum required)
  8. Average figure with all NaN
  9. Average figure with mixed data
  10. Logging verification
  11. Normal data baseline

All tests pass with real data (no mocks).

Addresses issue #28

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@jeremymanning jeremymanning merged commit 70253d2 into ContextLab:main Oct 16, 2025
jeremymanning referenced this pull request in jeremymanning/llm-stylometry Oct 20, 2025
Fix issue #28: Handle NaN/Inf in t-test visualization axis limits
jeremymanning added a commit that referenced this pull request Dec 10, 2025
Fix issue #28: Handle NaN/Inf in t-test visualization axis limits