# Feature Dictionary Tutorial

This notebook provides an interactive tutorial for the Feature Dictionary library.

## Overview

Feature Dictionary is a library for analyzing neural network activations:
- Extract interpretable features using dictionary learning
- Identify meaningful directions in activation space
- Visualize and understand learned representations

## Installation

```bash
pip install feature-dictionary
```

In [None]:
# Import the library
from feature_dictionary import (
    FeatureExtractor,
    DictionaryLearner,
    ExtractorConfig,
)

print("Feature Dictionary loaded successfully!")

## Basic Feature Extraction

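Start by creating synthetic activations with two planted features. Because we know where these features live, later sections can check whether the learned dictionary recovers them.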

In [None]:
import numpy as np

# Create sample activations (simulating neural network hidden states)
np.random.seed(42)
n_samples = 1000
hidden_dim = 256

# Simulate activations with some structure
activations = np.random.randn(n_samples, hidden_dim) * 0.1

# Plant two known features to recover later
# Feature 1: dimensions 0-9 are shifted by +2.0 for "positive" samples (about half)
positive_mask = np.random.rand(n_samples) > 0.5
activations[positive_mask, 0:10] += 2.0

# Feature 2: dimensions 20-29 are shifted by +1.5 for "large" samples (about 30%)
large_mask = np.random.rand(n_samples) > 0.7
activations[large_mask, 20:30] += 1.5

print(f"Activations shape: {activations.shape}")
print(f"Positive samples: {positive_mask.sum()}")
print(f"Large samples: {large_mask.sum()}")
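
Before fitting anything, it can help to confirm that the planted structure is visible in the raw data. The sketch below rebuilds the same kind of synthetic activations with NumPy only and compares mean activation between masked and unmasked samples. The helper `mean_gap`, the seed, and the control block `100:110` are illustrative choices, not part of the library.

```python
import numpy as np

# Assumption: a fresh Generator with an arbitrary seed, independent of the cell above
rng = np.random.default_rng(0)
n_samples, hidden_dim = 1000, 256

# Low-variance noise plus two planted blocks, mirroring the tutorial data
acts = rng.normal(scale=0.1, size=(n_samples, hidden_dim))
positive = rng.random(n_samples) > 0.5
large = rng.random(n_samples) > 0.7
acts[positive, 0:10] += 2.0
acts[large, 20:30] += 1.5


def mean_gap(block, mask):
    """Mean activation of masked samples minus mean of unmasked samples."""
    return block[mask].mean() - block[~mask].mean()


gap_positive = mean_gap(acts[:, 0:10], positive)    # planted shift of 2.0
gap_large = mean_gap(acts[:, 20:30], large)         # planted shift of 1.5
gap_control = mean_gap(acts[:, 100:110], positive)  # untouched dimensions

print(f"Gap on dims 0-9 (positive):  {gap_positive:.3f}")
print(f"Gap on dims 20-29 (large):   {gap_large:.3f}")
print(f"Gap on dims 100-109 (control): {gap_control:.3f}")
```

The first two gaps should sit close to the planted shifts, while the control gap should be close to zero.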

## Dictionary Learning

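Fit a dictionary of 64 features to the activations. Dictionary learning approximates each activation vector as a sparse combination of learned directions, so that only a few features are active for any given sample.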

In [None]:
# Create a dictionary learner
learner = DictionaryLearner(
    n_features=64,  # Number of dictionary features to learn
    sparsity=0.1,   # Target sparsity
)

# Learn the dictionary
dictionary = learner.fit(activations)

print(f"Dictionary shape: {dictionary.shape}")
print(f"Learned {dictionary.n_features} features")
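
To make the idea concrete, here is a minimal NumPy sketch of one common way to do dictionary learning: alternate between sparse coding (ISTA on an L1-penalized least-squares objective) and a least-squares dictionary update with renormalized atoms. This is an illustration under stated assumptions, not the algorithm `DictionaryLearner` uses internally. The function names, the penalty `alpha`, and the iteration counts are all hypothetical.

```python
import numpy as np


def soft_threshold(x, t):
    """Shrink values toward zero by t; entries within [-t, t] become exactly 0."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def sparse_codes(X, D, alpha, n_steps=50):
    """ISTA for min_Z 0.5 * ||X - Z D||_F^2 + alpha * ||Z||_1, with D of shape (k, d)."""
    step_bound = np.linalg.norm(D @ D.T, 2)  # Lipschitz constant of the gradient
    Z = np.zeros((X.shape[0], D.shape[0]))
    for _ in range(n_steps):
        grad = (Z @ D - X) @ D.T
        Z = soft_threshold(Z - grad / step_bound, alpha / step_bound)
    return Z


def learn_dictionary(X, n_atoms, alpha=0.1, n_rounds=10, seed=0):
    """Alternate sparse coding and least-squares dictionary updates."""
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(n_atoms, X.shape[1]))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    for _ in range(n_rounds):
        Z = sparse_codes(X, D, alpha)
        # Dictionary update: least-squares solution of Z D = X
        D = np.linalg.lstsq(Z, X, rcond=None)[0]
        # Re-initialize atoms that no sample used, then rescale rows to unit norm
        dead = np.linalg.norm(D, axis=1) < 1e-10
        D[dead] = rng.normal(size=(int(dead.sum()), X.shape[1]))
        D /= np.linalg.norm(D, axis=1, keepdims=True)
    return D, sparse_codes(X, D, alpha)


# Small demo on data with two planted blocks (sizes chosen to run quickly)
rng = np.random.default_rng(1)
X = rng.normal(scale=0.1, size=(200, 32))
X[rng.random(200) > 0.5, 0:4] += 2.0
X[rng.random(200) > 0.7, 10:14] += 1.5

D, Z = learn_dictionary(X, n_atoms=8)
rel_err = np.mean((X - Z @ D) ** 2) / np.mean(X ** 2)
print(f"Dictionary shape: {D.shape}, codes shape: {Z.shape}")
print(f"Relative reconstruction error: {rel_err:.2%}")
print(f"Fraction of zero codes: {(Z == 0).mean():.2%}")
```

The L1 penalty is what produces exact zeros in the codes; a larger `alpha` trades reconstruction accuracy for sparser codes.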

## Feature Analysis

Analyze the learned features to understand what they represent.

In [None]:
# Create feature extractor
extractor = FeatureExtractor(dictionary)

# Extract features from activations
features = extractor.extract(activations)

print(f"Extracted features shape: {features.shape}")
print(f"Average sparsity: {(features == 0).mean():.2%}")
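
The sparsity figure above counts entries that are exactly zero. If an extractor returns very small but nonzero values, that count understates how sparse the features effectively are. The sketch below shows a tolerance-based variant on a tiny hand-written matrix. The `sparsity` helper and the tolerance `1e-6` are illustrative assumptions, not library functions.

```python
import numpy as np


def sparsity(codes, tol=0.0):
    """Fraction of entries whose magnitude is at most tol."""
    return float((np.abs(codes) <= tol).mean())


# Toy codes: five exact zeros and two near-zero values out of nine entries
codes = np.array([
    [0.0, 1e-9, 0.7],
    [0.0, 0.0, -1e-8],
    [0.3, 0.0, 0.0],
])

exact = sparsity(codes)                 # counts exact zeros only
tolerant = sparsity(codes, tol=1e-6)    # also counts the near-zero entries

print(f"Exact-zero sparsity: {exact:.2%}")
print(f"Sparsity with tolerance 1e-6: {tolerant:.2%}")
```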

In [None]:
# Rank features by how often they fire
print("\nTop Features by Activation Frequency:")
print("-" * 40)

# A feature counts as active on a sample when its value exceeds 0.1
activation_freq = (features > 0.1).mean(axis=0)
top_indices = np.argsort(activation_freq)[::-1][:10]

for idx in top_indices:
    print(f"  Feature {idx}: {activation_freq[idx]:.2%} activation rate")

## Feature Correlation Analysis

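Check whether any learned feature tracks the planted "positive" concept. A feature that has captured it should show a correlation with `positive_mask` close to 1 or -1.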

In [None]:
# Correlate features with known labels
print("Feature Correlations with 'Positive' concept:")
print("-" * 40)

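correlations = []
labels = positive_mask.astype(float)
for i in range(features.shape[1]):
    # A feature that never varies has an undefined (NaN) correlation; skip it
    if features[:, i].std() == 0:
        continue
    corr = np.corrcoef(features[:, i], labels)[0, 1]
    correlations.append((i, corr))

# Sort by absolute correlation
correlations.sort(key=lambda x: abs(x[1]), reverse=True)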

for idx, corr in correlations[:5]:
    print(f"  Feature {idx}: correlation = {corr:.3f}")
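
For larger feature matrices, the per-column loop can be replaced by a single vectorized computation. The sketch below computes the Pearson correlation of every feature column with a label vector at once and assigns 0.0 to constant columns, whose correlation is undefined. The function name and the demo data are hypothetical.

```python
import numpy as np


def feature_label_correlations(features, labels):
    """Pearson correlation of each feature column with labels.

    Constant columns (zero variance) get 0.0 instead of NaN.
    """
    f = features - features.mean(axis=0)
    y = labels.astype(float) - labels.mean()
    cov = f.T @ y
    denom = np.sqrt((f ** 2).sum(axis=0) * (y ** 2).sum())
    out = np.zeros(features.shape[1])
    np.divide(cov, denom, out=out, where=denom > 0)
    return out


# Demo: one column that tracks the label, one constant column, one noise column
rng = np.random.default_rng(0)
labels = rng.random(500) > 0.5
demo_features = np.column_stack([
    labels + 0.1 * rng.normal(size=500),
    np.ones(500),
    rng.normal(size=500),
])

corr = feature_label_correlations(demo_features, labels)
for idx in np.argsort(-np.abs(corr)):
    print(f"  Feature {idx}: correlation = {corr[idx]:.3f}")
```

The same function can be called once per concept, for example with the "large" mask, without rewriting the loop.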

## Configuration Options

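`ExtractorConfig` collects the extraction hyperparameters in one object. The cell below only constructs and prints a configuration; it does not pass it to a learner or extractor.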

In [None]:
# Custom configuration
config = ExtractorConfig(
    n_features=128,
    sparsity=0.05,
    learning_rate=0.01,
    batch_size=64,
    n_iterations=1000,
)

print(f"Config: {config}")

## Reconstruction Quality

Evaluate how well the dictionary can reconstruct the original activations.

In [None]:
# Reconstruct activations
reconstructed = extractor.reconstruct(features)

# Reconstruction error: MSE, and MSE relative to the mean squared activation
mse = np.mean((activations - reconstructed) ** 2)
relative_error = mse / np.mean(activations ** 2)

print(f"Reconstruction MSE: {mse:.6f}")
print(f"Relative Error: {relative_error:.2%}")
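
The relative error above is normalized by the mean squared activation, so it can look small whenever activations have a large mean, even if the reconstruction only captures that mean. A complementary metric is the fraction of variance explained, which compares against a "predict the per-dimension mean" baseline. The sketch below is a NumPy-only illustration. The function name and the demo data are assumptions, not part of the library.

```python
import numpy as np


def reconstruction_metrics(original, reconstructed):
    """Return MSE, relative error, and fraction of variance explained."""
    residual = original - reconstructed
    centered = original - original.mean(axis=0)
    mse = float(np.mean(residual ** 2))
    return {
        "mse": mse,
        "relative_error": mse / float(np.mean(original ** 2)),
        "variance_explained": 1.0 - float(np.sum(residual ** 2) / np.sum(centered ** 2)),
    }


# Demo: activations with a large offset, so the mean dominates the energy
rng = np.random.default_rng(0)
original = 5.0 + rng.normal(size=(500, 16))

# Baseline that reconstructs every sample as the per-dimension mean
mean_only = np.broadcast_to(original.mean(axis=0), original.shape)

baseline = reconstruction_metrics(original, mean_only)
perfect = reconstruction_metrics(original, original)

print(f"Mean-only baseline: {baseline}")
print(f"Perfect reconstruction: {perfect}")
```

In this demo the mean-only baseline has a low relative error yet explains none of the variance, which is why reporting both numbers gives a fuller picture.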

## Conclusion

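This tutorial covered the main steps of a Feature Dictionary workflow:
- Learning a feature dictionary from activations with `DictionaryLearner`
- Extracting sparse feature representations with `FeatureExtractor`
- Relating learned features to known concepts through activation frequency and correlation
- Measuring how well the dictionary reconstructs the original activations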

For more examples, see the `examples/` directory in the repository.