# Exploring Neural Circuits in GPT-2 using Clustering and Dimensionality Reduction
In this notebook, we analyze the inner workings, or "circuits," of GPT-2, a Transformer language model, by examining its hidden states and attention patterns. We'll use **Principal Component Analysis (PCA)** and **K-Means clustering** to reduce the dimensionality of the per-token hidden states and group them into clusters, which may correspond to functional circuits in the model.

---

## Step 1: Imports and Model Setup
We start by importing the necessary libraries and loading the pre-trained GPT-2 model and tokenizer. We put the model in evaluation mode, which disables dropout so that repeated runs on the same input give the same outputs.


In [3]:
# Import libraries for model, clustering, and visualization
import torch
from transformers import GPT2Model, GPT2Tokenizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import os

# Load a pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = GPT2Model.from_pretrained(model_name, output_attentions=True, output_hidden_states=True)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the model to evaluation mode
model.eval()


GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2SdpaAttention(
        (c_attn): Conv1D(nf=2304, nx=768)
        (c_proj): Conv1D(nf=768, nx=768)
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D(nf=3072, nx=768)
        (c_proj): Conv1D(nf=768, nx=3072)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)

## Step 2: Function to Capture Model Outputs
We'll define a helper function, `get_model_outputs`, that processes an input sentence and extracts **attention weights** and **hidden states** from each layer of the model.

- **Attention weights** show, for each layer and head, how strongly each token attends to itself and to the tokens before it.
- **Hidden states** represent the internal embeddings at each layer, capturing semantic and syntactic information about the sentence.
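To make the attention output more concrete, the sketch below builds a toy causal attention matrix with NumPy. It is an illustration under assumptions, not GPT-2's actual computation: the query and key matrices are random, the sizes (`seq_len = 4`, `head_dim = 8`) are arbitrary, and only a single head is shown. What carries over is the structure of the result: each row is one token's attention distribution over itself and earlier tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, head_dim = 4, 8  # hypothetical sizes, chosen for readability

# Random stand-ins for one head's query and key projections
queries = rng.normal(size=(seq_len, head_dim))
keys = rng.normal(size=(seq_len, head_dim))

# Scaled dot-product scores, shape (seq_len, seq_len)
scores = queries @ keys.T / np.sqrt(head_dim)

# Causal mask: token i may only attend to positions j <= i
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

# Row-wise softmax turns scores into attention weights
scores = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True)

print(np.round(weights, 2))
```

In the notebook, each entry of `attentions` has shape `(batch, n_heads, seq_len, seq_len)`, so a slice such as `attentions[layer][0, head]` has this same lower-triangular, row-normalized structure.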


In [4]:
# Function to capture activations and attentions for a given sentence
def get_model_outputs(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():  # Inference only, so skip gradient tracking
        outputs = model(**inputs)
    attentions = outputs.attentions  # Tuple with one attention tensor per layer
    hidden_states = outputs.hidden_states  # Tuple: embedding output, then one tensor per layer
    return attentions, hidden_states


## Step 3: Dimensionality Reduction and Clustering
To look for candidate circuits, we'll reduce the dimensionality of the hidden states for a selected layer using **Principal Component Analysis (PCA)** and then cluster the result with **K-Means**. Each row of the data is one token's hidden state, so the clusters group tokens that the layer represents similarly, which may point to specific roles in processing language.

### Function: `find_circuits`
- **PCA** reduces the dimensionality of the hidden states, which simplifies clustering and later visualization.
- **K-Means clustering** groups the token representations into clusters, which may correspond to circuits or functional groups.
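The sketch below runs the same two steps on synthetic data so the shapes are easy to follow. The matrix is random and merely stands in for one layer's hidden states (10 tokens by 768 dimensions, both assumed for illustration). Each row is a token, so K-Means assigns one label per token, and PCA can return at most `min(n_tokens, hidden_dim)` components.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_tokens, hidden_dim = 10, 768  # stand-in for one sentence at one layer
activations = rng.normal(size=(n_tokens, hidden_dim))

# PCA cannot produce more components than min(n_samples, n_features)
n_components = min(10, n_tokens, hidden_dim)
pca = PCA(n_components=n_components)
reduced = pca.fit_transform(activations)

# One cluster label per row, i.e. per token
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)

print(reduced.shape)  # (n_tokens, n_components)
print(labels.shape)   # (n_tokens,)
print(pca.explained_variance_ratio_.sum())
```

Fixing `random_state` is a choice made here for repeatability; without it, K-Means may return different labels from run to run.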


In [5]:
# Apply PCA and K-means to find circuits
def find_circuits(hidden_states, layer_index, n_components=10, n_clusters=5):
    """
    Apply PCA and K-means clustering on hidden states for a given layer.
    
    Parameters:
        hidden_states (tuple): Hidden state tensors (embedding output, then one per layer).
        layer_index (int): Index of the layer to analyze.
        n_components (int): Desired number of PCA components.
        n_clusters (int): Number of clusters to find with K-means.
        
    Returns:
        cluster_labels (np.array): Cluster assignment for each token in the sentence.
        reduced_activations (np.array): PCA-reduced activations, shape (seq_len, n_components).
    """
    layer_activations = hidden_states[layer_index].squeeze(0).detach().cpu().numpy()  # Shape: (seq_len, hidden_dim)

    # Ensure a 2D (n_tokens, hidden_dim) array; this is a no-op for a single sentence
    flattened_activations = layer_activations.reshape(-1, layer_activations.shape[-1])

    # Set n_components dynamically based on data size
    n_components = min(n_components, flattened_activations.shape[0], flattened_activations.shape[1])

    # Dimensionality reduction using PCA
    pca = PCA(n_components=n_components)
    reduced_activations = pca.fit_transform(flattened_activations)

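    # Clustering with KMeans (n_clusters cannot exceed the number of tokens)
    n_clusters = min(n_clusters, reduced_activations.shape[0])
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    cluster_labels = kmeans.fit_predict(reduced_activations)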

    # Print cluster information
    print(f"Layer {layer_index}: Circuit Clusters Distribution - {np.bincount(cluster_labels)}")
    return cluster_labels, reduced_activations


## Step 4: Visualizing the Clusters with t-SNE
The hidden states are clustered into groups, and **t-SNE** is used to visualize these clusters in a 2D space. This visualization can reveal if certain words or parts of sentences activate specific neural circuits within the model.

### Function: `visualize_clusters`
This function applies t-SNE to the PCA-reduced activations to project them into 2D, then plots the points colored by cluster and saves the figure to disk.
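One practical constraint is worth spelling out: scikit-learn's `TSNE` requires `perplexity` to be strictly less than the number of samples, and a single sentence yields only a handful of tokens. The sketch below shows that clamp on synthetic data. The array is random, and the helper name `safe_perplexity` is hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE

def safe_perplexity(n_samples, preferred=30):
    """Return a perplexity strictly below n_samples (hypothetical helper)."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    return min(preferred, n_samples - 1)

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 10))  # stand-in for PCA-reduced activations

perplexity = safe_perplexity(points.shape[0])
tsne = TSNE(n_components=2, random_state=0, perplexity=perplexity)
embedding = tsne.fit_transform(points)

print(perplexity, embedding.shape)
```

With so few points, the layout can depend heavily on the perplexity and the random seed, so the resulting plots are best read as a rough guide rather than firm evidence of structure.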


In [None]:
def visualize_clusters(reduced_activations, cluster_labels, sentence, layer_name):
    # Dynamically set perplexity to be less than the number of samples
    n_samples = reduced_activations.shape[0]
    perplexity = min(30, n_samples - 1)  # Ensure perplexity is less than n_samples

    tsne = TSNE(n_components=2, random_state=0, perplexity=perplexity)
    tsne_result = tsne.fit_transform(reduced_activations)
    
    plt.figure(figsize=(8, 6))
    scatter = plt.scatter(tsne_result[:, 0], tsne_result[:, 1], c=cluster_labels, cmap='viridis')
    plt.colorbar(scatter, label="Cluster")
    plt.title(f"Cluster Visualization for {layer_name}\nSentence: {sentence}")
    plt.xlabel("t-SNE Component 1")
    plt.ylabel("t-SNE Component 2")

    # Create a filename based on the layer and a shortened version of the sentence
    short_sentence = "_".join(sentence.split()[:5])  # Use the first 5 words as part of the filename
    filename = f"{layer_name}_{short_sentence}.png"
    
    # Save the plot
    output_dir = "cluster_visualizations"
    os.makedirs(output_dir, exist_ok=True)  # Create directory if it doesn't exist
    filepath = os.path.join(output_dir, filename)
    plt.savefig(filepath)
    plt.close()  # Close the figure to free up memory
    print(f"Saved cluster visualization to {filepath}")


## Step 5: Analyzing Multiple Sentences
To observe how circuits activate for different inputs, we'll analyze several sentences. We’ll apply the functions defined above to:
1. Extract hidden states and attentions.
2. Perform PCA and clustering on selected layers.
3. Visualize clusters in 2D using t-SNE.

This allows us to compare the circuits activated by different sentence structures.


In [7]:
# Test multiple sentences to compare circuit activation
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "She sells sea shells by the sea shore.",
    "Artificial intelligence is transforming the world."
]

# Analyze each sentence
for sentence in sentences:
    print(f"Analyzing sentence: '{sentence}'")
    # Get model outputs for the current sentence
    attentions, hidden_states = get_model_outputs(sentence)

    # Analyze circuits for a specific layer
    layer_index = 10  # Index 0 is the embedding output, so 10 is the output of the 10th block
    cluster_labels, reduced_activations = find_circuits(hidden_states, layer_index)
    print(f"Cluster labels for layer {layer_index}: {cluster_labels}\n")

    # Visualize clusters
    visualize_clusters(reduced_activations, cluster_labels, sentence, f"Layer {layer_index}")




Analyzing sentence: 'The quick brown fox jumps over the lazy dog.'
Layer 10: Circuit Clusters Distribution - [1 1 2 2 4]
Cluster labels for layer 10: [1 2 2 4 4 3 3 0 4 4]

Saved cluster visualization to cluster_visualizations/Layer 10_The_quick_brown_fox_jumps.png
Analyzing sentence: 'She sells sea shells by the sea shore.'
Layer 10: Circuit Clusters Distribution - [3 1 2 2 1]
Cluster labels for layer 10: [1 0 4 0 3 3 2 2 0]

Saved cluster visualization to cluster_visualizations/Layer 10_She_sells_sea_shells_by.png
Analyzing sentence: 'Artificial intelligence is transforming the world.'
Layer 10: Circuit Clusters Distribution - [1 2 2 1 2]
Cluster labels for layer 10: [0 2 2 4 1 1 3 4]

Saved cluster visualization to cluster_visualizations/Layer 10_Artificial_intelligence_is_transforming_the.png
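
The label arrays are easier to read once they are paired with tokens. The sketch below does this for the first sentence, using the labels printed in the output for layer 10. The token strings are written out by hand on the assumption that each word, plus the final period, is a single GPT-2 token (which is consistent with the 10 labels); in the notebook they could instead be obtained with `tokenizer.convert_ids_to_tokens`. Because K-Means starts from a random initialization, a rerun may produce different labels.

```python
from collections import defaultdict

# Hand-written tokens (assumption: one token per word, plus the period)
tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
# Cluster labels copied from the layer 10 output for this sentence
cluster_labels = [1, 2, 2, 4, 4, 3, 3, 0, 4, 4]

# Collect the tokens that share each cluster label
clusters = defaultdict(list)
for token, label in zip(tokens, cluster_labels):
    clusters[label].append(token)

for label in sorted(clusters):
    print(f"Cluster {label}: {clusters[label]}")
```

Grouping tokens this way makes it possible to check by eye whether a cluster lines up with something interpretable, such as part of speech or position in the sentence.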


### Some Ideas

Now that you've visualized neural circuits for different sentences, here are some ideas to deepen your analysis:

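**Compare Layers**: Try analyzing different layers (e.g., `layer_index = 5` or `layer_index = 11`) to see how circuits differ across layers. For this model, `hidden_states` has 13 entries: index 0 is the embedding output and indices 1 to 12 are the outputs of the 12 blocks. Early layers might focus on simpler linguistic structures, while deeper layers might capture more complex relationships.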

**Adjust Clustering Parameters**: Experiment with the number of clusters (`n_clusters`) and PCA components (`n_components`) to see how the token clusters change. Does changing `n_clusters` result in more meaningful or interpretable circuits?
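One way to compare settings, offered here as a suggestion rather than as part of the analysis above, is the silhouette score from `sklearn.metrics`, which is higher when clusters are compact and well separated. The sketch below uses synthetic 2D points with three planted groups; real hidden states may give much less clear-cut scores.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic groups of 20 points each
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Score each candidate number of clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k:", best_k)
```

Note that `silhouette_score` needs at least 2 clusters and fewer clusters than samples, so for a 10-token sentence only a small range of `n_clusters` values can be compared.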

**Analyze Sentence Structure Effects**: Test sentences with different structures or topics. For example:
   - Short versus long sentences.
   - Questions versus statements.
   - Sentences with technical vocabulary versus common language.

**Visualize Attention Patterns**: Although this notebook focuses on hidden states, attention patterns are another valuable part of interpretability. Explore the `attentions` output to understand how the model distributes attention across tokens.

**Use Other Models**: Swap GPT-2 for other transformer models available in the `transformers` library (e.g., BERT or T5) to see if their circuits activate differently for the same sentences. To do so, replace `GPT2Model` and `GPT2Tokenizer` with the corresponding classes for the new model. Note that GPT-3's weights have not been publicly released, so it cannot be loaded this way.

**Experiment with Dimensionality Reduction**: Instead of PCA, try other dimensionality reduction techniques like `UMAP` or adjust the perplexity in t-SNE to see how the visualization changes.

Enjoy exploring!
