# Chapter 1: Methodology

## Research Design

This study employs a controlled experimental design to compare the statistical properties of LLM outputs generated through two different methods:

1. **Batch Generation**: Using the `n` parameter to generate multiple completions in a single API call
2. **Sequential Generation**: Making separate API calls for each completion

We hypothesize that the two methods can produce statistically different output distributions, which would violate the common assumption that LLM outputs are independent and identically distributed (i.i.d.).
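The two collection methods can be sketched as follows. This is a minimal illustration, not the study's actual client code: `generate(prompt, n)` is a hypothetical callable standing in for whichever provider wrapper is used, and `fake_generate` is a stand-in backend that returns random integers so the sketch runs offline.

```python
import random
from typing import Callable, List

# Hypothetical signature: generate(prompt, n) -> list of n completion strings.
Generate = Callable[[str, int], List[str]]


def batch_generation(generate: Generate, prompt: str, n: int) -> List[str]:
    """One call that requests n completions at once."""
    return generate(prompt, n)


def sequential_generation(generate: Generate, prompt: str, n: int) -> List[str]:
    """n separate calls, each requesting a single completion."""
    return [generate(prompt, 1)[0] for _ in range(n)]


def fake_generate(prompt: str, n: int) -> List[str]:
    """Stand-in backend for illustration only: random integers as text."""
    return [str(random.randint(1, 100)) for _ in range(n)]


prompt = "Pick a random number between 1 and 100."
batch_outputs = batch_generation(fake_generate, prompt, 5)
sequential_outputs = sequential_generation(fake_generate, prompt, 5)
print("batch:     ", batch_outputs)
print("sequential:", sequential_outputs)
```

Keeping both methods behind the same callable means the only thing that differs between the two arms is how many completions each call requests.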

## Experimental Setup

### Models Tested
- OpenAI GPT-4o-mini
- OpenAI GPT-4
- Google Gemini Pro (using `candidateCount` parameter)

### Test Prompts
We use three categories of prompts to test different aspects of model behavior:

1. **Random Number Generation**: "Pick a random number between 1 and 100."
2. **Classification Tasks**: Binary sentiment analysis
3. **Creative Generation**: Short story beginnings
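The numeric tests described later need each free-text reply to the random number prompt converted to an integer. The source does not say how replies are parsed, so the helper below is only one plausible approach: `parse_number` is a hypothetical name, and taking the first in-range integer is an assumption.

```python
import re
from typing import Optional


def parse_number(response: str, low: int = 1, high: int = 100) -> Optional[int]:
    """Return the first integer in [low, high] found in the response, else None."""
    for match in re.findall(r"-?\d+", response):
        value = int(match)
        if low <= value <= high:
            return value
    return None  # no usable number; the caller decides whether to drop or retry


examples = [
    "42",
    "Sure! I pick 73.",
    "I cannot choose.",
    "The number is 250, or maybe 7",
]
parsed = [parse_number(text) for text in examples]
print(parsed)
```

Replies that yield `None` should be counted and reported rather than silently discarded, since refusal rates may themselves differ between the two generation methods.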

In [None]:
import sys
import os
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

from experiments import NParameterExperiment, ExperimentConfig
from analysis import IndependenceAnalyzer
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

## Statistical Tests

We employ multiple statistical tests to assess independence and distributional equivalence:

### 1. Kolmogorov-Smirnov Test
Tests whether two samples come from the same distribution.

$$D_{n,m} = \sup_x |F_{1,n}(x) - F_{2,m}(x)|$$

where $F_{1,n}$ and $F_{2,m}$ are the empirical distribution functions.

In [None]:
def demonstrate_ks_test():
    """Demonstrate KS test on synthetic data."""
    from scipy.stats import ks_2samp
    
    # Same distribution
    sample1 = np.random.normal(50, 10, 100)
    sample2 = np.random.normal(50, 10, 100)
    
    ks_stat, p_value = ks_2samp(sample1, sample2)
    print(f"Same distribution: KS={ks_stat:.4f}, p={p_value:.4f}")
    
    # Different distributions
    sample3 = np.random.normal(55, 10, 100)
    ks_stat, p_value = ks_2samp(sample1, sample3)
    print(f"Different distributions: KS={ks_stat:.4f}, p={p_value:.4f}")

demonstrate_ks_test()
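To connect the formula for $D_{n,m}$ to the library call, the statistic can also be computed directly from the two empirical distribution functions. The sketch below uses a hypothetical helper `ks_statistic` on synthetic data and compares it with `scipy.stats.ks_2samp`.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # Both ECDFs are step functions that only change at observed points,
    # so evaluating them on the pooled sample is enough to find the supremum.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


rng = np.random.default_rng(0)
x = rng.normal(50, 10, 100)
y = rng.normal(55, 10, 120)

manual = ks_statistic(x, y)
reference = ks_2samp(x, y).statistic
print(f"manual D = {manual:.4f}, scipy D = {reference:.4f}")
```

Note that the KS test assumes continuous data; for integer replies with many ties, the reported p-value should be read as approximate.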

### 2. Position Effects Analysis

We test whether the position within a batch affects the output distribution. This is critical because prior research suggests potential within-batch dependencies.

In [None]:
def visualize_position_effects():
    """Visualize how position effects might manifest."""
    # Simulate data with position effects
    n_batches = 20
    batch_size = 5
    
    # Create data where earlier positions tend higher
    # (offset of +6 at position 0, falling to -6 at position 4)
    data = []
    for batch in range(n_batches):
        for position in range(batch_size):
            # Add position-dependent bias
            value = 50 + (2 - position) * 3 + np.random.normal(0, 5)
            data.append({
                'batch': batch,
                'position': position,
                'value': value
            })
    
    df = pd.DataFrame(data)
    
    # Plot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    
    # Box plot by position
    df.boxplot(column='value', by='position', ax=ax1)
    ax1.set_title('Value Distribution by Position')
    ax1.set_xlabel('Position in Batch')
    ax1.set_ylabel('Value')
    
    # Mean value by position
    position_means = df.groupby('position')['value'].mean()
    ax2.plot(position_means.index, position_means.values, 'o-')
    ax2.set_title('Mean Value by Position')
    ax2.set_xlabel('Position in Batch')
    ax2.set_ylabel('Mean Value')
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()
    
    return df

df_position = visualize_position_effects()
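The plots above show a position effect visually; a formal test is also needed. One option, shown here as a sketch rather than the study's chosen procedure, is a Kruskal-Wallis H test across positions using `scipy.stats.kruskal`. The helper name `position_effect_test` and the synthetic data are illustrative assumptions. This test ignores which batch each value came from, so a blocked alternative such as the Friedman test may be more appropriate when batches differ in level.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal


def position_effect_test(df, value_col="value", position_col="position"):
    """Kruskal-Wallis H test of equal distributions across batch positions."""
    groups = [g[value_col].to_numpy() for _, g in df.groupby(position_col)]
    result = kruskal(*groups)
    return result.statistic, result.pvalue


# Synthetic data with the same position-dependent offset as the simulation above
rng = np.random.default_rng(0)
rows = [
    {"batch": b, "position": p, "value": 50 + (2 - p) * 3 + rng.normal(0, 5)}
    for b in range(20)
    for p in range(5)
]
df_demo = pd.DataFrame(rows)

h_stat, p_value = position_effect_test(df_demo)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.2e}")
```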

### 3. Variance Decomposition

We decompose the total variance into within-batch and between-batch components:

$$\sigma^2_{total} = \sigma^2_{within} + \sigma^2_{between}$$

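With independent outputs, the variance of batch means is approximately $\sigma^2/m$ for batch size $m$, so the within/between ratio computed below should be close to $m$. A ratio well above $m$ suggests position effects, which inflate the spread within each batch; a ratio well below $m$ suggests that outputs in the same batch resemble each other.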

In [None]:
def analyze_variance_components(df):
    """Analyze variance components from position effects data."""
    # Overall variance
    overall_var = df['value'].var()
    
    # Within-batch variance (average variance within each batch)
    within_vars = df.groupby('batch')['value'].var()
    mean_within_var = within_vars.mean()
    
    # Between-batch variance, estimated as the variance of batch means
    # (this also contains a sampling term of roughly within-variance / batch size)
    batch_means = df.groupby('batch')['value'].mean()
    between_var = batch_means.var()
    
    print(f"Overall variance: {overall_var:.2f}")
    print(f"Mean within-batch variance: {mean_within_var:.2f}")
    print(f"Between-batch variance: {between_var:.2f}")
    print(f"Within/Between ratio: {mean_within_var/between_var:.2f}")
    
    # Visualize
    fig, ax = plt.subplots(figsize=(8, 6))
    components = ['Overall', 'Within-Batch', 'Between-Batch']
    values = [overall_var, mean_within_var, between_var]
    
    ax.bar(components, values, color=['blue', 'green', 'red'])
    ax.set_ylabel('Variance')
    ax.set_title('Variance Decomposition')
    
    for i, v in enumerate(values):
        ax.text(i, v + 1, f'{v:.2f}', ha='center')
    
    plt.show()

analyze_variance_components(df_position)
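One caveat is worth checking numerically: the decomposition $\sigma^2_{total} = \sigma^2_{within} + \sigma^2_{between}$ holds exactly only for population variances (`ddof=0`) with equal batch sizes, taking the mean within-batch variance and the variance of batch means as the two components. With pandas' default sample variance (`ddof=1`) the three numbers do not add up exactly. The sketch below illustrates this on synthetic i.i.d. data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "batch": np.repeat(np.arange(20), 5),   # 20 batches of size 5
    "value": rng.normal(50, 5, 100),
})
grouped = df_demo.groupby("batch")["value"]

# Population variances: the identity is exact
total_var = df_demo["value"].var(ddof=0)
within_var = grouped.var(ddof=0).mean()
between_var = grouped.mean().var(ddof=0)
gap_population = total_var - (within_var + between_var)

# Sample variances (pandas default): a small gap remains
gap_sample = df_demo["value"].var() - (grouped.var().mean() + grouped.mean().var())

print(f"ddof=0 gap: {gap_population:.2e}")
print(f"ddof=1 gap: {gap_sample:.4f}")
```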

### 4. Intraclass Correlation Coefficient (ICC)

The ICC measures the proportion of variance that is due to between-group differences:

$$ICC = \frac{\sigma^2_{between}}{\sigma^2_{between} + \sigma^2_{within}}$$

High ICC values (e.g., 0.69 as found by Gallo et al., 2025) indicate strong within-batch correlation.
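The variance of batch means overstates $\sigma^2_{between}$ because it also contains a $\sigma^2_{within}/m$ sampling term. A common way to correct for this is the one-way ANOVA estimator, $ICC = \frac{MS_{between} - MS_{within}}{MS_{between} + (m-1)\,MS_{within}}$. The sketch below assumes equal batch sizes; `icc_oneway` and the simulated data are illustrative and not taken from the study.

```python
import numpy as np
import pandas as pd


def icc_oneway(df, group_col="batch", value_col="value"):
    """One-way random-effects ICC for equal-sized groups (ANOVA estimator)."""
    grouped = df.groupby(group_col)[value_col]
    k = grouped.ngroups                  # number of batches
    m = int(grouped.size().iloc[0])      # batch size, assumed equal for all
    grand_mean = df[value_col].mean()
    ms_between = m * ((grouped.mean() - grand_mean) ** 2).sum() / (k - 1)
    ms_within = grouped.var(ddof=1).mean()   # pooled variance when sizes match
    return (ms_between - ms_within) / (ms_between + (m - 1) * ms_within)


# Simulate a shared batch effect (sd 8) plus independent noise (sd 5)
rng = np.random.default_rng(0)
n_batches, batch_size = 500, 5
batch_effect = np.repeat(rng.normal(0, 8, n_batches), batch_size)
df_demo = pd.DataFrame({
    "batch": np.repeat(np.arange(n_batches), batch_size),
    "value": 50 + batch_effect + rng.normal(0, 5, n_batches * batch_size),
})

true_icc = 8**2 / (8**2 + 5**2)
estimated_icc = icc_oneway(df_demo)
print(f"true ICC = {true_icc:.3f}, estimated ICC = {estimated_icc:.3f}")
```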

## Research Implications

### Effect on Sample Size Calculations

When outputs within a batch are correlated, the effective sample size is the actual sample size divided by the design effect, $1 + (m-1) \times ICC$:

$$n_{effective} = \frac{n_{actual}}{1 + (m-1) \times ICC}$$

where $m$ is the cluster size (batch size).

For example, with ICC = 0.69 and batch size = 10:
- Design effect = 1 + (10-1) × 0.69 = 7.21
- 100 samples → effective n ≈ 14
- **86% reduction in effective sample size, with a corresponding loss of statistical power!**

In [None]:
def calculate_design_effect(icc, cluster_size, total_samples):
    """Calculate the design effect and effective sample size."""
    design_effect = 1 + (cluster_size - 1) * icc
    n_effective = total_samples / design_effect
    
    print(f"ICC: {icc}")
    print(f"Cluster size: {cluster_size}")
    print(f"Total samples: {total_samples}")
    print(f"Design effect: {design_effect:.2f}")
    print(f"Effective sample size: {n_effective:.1f}")
    print(f"Effective sample size reduction: {(1 - n_effective/total_samples)*100:.1f}%")
    
    # Visualize impact across different ICC values
    icc_values = np.linspace(0, 0.9, 50)
    # Use a distinct loop variable so the `icc` argument is not shadowed
    effective_sizes = [total_samples / (1 + (cluster_size - 1) * value)
                       for value in icc_values]
    
    plt.figure(figsize=(10, 6))
    plt.plot(icc_values, effective_sizes, linewidth=2)
    plt.axhline(y=total_samples, color='gray', linestyle='--', label='Actual n')
    plt.axvline(x=0.69, color='red', linestyle='--', label='Gallo et al. (2025) ICC')
    plt.xlabel('Intraclass Correlation Coefficient (ICC)')
    plt.ylabel('Effective Sample Size')
    plt.title(f'Impact of ICC on Effective Sample Size (cluster size={cluster_size})')
    plt.grid(True, alpha=0.3)
    plt.legend()
    plt.show()

# Example from Gallo et al. (2025)
calculate_design_effect(icc=0.69, cluster_size=10, total_samples=100)
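The same design effect carries over to confidence intervals: the standard error of a mean is inflated by the square root of the design effect, so with a design effect of 7.21 an interval is about 2.69 times wider than the i.i.d. formula suggests. The sketch below uses a normal approximation; the helper `ci_half_width` and the standard deviation of 10 are illustrative assumptions.

```python
import math


def ci_half_width(std_dev, n, design_effect=1.0, z=1.96):
    """Half-width of an approximate 95% CI for a mean under a design effect."""
    return z * std_dev * math.sqrt(design_effect / n)


design_effect = 1 + (10 - 1) * 0.69          # ICC = 0.69, batch size = 10
naive = ci_half_width(std_dev=10.0, n=100)   # treats all 100 samples as i.i.d.
adjusted = ci_half_width(std_dev=10.0, n=100, design_effect=design_effect)

print(f"naive half-width:    {naive:.2f}")
print(f"adjusted half-width: {adjusted:.2f}")
print(f"widening factor:     {adjusted / naive:.2f}")
```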

## Summary

This methodology allows us to:

1. **Detect position effects**: Identify if outputs vary systematically by position within batch
2. **Measure correlation**: Quantify the degree of within-batch dependence
3. **Test distributional differences**: Determine if batch and sequential methods produce different distributions
4. **Calculate research impact**: Estimate the effect on statistical power and confidence intervals

The next chapter will apply this methodology to real LLM outputs.