# Project Overview — CoTAdapter vs BlockUniversalAdapter vs DeepTransformerAdapter
## Citation / Reference
Primary reference used for architecture and interpretation: [CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference](https://arxiv.org/abs/2310.10845).

This notebook loads our experiment results (three base models) and presents a compact analysis, plots, and conclusions.

## Methodology: Adapter-based SFT (why we used an adapter strategy)

We did **not** pretrain or fine-tune the entire transformer backbone due to compute constraints. Instead, we used an *adapter-like* strategy where the original GPT-2 backbone is **frozen** and small added modules (adapters) are trained. The three architectures compared are:

- **BlockUniversalAdapter**: weight-tied transformer block repeated (no CoT interleaving).
- **CoTAdapter**: a CoT-style layer which interleaves intermediary states across repeats and is weight-tied (simulates chain-of-thought at token level).
- **DeepTransformerAdapter**: new untied transformer layers (increased depth) added to the backbone.

**Why adapters?**

- Freezes the large backbone → large compute savings while allowing architectural comparisons.
- Enables fast iterations across many architectural variants without pretraining from scratch.
- Matches the experimental style used to study CoTFormer (see citation above) while remaining computationally feasible.

Below we load the three result CSVs (DistilGPT-2, GPT-2 Small, GPT-2 Medium) and reproduce the analyses and plots discussed in the report.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

distil_csv = 'distil_results.csv'
gpt2_csv = 'gpt2_results.csv'
medium_csv = 'gpt2-medium_results.csv'

df_distil = pd.read_csv(distil_csv)
df_gpt2 = pd.read_csv(gpt2_csv)
df_medium = pd.read_csv(medium_csv)

df_distil, df_gpt2, df_medium

## Perplexity comparison (1 × 3)

- For **DistilGPT-2** we expect the DeepTransformerAdapter to give the best PPL because the small backbone means untied extra layers add more representational power.
- For **GPT-2 Small / Medium** the CoTAdapter usually wins because CoT-style repeated reasoning benefits more from richer backbone representations and scales better with model size.

Run the following cell to produce the 1×3 Perplexity plots.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
model_sets = [df_distil, df_gpt2, df_medium]
titles = ['DistilGPT-2', 'GPT-2 Small', 'GPT-2 Medium']

for ax, df, title in zip(axes, model_sets, titles):
    ax.plot(df['model_name'], df['val_ppl'], marker='o', linewidth=2)
    ax.set_title(f'Perplexity — {title}')
    ax.set_ylabel('Validation Perplexity')
    ax.set_xlabel('Model Variant')
    ax.grid(True, linestyle='--', alpha=0.5)
    ax.tick_params(axis='x', rotation=20)

plt.tight_layout()
plt.show()

### Explanation of Perplexity patterns

- **Why Deep wins on DistilGPT-2:** the small backbone has limited representation power; adding untied layers increases capacity.
- **Why CoT wins on larger backbones:** repeated reasoning amplifies richer backbone representations and scales better.
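For reference when reading the perplexity numbers: perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below assumes `val_ppl` was computed this way, which the result CSVs do not state explicitly. The helper names `perplexity` and `relative_ppl_gain` and all numbers are hypothetical, chosen only for illustration.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


def relative_ppl_gain(baseline_ppl, variant_ppl):
    """Fractional perplexity reduction of a variant relative to a baseline."""
    return (baseline_ppl - variant_ppl) / baseline_ppl


# Hypothetical values, not taken from the result CSVs.
# A model that assigns probability 1/50 to every token has perplexity of about 50.
uniform_nlls = [math.log(50.0)] * 8
print(perplexity(uniform_nlls))

# A drop from 40 to 36 perplexity is a 10% relative reduction.
print(relative_ppl_gain(40.0, 36.0))
```

Comparing variants by relative reduction rather than absolute difference makes the three backbones easier to read side by side, since their baseline perplexities differ.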

## Training time comparison (1 × 3)

- On small models, attention costs dominate and CoTAdapter increases training time.
- On larger models, FFNs dominate compute; DeepTransformerAdapter becomes slower because it adds new FFNs.

Run the following cell to generate the plots.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, df, title in zip(axes, model_sets, titles):
    ax.plot(df['model_name'], df['training_time_s'], marker='o', linewidth=2)
    ax.set_title(f'Training Time — {title}')
    ax.set_ylabel('Training Time (s)')
    ax.set_xlabel('Model Variant')
    ax.grid(True, linestyle='--', alpha=0.5)
    ax.tick_params(axis='x', rotation=20)

plt.tight_layout()
plt.show()

### Explanation of Training Time patterns

- CoTAdapter is slower on small models due to attention costs.
- DeepTransformerAdapter is slower on large models due to FFN costs.
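One way to sanity-check the attention-versus-FFN argument is a back-of-the-envelope count of multiply-accumulates (MACs) per block. The sketch below assumes a standard transformer block with a 4x FFN expansion and ignores biases, layer norms, and softmax. The function `block_macs`, the sequence length, and the widths are illustrative assumptions, not measurements from our runs.

```python
def block_macs(n_query, n_context, d_model, ffn_mult=4):
    """Rough MAC counts for one transformer block.

    n_query:   number of tokens being processed
    n_context: number of key/value entries each token attends to
    """
    # Q, K, V and output projections: four d_model x d_model matmuls per token.
    proj = 4 * n_query * d_model * d_model
    # Attention scores (Q @ K^T) plus the weighted sum over values.
    mixing = 2 * n_query * n_context * d_model
    # Two FFN matmuls: d_model -> ffn_mult * d_model -> d_model.
    ffn = 2 * ffn_mult * n_query * d_model * d_model
    return {"proj": proj, "mixing": mixing, "ffn": ffn}


# Illustrative settings only: sequence length 1024, two hidden widths,
# and a context that is either the plain sequence or three times longer.
seq_len = 1024
for d_model in (768, 1024):
    for n_context in (seq_len, 3 * seq_len):
        macs = block_macs(seq_len, n_context, d_model)
        ratio = macs["mixing"] / macs["ffn"]
        print(f"d_model={d_model} context={n_context} mixing/ffn={ratio:.2f}")
```

Under these assumptions the ratio of context-dependent attention work to FFN work simplifies to `n_context / (4 * d_model)`, so it grows with the attended context and shrinks as the hidden width grows. This is consistent with the direction of the explanation above, but it is an estimate, not a profile.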

## Inference time comparison (1 × 3)

- CoTAdapter increases attention context, raising inference time.
- DeepTransformerAdapter increases FFN cost, especially on large models.

Run the next cell to generate the plots.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, df, title in zip(axes, model_sets, titles):
    ax.plot(df['model_name'], df['inference_time_ms'], marker='o', linewidth=2)
    ax.set_title(f'Inference Time — {title}')
    ax.set_ylabel('Inference Time (ms per token)')
    ax.set_xlabel('Model Variant')
    ax.grid(True, linestyle='--', alpha=0.5)
    ax.tick_params(axis='x', rotation=20)

plt.tight_layout()
plt.show()

### Explanation of Inference Time patterns

- CoTAdapter adds KV entries increasing attention cost.
- DeepTransformerAdapter adds FFN cost increasing compute.
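The growth in KV entries can be made concrete with a small counting sketch. It assumes that, with CoT-style interleaving, each repeat attends to the states produced by all earlier repeats as well as its own, which is our reading of the CoTFormer description. The helper `attended_pairs` is hypothetical, and causal masking is ignored because it scales both variants by roughly the same factor.

```python
def attended_pairs(seq_len, n_repeat, interleave):
    """Count query-key pairs summed over all repeats of one weight-tied block."""
    total = 0
    for repeat in range(1, n_repeat + 1):
        # With interleaving, repeat r also sees the states of repeats 1..r-1,
        # so its key/value context is r times the sequence length.
        context = seq_len * repeat if interleave else seq_len
        total += seq_len * context
    return total


# Tiny illustrative case: 4 tokens, 3 repeats.
plain = attended_pairs(seq_len=4, n_repeat=3, interleave=False)
cot = attended_pairs(seq_len=4, n_repeat=3, interleave=True)
print(plain)  # 48
print(cot)    # 96
```

Under this assumption the interleaved variant does `(n_repeat + 1) / 2` times the attention work of the plain weight-tied variant, so the overhead grows with the number of repeats while the FFN cost per repeat stays the same.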

## Parameter comparison (1 × 3)

- DeepTransformerAdapter increases parameters more than tied architectures.
- CoTAdapter maintains lower parameter count while improving perplexity.
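The gap between tied and untied designs follows from simple counting. The sketch below approximates one transformer block as four attention projection matrices plus a 4x-expansion FFN, ignoring biases and layer norms. The helpers `block_params` and `added_params`, the width of 768, and the depth of 3 are assumptions for illustration and do not describe the exact adapter configurations we trained.

```python
def block_params(d_model, ffn_mult=4):
    """Approximate weight count of one transformer block."""
    attention = 4 * d_model * d_model          # Q, K, V, output projections
    ffn = 2 * ffn_mult * d_model * d_model     # up and down projections
    return attention + ffn


def added_params(d_model, n_blocks, tied):
    """Weights added by n_blocks applications of a block.

    A weight-tied design reuses one block for every repeat, so its
    parameter count does not depend on n_blocks.
    """
    per_block = block_params(d_model)
    return per_block if tied else per_block * n_blocks


# Illustrative width and depth only.
print(added_params(768, n_blocks=3, tied=True))   # 7077888
print(added_params(768, n_blocks=3, tied=False))  # 21233664
```

Because the tied count is independent of the number of repeats, total parameters understate the compute of repeated architectures; this is worth keeping in mind wherever parameter count is used as a compute proxy.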

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, df, title in zip(axes, model_sets, titles):
    ax.plot(df['model_name'], df['total_params'], marker='o', linewidth=2, label='Total Params')
    ax.plot(df['model_name'], df['trainable_params'], marker='o', linewidth=2, label='Trainable Params')
    ax.set_title(f'Parameter Count — {title}')
    ax.set_ylabel('Params')
    ax.set_xlabel('Model Variant')
    ax.grid(True, linestyle='--', alpha=0.5)
    ax.tick_params(axis='x', rotation=20)
    ax.legend()

plt.tight_layout()
plt.show()

## Total Parameters vs Perplexity (three separate plots)

We use total parameters as a compute proxy and compare against perplexity.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
model_sets_for_params = [
    ('DistilGPT-2', df_distil),
    ('GPT-2 Small', df_gpt2),
    ('GPT-2 Medium', df_medium)
]

for ax, (title, df) in zip(axes, model_sets_for_params):
    ax.plot(df['total_params'], df['val_ppl'], marker='o', linewidth=2)
    for i, txt in enumerate(df['model_name']):
        ax.annotate(txt, (df['total_params'].iat[i], df['val_ppl'].iat[i]), textcoords='offset points', xytext=(3,3), fontsize=8)
    ax.set_title(f'Total Params vs PPL — {title}')
    ax.set_xlabel('Total Parameters')
    ax.set_ylabel('Validation Perplexity')
    ax.grid(True, linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

## Final Conclusion

CoTAdapter-style modifications provide an effective, parameter-efficient mechanism to improve perplexity, especially on larger backbones.

Key findings:

1. **Perplexity scaling:** CoTAdapter outperforms other adapters on GPT-2 Small and Medium. On DistilGPT-2, DeepTransformer performs best due to added depth benefits.
2. **Compute trade-offs:** Dominant compute cost shifts from attention (small models) to FFNs (large models).
3. **Parameter efficiency:** CoTAdapter maintains lower parameter count while improving performance.

These results align with the CoTFormer paper's reported scaling properties.

## Future Scope / Next Experiments

1. Increase number of repeats (nrepeat).
2. Token-wise adaptive repeats with routing.
3. Partial fine-tuning of backbone vs adapter-only training.
4. Direct MAC/FLOP measurement using profilers.
5. Longer training schedules for adaptive models.