In [1]:
import pandas as pd
import numpy as np
from equivariant_projected_diffusion import ClimbsFeatureArray

### So my attempt at a "Simple" Diffusion model might not have gone entirely to plan.

There are a few mistakes with this 'simple' architecture that make it painfully inefficient to train.

1. **Flattening the Sequence Data**: By flattening the input data into a single vector of length 243, I've destroyed information: the DDPM will not know at the outset that the 'hold 1 features' [0-11] are *distinct* from the 'hold 2 features' [12-23], etc. I didn't realize that DDPMs could handle sequence data directly, but it turns out there are multiple paradigms for accomplishing just that, such as **U-Nets** and **Transformer DDPMs**. It's better to keep my data sequential than to flatten it out into a single vector.
2. **Appending Conditional Features**: The features **grade**, **rating**, and **ascents** are not really things we need the diffusion model to recreate. Rather, they represent *conditions* which we should be able to apply to the DDPM to generate different climbs. Conditional features are usually passed to the diffusion model separately, via **Cross-Attention** (Transformers) or **Adaptive Group Normalization** (U-Nets). I should separate mine from the climb features.
3. **Null Hold Padding**: Okay, both Claude and Gemini called me out for this, but I think they're overreacting. The issue they're pointing out is that because I've zero-centered my climbs at the start holds, *both the start holds and the null holds have position [0,0,0]*. The concern is that my model will confuse the two. Now, in truth, the start holds have a non-zero **pull-value**, making them fairly distinct from the null holds. However, I can see merit in setting the position of the null holds to [-10,-10,-10] or something similar.
4. **'Normalizing' X,Y to [-1,1]**: Okay, this one I should have caught immediately. When converting my positions to 3D coordinates, I perform a linear transformation on the **X** and **Y** coordinates to create the new **X**, **Y**, and **Z** coordinates. This means that **I can't move the origin** before applying this transformation. It will mess it right up! So I'll have to rescale these climbs *after* converting them into the feature space.
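
To make points 1 to 3 concrete, here is a minimal sketch of undoing the flattening. It assumes the 243-long vector is 20 holds × 12 features followed by the 3 conditional features (**grade**, **rating**, **ascents**) at the end. That layout is my guess from the numbers above, and `split_flat_climb` and `null_hold_mask` are hypothetical helper names, not part of `equivariant_projected_diffusion`.

```python
import numpy as np

# Assumed layout: 20 holds * 12 features + 3 conditional features = 243.
N_HOLDS, N_HOLD_FEATURES, N_COND = 20, 12, 3


def split_flat_climb(flat):
    """Split one flat climb vector into (holds, conditions).

    holds has shape (N_HOLDS, N_HOLD_FEATURES); conditions has shape (N_COND,).
    """
    flat = np.asarray(flat, dtype=np.float32)
    assert flat.shape == (N_HOLDS * N_HOLD_FEATURES + N_COND,)
    holds = flat[:-N_COND].reshape(N_HOLDS, N_HOLD_FEATURES)
    cond = flat[-N_COND:]
    return holds, cond


def null_hold_mask(holds):
    """True for rows that are entirely zero (assumed to be null-hold padding)."""
    return ~holds.any(axis=1)


# Toy climb: two real holds, eighteen null holds, three conditions.
flat = np.zeros(243, dtype=np.float32)
flat[:12] = 1.0              # 'hold 1 features' [0-11]
flat[12:24] = 2.0            # 'hold 2 features' [12-23]
flat[-3:] = [5.0, 4.5, 120.0]  # made-up grade, rating, ascents

holds, cond = split_flat_climb(flat)
mask = null_hold_mask(holds)
print(holds.shape, cond, int(mask.sum()))
```

Because a start hold keeps its non-zero pull-value, an all-zero row check like this should not mistake it for padding, even though its position is [0,0,0]. The mask can then be handed to the model explicitly instead of hoping it infers padding from the values.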

### I suspect that when using a Discrete Absorbing State (DAS), it is useful to distinguish between the DAS Null and the Null used to represent null holds, i.e. sequence padding. Why?

Imagine you have the following starting sequence: $[A, B, Null, Null]$. You apply the forward discrete diffusion process to it, resulting in the following steps: $$1.\ [A, B, Null, Null] \rightarrow 2.\ [A, Null, Null, Null] \rightarrow 3.\ [Null, Null, Null, Null]$$ Now imagine what this process might look like if we encode "No hold" and "discrete absorbing null" separately (here I'll use $\Phi$ to represent the DAS): $$1.\ [A, B, Null, Null] \rightarrow 2.\ [A, \Phi, \Phi, Null] \rightarrow 3.\ [\Phi, \Phi, \Phi, \Phi]$$ Now consider this. If we ask the denoising model to remove the noise from step **2** in each process, is it performing the same task in both cases?

**No!** The denoising model actually has *extra information about our original dataset* in the second process. It knows for a fact that element 4 is $Null$, because an unmasked $Null$ can only have come from the original sequence. Thus, differentiating the "True Null" from the "Discrete Absorbing State Null" gives our model extra information which it can use to improve its predictions during training!
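
Here is a small sketch of the two encodings, just to make the difference visible. The token ids (`PAD`, `MASK`, holds starting at 2) and the `absorb` function are made up for illustration, and the single independent per-token masking step is a simplification of a real absorbing-state noise schedule.

```python
import numpy as np

# Hypothetical token ids: PAD is the "true Null" (no hold),
# MASK is the absorbing state (Phi). Real holds start at 2.
PAD, MASK = 0, 1


def absorb(tokens, mask_prob, rng, shared_null=False):
    """One forward noising step: each token is absorbed with prob. mask_prob.

    shared_null=True reuses PAD as the absorbing state (first process above);
    shared_null=False uses a separate MASK token (second process above).
    """
    tokens = np.asarray(tokens)
    absorbing = PAD if shared_null else MASK
    hit = rng.random(tokens.shape) < mask_prob
    return np.where(hit, absorbing, tokens)


x0 = np.array([2, 3, PAD, PAD])  # [A, B, Null, Null]

# Same seed for both, so the same positions get absorbed in each encoding.
shared = absorb(x0, 0.5, np.random.default_rng(0), shared_null=True)
separate = absorb(x0, 0.5, np.random.default_rng(0), shared_null=False)
print("shared  :", shared)
print("separate:", separate)

# With a separate MASK token, any PAD still visible must be original padding.
still_pad = separate == PAD
assert np.all(x0[still_pad] == PAD)
```

In the shared encoding, the two padding positions always read as `PAD` whether or not they were absorbed, so the noised sequence never reveals which ones the model still has to predict. In the separate encoding, every surviving `PAD` is a guaranteed fact about the original climb.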