### MODEL TEST 01

To use a transpose operation inside an nn.Sequential, we wrap it in a 'TransposeLayer' that subclasses nn.Module.

This is needed because torch's linear layers act on the last dimension of their input and treat every leading dimension as a batch dimension. Torch can work with matrices, but a matrix (2D tensor) passed to nn.Linear is read as a batch of row vectors that are processed in parallel.

For that reason, we can't pass frames stored as columns directly to the linear layer; we have to transpose them into row vectors first. We can then transpose them back to column vectors after the linear operation.
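
The notebook's own definition of the layer is not shown here, so the following is only a minimal sketch of what such a layer could look like. The `dim0`/`dim1` arguments, the tiny sizes, and the name `per_frame_linear` are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TransposeLayer(nn.Module):
    """Swap two dimensions of the input so the swap can live inside nn.Sequential."""

    def __init__(self, dim0: int = -2, dim1: int = -1):
        super().__init__()
        self.dim0 = dim0
        self.dim1 = dim1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.transpose(self.dim0, self.dim1)


# Hypothetical sizes: 5 features per frame, 3 frames stored as columns.
features_per_frame, n_frames, out_features = 5, 3, 2

per_frame_linear = nn.Sequential(
    TransposeLayer(),                             # (5, 3) -> (3, 5): one frame per row
    nn.Linear(features_per_frame, out_features),  # applied to every row independently
    TransposeLayer(),                             # (3, 2) -> (2, 3): one frame per column again
)

frames_as_columns = torch.randn(features_per_frame, n_frames)
output = per_frame_linear(frames_as_columns)
print(output.shape)  # torch.Size([2, 3])
```

Each output column depends only on the matching input column, which is the per-frame behaviour we want from the linear layer.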

In [None]:
...

We subclass PyTorch's nn.Module and override its forward method, which encapsulates the model's feed-forward pass and is used during both training and inference.
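
As a small illustration of that pattern, here is a sketch with a hypothetical `TinyModel` (a stand-in, not the gesture model) showing the same `forward` being used in training mode and for inference.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):  # hypothetical stand-in, not the gesture model
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)


model = TinyModel()
x = torch.randn(3, 4)

# Training: gradients are tracked so an optimizer can update the weights.
model.train()
train_output = model(x)  # calling the module runs forward()

# Inference: the same forward(), but without building the autograd graph.
model.eval()
with torch.no_grad():
    inference_output = model(x)
```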

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

In [None]:
class DiffuseStyleGestureRecModel(nn.Module):
    def __init__(self, 
                number_of_styles: int,                      # Number of unique styles. In this context this is the number of speakers, since we treat each speaker as a style
                N_gesture_length: int,                      # Length of the sequence snippets to generate. We generate autoregressively, continuously producing small chunks
                N_seed_length: int,                         # Length of the seed. The seed is the last frames of the previously generated sequence, used to make the generation smooth and continuous
                audio_features_per_frame: int,              # Number of audio features per frame. This is a mixture of prosodic features, onsets, wavlm, etc.
                pose_features_per_frame: int):              # Number of pose features per frame. These are the rotations / translations of the bones in the character skeleton. We may not pay attention to every channel for every bone, or every bone. 
        super().__init__()


        # Implementation of the DiffuseStyleGesture model based on the paper by Yang et al. (YoungSeng).
        # We instantiate all learned model layers needed below.

        # The time step encoding MLP. Our best guess is that this is actually a learned position encoding, as described in Vaswani et al.
        # It could be interesting to investigate sinusoidal positional encoding instead. That would reduce the number of weights, and it might run faster.
        # Sinusoidal position embeddings seem to work very well. In Andrej Karpathy's LLM video series he inspects the GPT-2 weights and claims
        # that they are not fully converged yet, because they are spiky.
        self.time_step_mlp = nn.Sequential(
            nn.Linear(1, 128),
            nn.ReLU(),
            nn.Linear(128, 256)
        )
        
        # Seed gesture linear layer - for dimensionality reduction
        # We reduce the flattened seed gesture (pose_features_per_frame * N_seed_length values) to just 192. That compresses the seed a lot: for 8 frames of
        # 1141 pose features each, 192 values is only about a sixth of a single frame. Presumably it captures the broad tendencies of the previous motion,
        # which will hopefully be enough to make the motion smooth.
        self.seed_gesture_linear = nn.Linear(
            in_features=pose_features_per_frame * N_seed_length, 
            out_features=192
        )
        # TODO: Add as nn.Sequential where we can flatten??

        # Style linear layer - for dimensionality expansion from a one-hot encoded (number_of_styles, 1) to a (64, 1) shape
        # We move from a one-hot encoding to a 64-dimensional vector. Our best guess is that, instead of working with individual styles, the layer extracts features of the style.
        # For instance, two speakers might share the same general waviness in their gestures, and this can be encoded as a feature shared between them.
        # This should make the model more general and make it easier to generalize to new speakers.
        self.style_linear = nn.Linear(
            in_features=number_of_styles, 
            out_features=64
        )
        
        # Audio feature linear layer per frame - for dimensionality reduction
        # The idea is to reduce the number of audio features per frame to a much smaller number. We use WavLM, which produces a very large embedding per frame.
        # From the illustration, it is not clear whether only the WavLM features are passed through this layer, or the other audio features as well. Something to investigate.
        # TODO: Decide where to transpose; it should probably happen directly in the forward pass
        self.audio_linear = nn.Linear(
            in_features=audio_features_per_frame, 
            out_features=64
        )

        # Noisy gesture sequence linear layer - for dimensionality reduction
        # Must be applied to each frame vector in the sequence tensor individually
        # Here we go from the pose feature dimension (1141) to 256, presumably representing the pose in a more general way.
        # The vast majority of combinations of limb rotations are highly unlikely, so it makes sense that the pose can be compressed a lot.
        # Based on the idea of "nice" and "ugly" numbers, which we got from Andrej Karpathy, it might be worth changing the number 1141.
        # Removing some features, or padding with a few extra ones, to reach a "nicer" size might make the layer faster.
        self.noisy_gesture_linear = nn.Linear(
            in_features=pose_features_per_frame, 
            out_features=256
        )
        # Attention layers
        # TODO: Make number of heads and number of attention layers a hyper-parameter
        
        # TODO: Use the cross_local_attention function from the paper
        self.cross_locale_attention = nn.MultiheadAttention(
            embed_dim=256, 
            num_heads=8, 
            batch_first=True,
        )
        
        self.self_attention = nn.MultiheadAttention(
            embed_dim=256, 
            num_heads=8, 
            batch_first=True,
        )
        
        # Final transformation, creating the output
        self.final_linear = nn.Linear(256, pose_features_per_frame)
    
    def forward(self, 
                frames_are_culumns: bool,

                t_current_defusion_time_step: torch.Tensor,  # Passed as a tensor, since .view() is called on it below
                seed_gesture_tensor, 
                one_hot_style_tensor, 
                audio_features_tensor, 
                noisy_gesture_sequence_tensor,

                apply_random_mask_to_seed: bool, 
                apply_random_mask_to_style: bool):
        
        # 1.1 - Prepare the diffusion time step t

        # 1.1.1 - Encode t (the current diffusion time step) using an MLP.
        #         nn.Linear needs a float input of shape (..., 1) and returns (..., 256),
        #         so we reshape t to (1, 1) and drop the leading dimension afterwards.
        t_tensor = self.time_step_mlp(t_current_defusion_time_step.float().view(-1, 1)).squeeze(0)  # (256,)
        

        # 1.2 - Prepare the seed gesture
        
        # 1.2.1 - mask the seed gesture if apply_random_mask_to_seed is True
        if apply_random_mask_to_seed:
            seed_gesture_tensor = torch.zeros_like(seed_gesture_tensor)
        
        # 1.2.2 - Apply a linear layer to get a flat tensor of shape (192,)
        #         We flatten the seed gesture tensor with flatten() before applying
        #         the linear layer, so it matches the layer's in_features
        seed_gesture_tensor = self.seed_gesture_linear(seed_gesture_tensor.flatten())  # (192,)

        # 1.3 - Prepare the style tensor

        # 1.3.1 - Apply a linear layer to get a tensor of shape (64, 1)
        
        style_tensor = self.style_linear(one_hot_style_tensor) 

        # 1.3.2 - Mask the style if apply_random_mask_to_style is True
        if apply_random_mask_to_style:
            style_tensor = torch.zeros_like(style_tensor)


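        # 1.4 - Prepare the audio features tensor
        #       Apply a linear layer to get a tensor of shape (64, N) - every column is the features for that frame
        #       nn.Linear acts on the last dimension, so we transpose to one frame per row, apply the layer, and transpose back
        #       (this assumes the input stores frames as columns, as the shape comments here describe)
        #       TODO: Consider making this 128 to make the final tensor of shape (640, N) (Nice number)
        audio_features_tensor = self.audio_linear(audio_features_tensor.transpose(0, 1)).transpose(0, 1)  # (64, N)

        # 1.5 - Prepare the noisy gesture sequence tensor (1141, N)
        #       Apply a linear layer to get a tensor of shape (256, N) - every column is the features for that frame
        #       The linear layer is applied per frame, using the same transpose pattern as for the audio features
        noisy_gesture_sequence_tensor = self.noisy_gesture_linear(noisy_gesture_sequence_tensor.transpose(0, 1)).transpose(0, 1)  # (256, N)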

        # 2 - Combine input tensors to get the input tensor for the model

        # 2.1 - Concatenate the transformed seed gesture tensor, style tensor
        #       To get a tensor of shape (256, 1), called the seed_style_t_tensor
        seed_style_t_tensor = torch.cat((seed_gesture_tensor, style_tensor), dim=0)
        
        # 2.1.1 - And add the time step tensor using element wise addition
        seed_style_t_tensor += t_tensor

        # 2.2 - Concatenate the audio features tensor and Noisy gesture sequence tensor
        #       To get a tensor of shape (320, N)
        audio_noisy_gesture_tensor = torch.cat([audio_features_tensor, noisy_gesture_sequence_tensor], dim=0)  # (320, N)
        
        # 2.3 - replicate the seed_style_t_tensor to get a tensor of shape (256, N)

        # 2.4 - Concatenate the seed_style_t_tensor and the audio_features_noisy_gesture_sequence tensor
        #       To get a tensor of shape (576, N) (Could be nicer)
        #       This gives us the 'input_tensor' for the model


        # 3 - The Attention layers

        # 3.1 - Add RPE (Relative Positional Encoding) to the input_tensor

        # 3.2 - Apply cross-local attention to the RPE'ed input tensor

        # 3.3 - Pass the output of the cross-local attention to a linear layer
        #       to get a tensor of shape (256, N)

        # 3.4 - Concatenate the output of the linear layer with the seed_style_t_tensor
        #       To get a tensor of shape (256, N+1)

        # 3.5 - Apply a self-attention layer to the tensor of shape (256, N+1)

        # 3.6 - Pass the output of the self-attention layer to a linear layer,
        #       to get a tensor of shape (1141, N)

        # 4 - Return the output of the linear layer
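
Steps 2.3 to 4 are only described in comments so far. As a rough shape check, the sketch below walks through them on random stand-in tensors. It rests on several assumptions: plain `nn.MultiheadAttention` stands in for cross-local attention (no RPE, no local mask), the first attention layer uses `embed_dim=576` so that it accepts the concatenated input (which differs from the 256 declared in `__init__`), and the names `replicated`, `tokens`, `attention_1`, `project_to_256` and `attention_2` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N = 6                 # hypothetical number of frames in the snippet
pose_features = 1141  # pose feature size mentioned in the comments

# Stand-ins for the tensors produced by steps 2.1 and 2.2 (frames as columns).
seed_style_t_tensor = torch.randn(256)            # (256,)
audio_noisy_gesture_tensor = torch.randn(320, N)  # (320, N)

# 2.3 - Replicate the seed/style/time vector once per frame.
replicated = seed_style_t_tensor.unsqueeze(1).expand(-1, N)  # (256, N)

# 2.4 - Stack along the feature axis to form the model input.
input_tensor = torch.cat([replicated, audio_noisy_gesture_tensor], dim=0)  # (576, N)

# nn.MultiheadAttention with batch_first=True expects (batch, sequence, features),
# so frames become rows and a batch dimension of 1 is added.
tokens = input_tensor.transpose(0, 1).unsqueeze(0)  # (1, N, 576)

# 3.2 / 3.3 - Plain attention as a stand-in for cross-local attention,
# followed by a projection down to 256 features.
attention_1 = nn.MultiheadAttention(embed_dim=576, num_heads=8, batch_first=True)
project_to_256 = nn.Linear(576, 256)
attended, _ = attention_1(tokens, tokens, tokens)
hidden = project_to_256(attended)  # (1, N, 256)

# 3.4 - Prepend the seed/style/time vector as one extra token.
extra_token = seed_style_t_tensor.view(1, 1, 256)
hidden = torch.cat([extra_token, hidden], dim=1)  # (1, N + 1, 256)

# 3.5 / 3.6 - Self-attention, drop the extra token, project back to pose features.
attention_2 = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
final_linear = nn.Linear(256, pose_features)
hidden, _ = attention_2(hidden, hidden, hidden)
output = final_linear(hidden[:, 1:, :]).squeeze(0).transpose(0, 1)  # (1141, N)

print(input_tensor.shape, output.shape)
```

Dropping the extra token before the final projection is one way to get from N+1 positions back to the N frames that step 3.6 expects; the comments do not say how this is meant to be done.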
