# Lab 7: Human motion generation

## Advanced deep learning

## Setup

### Setup dataset


Please access this Google Drive folder: [link](https://drive.google.com/drive/folders/1V5yzlwBPSNVPj33SfDHnvMykXISh3CyB?usp=sharing) and create a shortcut in the root of your Google Drive `/content/drive/MyDrive/`.

In [2]:
from google.colab import drive
drive.mount('/content/drive')
!ls /content/drive/MyDrive/humanml3d-data

Mounted at /content/drive
caption_clip  caption_raw  checkpoints	humanml3d_test_split.txt  smplh  smpl_rifke


### Setup environment

Make sure you're running on a T4 GPU Colab instance; if not, activate it.

In [3]:
!nvidia-smi

Fri Feb 28 19:44:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   57C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [4]:
import torch
torch.cuda.is_available()

True

Clone the lab repository.

In [None]:
# !git clone https://github.com/robincourant/lab7-CSC52087EP.git
# !git clone https://github.com/robincourant/lab-MotionDiT.git
!git clone https://github.com/SebastienJin/lab7-CSC52087EP.git # I replaced the repo with my fork so that I can use the latest version of the code

fatal: destination path 'lab7-CSC52087EP' already exists and is not an empty directory.


In [4]:
%cd lab7-CSC52087EP
# %cd lab-MotionDiT
!ln -s /content/drive/MyDrive/humanml3d-data ./ # Plug the dataset in the repo

/content/lab7-CSC52087EP
ln: failed to create symbolic link './humanml3d-data': File exists


Install required libraries

In [2]:
!pip install hydra-core
!pip install pyrender
!pip install smplx
!pip install torchtyping
!pip install lightning
!pip install ema_pytorch



## Human motion dataset and representation

### HumanML3D dataset

#### Question 1:
*Answer:*

According to the Papers with Code page (https://paperswithcode.com/dataset/humanml3d), the HumanML3D dataset consists of 14,616 motions and 44,970 descriptions composed of 5,371 distinct words. The motions total 28.59 hours. The average motion length is 7.1 seconds, and the average description length is 12 words.



### SMPL representation


#### Question 2:
*Answer:*

The SMPL model mainly uses the following two sets of input parameters to infer the mesh vertices:

1. Pose Parameters: A 72-dimensional vector (24 joints × 3 values each, i.e. 23 body joints plus the global root orientation, usually in the axis-angle representation). Each joint rotation is expressed relative to its parent in the kinematic tree, and the zero vector corresponds to the rest pose.
These parameters define the overall movement and posture of the human body and are key for capturing dynamic variations in the model.

2. Shape Parameters: A 10-dimensional vector, where each entry is a coefficient along one of the principal components obtained by performing PCA on a large dataset of 3D body scans. The shape parameters describe individual differences in body shape, such as variations in body size and proportions (e.g., slim, heavy, tall, or short), enabling the generation of a mesh that conforms to a specific human form.
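To make the two parameter sets concrete, here is a minimal NumPy sketch that converts a 72-dimensional axis-angle pose vector into per-joint rotation matrices with Rodrigues' formula. It is only an illustration of the representation, not the code in `visualize_smpl.py`; the names `NUM_JOINTS`, `NUM_BETAS`, and `axis_angle_to_matrix` are assumptions introduced here.

```python
import numpy as np

NUM_JOINTS = 24  # 23 body joints + 1 global root orientation
NUM_BETAS = 10   # PCA shape coefficients


def axis_angle_to_matrix(pose: np.ndarray) -> np.ndarray:
    """Convert a (72,) axis-angle pose vector into (24, 3, 3) rotation matrices."""
    aa = pose.reshape(NUM_JOINTS, 3)
    angle = np.linalg.norm(aa, axis=1, keepdims=True)  # rotation angle per joint
    axis = aa / np.maximum(angle, 1e-8)                # unit axis (zero if no rotation)
    x, y, z = axis[:, 0], axis[:, 1], axis[:, 2]
    zeros = np.zeros_like(x)
    # Skew-symmetric cross-product matrix K of each axis.
    K = np.stack([zeros, -z, y, z, zeros, -x, -y, x, zeros], axis=1).reshape(-1, 3, 3)
    sin = np.sin(angle)[:, :, None]
    cos = np.cos(angle)[:, :, None]
    # Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
    return np.eye(3)[None] + sin * K + (1.0 - cos) * (K @ K)


# Rest pose and mean shape: all-zero parameters.
pose = np.zeros(NUM_JOINTS * 3)
betas = np.zeros(NUM_BETAS)
rotations = axis_angle_to_matrix(pose)
print(rotations.shape, betas.shape)
```

With all-zero pose parameters every joint rotation is the identity, which corresponds to the rest pose.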

#### Code 2
*Complete `visualize_smpl.py`*

In [None]:
!HYDRA_FULL_ERROR=1 PYTHONPATH=$(pwd) python src/visualize_smpl.py

import moviepy.editor
moviepy.editor.ipython_display("./smpl.mp4")

[2025-02-24 06:42:08,845][numexpr.utils][INFO] - NumExpr defaulting to 2 threads.
  self.feat_mean = torch.load(standardization["feat_rifke"]["mean_path"])
  self.feat_std = torch.load(standardization["feat_rifke"]["std_path"])
  self.tmrrifke_mean = torch.load(standardization["tmr_rifke"]["mean_path"])
  self.tmrrifke_std = torch.load(standardization["tmr_rifke"]["std_path"])
Number of frames F: 197
Number of features d: 205

Caption:
a man stands up, walks clock-wise in a circle, then sits back down.


## Model architectures

### Config A: Incontext

#### Code 3
*Complete `src/models/modules/incontext.py`*

In [6]:
!PYTHONPATH=$(pwd) python src/models/modules/incontext.py

Test passed!


### Config B: AdaLN

#### Question 3:
*Answer:*

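$$ \mathrm{AdaLN}(x, \gamma, \beta) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $$

where $\mu$ and $\sigma$ are the mean and standard deviation of $x$ along the normalization dimensions, $\epsilon$ is a small constant for numerical stability, and the scale $\gamma$ and shift $\beta$ are predicted from the conditioning signal instead of being fixed learned parameters.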
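The formula can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the lab's `adaln.py` (which is presumably a PyTorch module): the linear head `modulation_from_condition` and its weights `W` and `b` are hypothetical and only show that the scale and shift are computed from the condition.

```python
import numpy as np


def adaln(x, gamma, beta, eps=1e-5):
    """Adaptive layer norm. x: (T, d) tokens; gamma, beta: (d,) scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)     # per-token mean over features
    var = x.var(axis=-1, keepdims=True)     # per-token variance over features
    x_hat = (x - mu) / np.sqrt(var + eps)   # standard layer normalization
    return gamma * x_hat + beta             # condition-dependent modulation


def modulation_from_condition(c, W, b):
    """Hypothetical linear head mapping a condition vector c (d_c,) to (gamma, beta)."""
    out = c @ W + b                         # (2 * d,)
    gamma, beta = np.split(out, 2)
    return gamma, beta


rng = np.random.default_rng(0)
T, d, d_c = 6, 8, 4
x = rng.standard_normal((T, d))
c = rng.standard_normal(d_c)
W, b = rng.standard_normal((d_c, 2 * d)), np.zeros(2 * d)
gamma, beta = modulation_from_condition(c, W, b)
print(adaln(x, gamma, beta).shape)
```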

#### Code 4
*Complete `src/models/modules/adaln.py`*

In [14]:
!PYTHONPATH=$(pwd) python src/models/modules/adaln.py

Test passed!


### Config C: Cross attention

#### Question 4:
*Answer:*

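$$
\mathrm{CA}(x, c)
= \mathrm{Softmax}\Bigl(\frac{x W_Q \,\bigl(c W_K\bigr)^{T}}{\sqrt{d}}\Bigr)\; c W_V
$$

where $x$ provides the queries, the condition $c$ provides the keys and values, $W_Q$, $W_K$, $W_V$ are learned projection matrices, and $d$ is the dimension of the projected queries and keys.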
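As an illustration of the formula, the following single-head NumPy sketch computes cross-attention between a sequence `x` and a condition sequence `c`. It is a simplified stand-in, not the lab's `cross_attention.py`: there is no multi-head split, masking, or output projection, and all names are assumptions made for this example.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(x, c, W_q, W_k, W_v):
    """x: (T, d_x) query tokens; c: (N, d_c) condition tokens."""
    q = x @ W_q                               # (T, d) queries come from x
    k = c @ W_k                               # (N, d) keys come from the condition
    v = c @ W_v                               # (N, d) values come from the condition
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, N) scaled dot products
    attn = softmax(scores, axis=-1)           # each row sums to 1
    return attn @ v, attn


rng = np.random.default_rng(0)
T, N, d_x, d_c, d = 5, 3, 8, 6, 4
x, c = rng.standard_normal((T, d_x)), rng.standard_normal((N, d_c))
W_q = rng.standard_normal((d_x, d))
W_k, W_v = rng.standard_normal((d_c, d)), rng.standard_normal((d_c, d))
out, attn = cross_attention(x, c, W_q, W_k, W_v)
print(out.shape, attn.shape)
```

Each output token is a convex combination of the condition's value vectors, so the condition can influence different positions of `x` differently.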


#### Code 5
*Complete `src/models/modules/cross_attention.py`*

In [5]:
!PYTHONPATH=$(pwd) python src/models/modules/cross_attention.py

Test passed!


#### Question 5:
*Answer:*

Compared to cross-attention, AdaLN injects the conditioning information directly at the normalization step, letting each layer adapt its scale and shift to the condition. This is more parameter-efficient and often more stable when integrating conditions, as it does not rely on additional attention operations.


## Diffusion framework

### DDPM

#### Code 6
*Complete `src/training/losses/ddpm.py`*

In [6]:
!PYTHONPATH=$(pwd) python src/training/losses/ddpm.py

Test passed!


#### Code 7
*Complete `src/training/sampler/ddpm.py`*

In [8]:
!PYTHONPATH=$(pwd) python src/generate.py batch_size=1 diffuser/sampler@diffuser.test_sampler=ddpm seed=2

import moviepy.editor
moviepy.editor.ipython_display("./generation_ddpm_incontext.mp4")

Seed set to 2
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(limit_predict_batches=1)` was configured so 1 batch will be used.
  torch.load(motion_ckpt_path, map_location=self.device)
  torch.load(text_ckpt_path, map_location=self.device)
  self.feat_mean = torch.load(standardization["feat_rifke"]["mean_path"])
  self.feat_std = torch.load(standardization["feat_rifke"]["std_path"])
  self.tmrrifke_mean = torch.load(standardization["tmr_rifke"]["mean_path"])
  self.tmrrifke_std = torch.load(standardization["tmr_rifke"]["std_path"])
  checkpoint = torch.load(config.checkpoint_path, map_location=torch.device("cpu"))
2025-02-28 20:00:13.252228: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1740772813.272099    5477 cuda_dnn.cc:8310] Unable to register cuDNN factory: A

### DDIM

#### Question 6:
*Answer:*

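DDIM speeds up sampling by replacing the Markovian chain of DDPM with a non-Markovian formulation that keeps the same training objective, so a network trained with the DDPM loss can be reused unchanged. Its reverse update is deterministic (for η = 0): each step predicts the clean sample from the current noisy one and moves it to an earlier noise level, which does not have to be the immediately preceding timestep. This allows DDIM to skip steps and generate samples in fewer iterations without a significant loss in quality. The key difference lies in the way the reverse process is derived: DDPM injects fresh noise at every step of a Markov chain, whereas the deterministic DDIM update can be viewed as the discretization of an ODE, enabling more flexible and efficient sampling.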
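A single DDIM update can be sketched as follows. This is a minimal NumPy illustration of the update rule from the DDIM paper, not the lab's `src/training/sampler/ddim.py`; the function names, the `eta` argument, and the strided schedule helper are assumptions made for this example. Here `abar_t` and `abar_prev` denote the cumulative products of the noise schedule at the current and target timesteps.

```python
import numpy as np


def ddim_step(x_t, eps_pred, abar_t, abar_prev, eta=0.0, rng=None):
    """One reverse step from noise level abar_t to the (less noisy) level abar_prev."""
    # Predict the clean sample from the current noisy sample and the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # eta = 0 gives the deterministic DDIM update; eta = 1 matches DDPM-like noise.
    sigma = eta * np.sqrt((1.0 - abar_prev) / (1.0 - abar_t)) * np.sqrt(1.0 - abar_t / abar_prev)
    direction = np.sqrt(1.0 - abar_prev - sigma**2) * eps_pred
    noise = 0.0
    if eta > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        noise = sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(abar_prev) * x0_pred + direction + noise


def strided_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced subsequence of timesteps, from the noisiest down to 0."""
    return np.linspace(num_train_steps - 1, 0, num_sample_steps).round().astype(int)


print(strided_timesteps(1000, 10))
```

Because the target noise level is an argument, the same update works for any pair of timesteps, which is what makes sampling on a short strided schedule possible.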

#### Code 8
*Complete `src/training/sampler/ddim.py`*

In [9]:
!PYTHONPATH=$(pwd) python src/generate.py batch_size=1 diffuser/sampler@diffuser.test_sampler=ddim seed=2

import moviepy.editor
moviepy.editor.ipython_display("./generation_ddim_incontext.mp4")

Seed set to 2
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(limit_predict_batches=1)` was configured so 1 batch will be used.
  torch.load(motion_ckpt_path, map_location=self.device)
  torch.load(text_ckpt_path, map_location=self.device)
  self.feat_mean = torch.load(standardization["feat_rifke"]["mean_path"])
  self.feat_std = torch.load(standardization["feat_rifke"]["std_path"])
  self.tmrrifke_mean = torch.load(standardization["tmr_rifke"]["mean_path"])
  self.tmrrifke_std = torch.load(standardization["tmr_rifke"]["std_path"])
  checkpoint = torch.load(config.checkpoint_path, map_location=torch.device("cpu"))
2025-02-28 21:20:25.103732: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1740777625.131380   25323 cuda_dnn.cc:8310] Unable to register cuDNN factory: A

#### Question 7:
*Answer:*

DDPM sampling is stochastic: fresh noise is injected at every reverse step, so the generated motions tend to be highly diverse but may be noisier or less detailed. In contrast, DDIM uses a non-Markovian deterministic sampling strategy that speeds up the sampling process and produces cleaner, more consistent motions, albeit with slightly reduced diversity.

## Result analysis

### Qualitative analysis

#### Code 9

In [10]:
!PYTHONPATH=$(pwd) python src/generate.py batch_size=1 diffuser/network=incontext \
checkpoint_path=./humanml3d-data/checkpoints/incontext.ckpt

import moviepy.editor
moviepy.editor.ipython_display("./generation_ddpm_incontext.mp4")

Seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(limit_predict_batches=1)` was configured so 1 batch will be used.
  torch.load(motion_ckpt_path, map_location=self.device)
  torch.load(text_ckpt_path, map_location=self.device)
  self.feat_mean = torch.load(standardization["feat_rifke"]["mean_path"])
  self.feat_std = torch.load(standardization["feat_rifke"]["std_path"])
  self.tmrrifke_mean = torch.load(standardization["tmr_rifke"]["mean_path"])
  self.tmrrifke_std = torch.load(standardization["tmr_rifke"]["std_path"])
  checkpoint = torch.load(config.checkpoint_path, map_location=torch.device("cpu"))
2025-02-28 21:22:42.453947: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1740777762.485286   25977 cuda_dnn.cc:8310] Unable to register cuDNN factory: 

In [15]:
!PYTHONPATH=$(pwd) python src/generate.py batch_size=1 diffuser/network=adaln \
checkpoint_path=./humanml3d-data/checkpoints/adaln.ckpt

import moviepy.editor
moviepy.editor.ipython_display("./generation_ddpm_adaln.mp4")

Seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(limit_predict_batches=1)` was configured so 1 batch will be used.
  torch.load(motion_ckpt_path, map_location=self.device)
  torch.load(text_ckpt_path, map_location=self.device)
  self.feat_mean = torch.load(standardization["feat_rifke"]["mean_path"])
  self.feat_std = torch.load(standardization["feat_rifke"]["std_path"])
  self.tmrrifke_mean = torch.load(standardization["tmr_rifke"]["mean_path"])
  self.tmrrifke_std = torch.load(standardization["tmr_rifke"]["std_path"])
  checkpoint = torch.load(config.checkpoint_path, map_location=torch.device("cpu"))
2025-02-28 21:36:05.609397: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1740778565.628769   29674 cuda_dnn.cc:8310] Unable to register cuDNN factory: 

In [12]:
!PYTHONPATH=$(pwd) python src/generate.py batch_size=1 diffuser/network=cross_attention \
checkpoint_path=./humanml3d-data/checkpoints/cross_attention.ckpt

import moviepy.editor
moviepy.editor.ipython_display("./generation_ddpm_cross_attention.mp4")

Seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(limit_predict_batches=1)` was configured so 1 batch will be used.
  torch.load(motion_ckpt_path, map_location=self.device)
  torch.load(text_ckpt_path, map_location=self.device)
  self.feat_mean = torch.load(standardization["feat_rifke"]["mean_path"])
  self.feat_std = torch.load(standardization["feat_rifke"]["std_path"])
  self.tmrrifke_mean = torch.load(standardization["tmr_rifke"]["mean_path"])
  self.tmrrifke_std = torch.load(standardization["tmr_rifke"]["std_path"])
  checkpoint = torch.load(config.checkpoint_path, map_location=torch.device("cpu"))
2025-02-28 21:26:30.881250: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1740777990.912042   27068 cuda_dnn.cc:8310] Unable to register cuDNN factory: 

#### Question 8
*Answer:*

Yes, each architecture produces samples with distinct characteristics. Videos produced with the incontext architecture tend to be somewhat stiff and may lack rich detail, but they align well with the given conditions. Videos generated with the adaln architecture show smoother and more natural motion continuity, while the samples produced with the cross_attention architecture exhibit richer details and more pronounced motion responses.

### Quantitative analysis

#### Question 9:
*Answer:*

The main assumption is that both the reference and generated feature distributions can be approximated as multivariate Gaussians. This means that the features extracted are assumed to be sufficiently described by their mean and covariance, allowing the Fréchet Distance to be computed in closed form.
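Under this Gaussian assumption, with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ of the reference and generated features, the distance has the closed form

$$ \mathrm{FD} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\bigl(\Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2}\bigr) $$

A minimal NumPy sketch of this formula is given below. It is not the lab's `src/metrics/frechet.py`; the function name and the eigenvalue-based computation of the matrix square-root trace are choices made for this example.

```python
import numpy as np


def frechet_distance(feats_ref, feats_gen):
    """Fréchet distance between Gaussians fitted to two (n, d) feature sets."""
    mu_r, mu_g = feats_ref.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_ref, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^{1/2}) is the sum of the square roots of the eigenvalues
    # of cov_r @ cov_g, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_r - mu_g) ** 2).sum()
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
ref = rng.standard_normal((500, 4))
print(frechet_distance(ref, ref + 1.0))  # only the mean term contributes here
```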

#### Code 10
*Complete `src/metrics/frechet.py`*

#### Code 11
*Complete `src/metrics/similarity.py`*

#### Bonus 1:
*Answer:*

**R1, R2, R3**

The metrics R1, R2, and R3 are top-k retrieval scores that indicate how often the correct match is found within the top 1, 2, or 3 ranked candidates, respectively:
 - R1 (Recall at 1): Measures the percentage of queries for which the correct item is the very top retrieval. A high R1 means that the model often ranks the correct match as the best candidate.
 - R2 (Recall at 2): Indicates the proportion of queries where the correct item is found within the top two predictions. This relaxes the condition slightly compared to R1.
 - R3 (Recall at 3): Reflects the frequency with which the correct match appears in the top three results, further loosening the requirement.

The code computes these metrics by comparing the rank order of distances between text and character features. For each sample, it checks whether the correct match appears in the first k predictions, and then averages these counts over all samples to produce R1, R2, and R3.
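The ranking procedure described above can be sketched as follows. This is a hedged NumPy illustration rather than the lab's evaluation code: it assumes paired query and candidate embeddings (row `i` of each array forms the correct match) and uses cosine similarity, and the function and argument names are hypothetical.

```python
import numpy as np


def recall_at_k(query_feats, candidate_feats, ks=(1, 2, 3)):
    """Top-k retrieval scores for paired (N, d) embeddings."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    sim = q @ c.T                          # (N, N) cosine similarities
    true_sim = np.diag(sim)[:, None]       # similarity to the correct match
    # Rank of the correct match = number of candidates that score strictly higher.
    rank = (sim > true_sim).sum(axis=1)    # 0 means retrieved first
    return {f"R{k}": float((rank < k).mean()) for k in ks}


rng = np.random.default_rng(0)
feats = rng.standard_normal((32, 16))
noisy = feats + 0.5 * rng.standard_normal((32, 16))
print(recall_at_k(feats, noisy))
```

By construction the scores are non-decreasing in k, since a match found in the top 1 is also in the top 2 and top 3.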

**PRDC**

PRDC is an evaluation framework used primarily for assessing the quality of generative models by comparing the “manifold” of real data to that of generated data. PRDC stands for:
 - Precision: This measures the quality of generated samples. It is the proportion of fake samples that are “realistic” in the sense that each lies inside the k-nearest-neighbor ball of at least one real sample.
 - Recall: This measures the diversity of the generated samples. It is the fraction of real samples that lie inside the k-nearest-neighbor ball of at least one fake sample.
 - Density: While precision only records whether a fake sample falls into some real neighborhood, density counts how many real-sample neighborhoods each fake sample falls into on average (normalized by k), which makes it less sensitive to outliers in the real data.
 - Coverage: This is the proportion of real samples whose own k-nearest-neighbor ball contains at least one fake sample. Unlike recall, its neighborhoods are built around real samples only, so it is more robust to outliers in the generated data.

Together, these metrics provide a nuanced view of generative performance: they assess not just whether the generated samples look realistic (precision) but also whether they are diverse enough (recall), how densely they populate the real data regions (density), and whether the generator misses any parts of the real data distribution (coverage).
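The four quantities can be computed from pairwise distances and k-nearest-neighbor radii. The sketch below follows the standard PRDC definitions in plain NumPy; it is an illustration, not the code used by `src/evaluate.py`, and the function names and the default `k=5` are assumptions.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between rows of a (N, d) and b (M, d)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def knn_radius(x, k):
    """Distance from each point to its k-th nearest neighbour within x."""
    d = pairwise_dist(x, x)
    # After sorting, column 0 is the zero self-distance, so column k is the k-th neighbour.
    return np.sort(d, axis=1)[:, k]


def prdc(real, fake, k=5):
    r_real = knn_radius(real, k)            # (N,) radii of real neighbourhoods
    r_fake = knn_radius(fake, k)            # (M,) radii of fake neighbourhoods
    d = pairwise_dist(real, fake)           # (N, M)
    in_real_ball = d < r_real[:, None]      # fake j lies inside the ball of real i
    in_fake_ball = d < r_fake[None, :]      # real i lies inside the ball of fake j
    return {
        "precision": float(in_real_ball.any(axis=0).mean()),
        "recall": float(in_fake_ball.any(axis=1).mean()),
        "density": float(in_real_ball.sum(axis=0).mean() / k),
        "coverage": float((d.min(axis=1) < r_real).mean()),
    }


rng = np.random.default_rng(0)
real = rng.standard_normal((200, 4))
fake = rng.standard_normal((200, 4))
print(prdc(real, fake))
```

Density is the only one of the four that can exceed 1, because a fake sample may fall into more than k real neighborhoods.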

#### Code 12

In [16]:
!PYTHONPATH=$(pwd) python src/evaluate.py diffuser/network=incontext \
checkpoint_path=./humanml3d-data/checkpoints/incontext.ckpt

Seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(limit_predict_batches=1)` was configured so 1 batch will be used.
[2025-02-28 22:11:05,150][OpenGL.acceleratesupport][INFO] - No OpenGL_accelerate module loaded: No module named 'OpenGL_accelerate'
  torch.load(motion_ckpt_path, map_location=self.device)
  torch.load(text_ckpt_path, map_location=self.device)
  self.feat_mean = torch.load(standardization["feat_rifke"]["mean_path"])
  self.feat_std = torch.load(standardization["feat_rifke"]["std_path"])
  self.tmrrifke_mean = torch.load(standardization["tmr_rifke"]["mean_path"])
  self.tmrrifke_std = torch.load(standardization["tmr_rifke"]["std_path"])
  checkpoint = torch.load(config.checkpoint_path, map_location=torch.device("cpu"))
2025-02-28 22:11:08.069505: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin c

In [17]:
!PYTHONPATH=$(pwd) python src/evaluate.py diffuser/network=adaln \
checkpoint_path=./humanml3d-data/checkpoints/adaln.ckpt

Seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(limit_predict_batches=1)` was configured so 1 batch will be used.
[2025-02-28 22:21:14,503][OpenGL.acceleratesupport][INFO] - No OpenGL_accelerate module loaded: No module named 'OpenGL_accelerate'
  torch.load(motion_ckpt_path, map_location=self.device)
  torch.load(text_ckpt_path, map_location=self.device)
  self.feat_mean = torch.load(standardization["feat_rifke"]["mean_path"])
  self.feat_std = torch.load(standardization["feat_rifke"]["std_path"])
  self.tmrrifke_mean = torch.load(standardization["tmr_rifke"]["mean_path"])
  self.tmrrifke_std = torch.load(standardization["tmr_rifke"]["std_path"])
  checkpoint = torch.load(config.checkpoint_path, map_location=torch.device("cpu"))
2025-02-28 22:21:17.810023: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin c

In [18]:
!PYTHONPATH=$(pwd) python src/evaluate.py diffuser/network=cross_attention \
checkpoint_path=./humanml3d-data/checkpoints/cross_attention.ckpt

Seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(limit_predict_batches=1)` was configured so 1 batch will be used.
[2025-02-28 22:29:33,041][OpenGL.acceleratesupport][INFO] - No OpenGL_accelerate module loaded: No module named 'OpenGL_accelerate'
  torch.load(motion_ckpt_path, map_location=self.device)
  torch.load(text_ckpt_path, map_location=self.device)
  self.feat_mean = torch.load(standardization["feat_rifke"]["mean_path"])
  self.feat_std = torch.load(standardization["feat_rifke"]["std_path"])
  self.tmrrifke_mean = torch.load(standardization["tmr_rifke"]["mean_path"])
  self.tmrrifke_std = torch.load(standardization["tmr_rifke"]["std_path"])
  checkpoint = torch.load(config.checkpoint_path, map_location=torch.device("cpu"))
2025-02-28 22:29:36.654673: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin c

#### Question 10:
*Answer:*

From these metrics, there are clear differences in performance among the different architectures. 

1. FDTMR:
   - adaln has the lowest FD (approximately 55.91), indicating that its generated samples are closer to the real data distribution.
   - cross_attention follows with an FD of about 90.80.
   - incontext has the highest FD (around 137.85), showing that its generated results deviate more significantly from the real data.

2. tmr_score:
   - cross_attention has the highest tmr_score (approximately 0.6463), followed by adaln (around 0.6197) and incontext (approximately 0.5850). The differences in tmr_score alone are small, so these results should be read alongside the other metrics.

3. Precision, Recall, Density, and Coverage:
   - adaln performs best across these metrics, with a precision of 0.9859, recall of 0.8875, density of 1.1328, and coverage of 0.9469. This indicates that its generated samples are not only of high quality but also cover the real data distribution comprehensively.
   - cross_attention comes next in most indicators. Although it shows certain advantages in some sub-metrics (such as R1 and R3), overall its performance does not surpass that of adaln.
   - incontext shows relatively lower values across these metrics, suggesting that its sample quality and diversity are insufficient.

4. R1, R2, R3:
   - These metrics further support the conclusions above. While cross_attention has a slight advantage in R1 and R3, overall adaln’s superiority in FD, recall, and coverage is more pronounced, indicating that its generative model is more effective in capturing the real data distribution.

Considering all these metrics, the adaln architecture performs best in terms of both the authenticity and diversity of the generated samples. Although cross_attention has slight advantages in certain sub-metrics, its overall performance is still inferior to that of adaln. The incontext architecture lags behind in all aspects, suggesting a larger gap between its generated results and the real data, and potentially indicating issues such as mode collapse.

#### Bonus 2:
*Answer:*

Evaluating on only 10×64 samples (640 samples total) can lead to high variance and less reliable estimates for these metrics. For example, the FDTMR score is sensitive to sample size because it estimates the mean and covariance of feature distributions. With a small sample, these estimates can be noisy, causing the metric to fluctuate more than it would with thousands of samples. Similarly, the PRDC metrics (precision, recall, density, and coverage) can also suffer from insufficient samples. With fewer points, the local neighborhoods are not as well sampled, which may lead to unreliable approximations of how well the fake data covers the real data distribution.

What should be done:
 - Increase the sample size for evaluation to get a more robust and stable estimate of each metric.
 - Consider running multiple evaluations to quantify the variance of these metrics.
 - Complement these quantitative metrics with qualitative evaluations to get a more holistic view of model performance.
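The effect of sample size can be illustrated with a toy experiment, sketched below with the standard library only. It draws both the "reference" and "generated" samples from the same 1-D Gaussian, so the true Fréchet distance is zero, and repeats the evaluation several times. This is a synthetic illustration, not a measurement on the lab's models; the function names and sample sizes are assumptions.

```python
import random
import statistics


def frechet_1d(a, b):
    """Fréchet distance between 1-D Gaussians fitted to samples a and b."""
    mean_term = (statistics.fmean(a) - statistics.fmean(b)) ** 2
    std_term = (statistics.stdev(a) - statistics.stdev(b)) ** 2
    return mean_term + std_term


def repeated_eval(n, repeats=30, seed=0):
    """Mean and standard deviation of the estimated distance over repeated runs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        ref = [rng.gauss(0.0, 1.0) for _ in range(n)]
        gen = [rng.gauss(0.0, 1.0) for _ in range(n)]  # same distribution as ref
        scores.append(frechet_1d(ref, gen))
    return statistics.fmean(scores), statistics.stdev(scores)


for n in (64, 640, 3200):
    mean, std = repeated_eval(n)
    print(f"n={n}: mean={mean:.5f}, std={std:.5f}")
```

Even though the two distributions are identical, the estimated distance is positive and both its mean and its spread shrink as the sample size grows, which is why small evaluation sets give biased and noisy scores.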