In [None]:
python feature_extraction.py --split train --type video --data-dir ../dataset/data --save-dir ../dataset/data
python feature_extraction.py --split train --type audio --data-dir ../dataset/data --save-dir ../dataset/data

(react) [s5727214@w11907 val]$cd Video_features/NoXI/
(react) [s5727214@w11907 NoXI]$find . -mindepth 2 -type f -name '*.pth' | while read f; do   target="$(dirname "$(dirname "$f")")/$(basename "$f")";   mv "$f" "$target"; done
(react) [s5727214@w11907 NoXI]$find . -type d -empty -delete
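
The two `find` commands above move every `.pth` file up one directory level and then delete the directories left empty. A rough Python equivalent is sketched below. The function name `flatten_one_level` is hypothetical, and name collisions are not handled, same as with `mv`.

```python
from pathlib import Path


def flatten_one_level(root, pattern="*.pth"):
    """Move each matching file one directory up, then prune empty directories."""
    root = Path(root)
    # Collect first so that moving files does not disturb the directory walk.
    # Only files at depth >= 2 below root are touched (like `-mindepth 2`).
    files = [
        f for f in root.rglob(pattern)
        if f.is_file() and len(f.relative_to(root).parts) >= 2
    ]
    for f in files:
        f.rename(f.parent.parent / f.name)
    # Remove empty directories, deepest first, so emptied parents go too.
    dirs = sorted(
        (p for p in root.rglob("*") if p.is_dir()),
        key=lambda p: len(p.parts),
        reverse=True,
    )
    for d in dirs:
        if not any(d.iterdir()):
            d.rmdir()


# Example (run from the directory that holds the nested .pth files):
# flatten_one_level(".")
```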


---

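## 1. `train.py`: entry script

1. **Argument parsing**

   * Parses the data path (`--data-dir`), the log path, and the training settings (batch size, learning rate, number of epochs, etc.)
   * Supports a `--test` flag that switches to test mode

2. **Model assembly**

   * Three components: **CognitiveProcessor** (cognitive module), **PercepProcessor** (perceptual fusion module), and **LipschitzGraph** (the graph that generates the reactions)
   * These three initialise an **MHP** (Multi-Head Processing) main model, which is then moved to the GPU

3. **Data preparation**

   * Reads `train.csv`
   * Takes the "speaker" and "listener" paths from columns 2 and 3 and builds two lists
   * Loads `neighbour_emotion_train.npy` (the neighbour emotion label matrix)
   * Wraps everything in `ActionData` as a PyTorch Dataset and hands it to a DataLoader

4. **Training details**

   * Pairs Adam with a WarmupMultiStepLR scheduler
   * The main loop runs for at most 100 epochs (hard-coded in the script); each epoch adjusts the lr and then calls `Trainer.train`
   * Saves a model checkpoint every 5 epochs

5. **Test branch**

   * If `--test` is passed, a similar flow is used:

     * Loads `test.csv` and `neighbour_emotion_test.npy`
     * Builds `ActionData` + DataLoader (the batch size is set so that one batch covers all sliding windows)
     * Calls `Trainer.test`, which writes the predictions as `.pth` files under `outputs/`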
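
To make the "Adam + WarmupMultiStepLR" step above more concrete, here is a minimal sketch of how a warm-up multi-step schedule is commonly defined: a linear ramp during warm-up, then a multiplicative drop at each milestone. The milestones, `gamma`, the warm-up length, and the base learning rate are all assumed values for illustration, not the ones used in `train.py`.

```python
from bisect import bisect_right


def warmup_multistep_factor(epoch, milestones=(40, 70), gamma=0.1,
                            warmup_factor=0.01, warmup_epochs=10):
    """Multiplier applied to the base learning rate at a given epoch."""
    if epoch < warmup_epochs:
        # Linear ramp from warmup_factor up to 1.0 over the warm-up epochs.
        alpha = epoch / warmup_epochs
        warmup = warmup_factor * (1 - alpha) + alpha
    else:
        warmup = 1.0
    # Every milestone already reached multiplies the rate by gamma.
    decay = gamma ** bisect_right(sorted(milestones), epoch)
    return warmup * decay


base_lr = 1e-3  # hypothetical base learning rate
for epoch in (0, 5, 10, 40, 70):
    print(epoch, base_lr * warmup_multistep_factor(epoch))
```

With PyTorch, a function like this could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which is one way to get the same behaviour without a custom scheduler class.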

---

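## 2. `trainers.py`: the Trainer class

1. **Loss & setup**

   * Selects the loss (MSE, Distribution, AllThreMseLoss, ...) according to `--loss-name` and `--neighbor-pattern` (nearest, pair, all)
   * Receives the neighbour list `neighbors` (speaker/listener paths plus the npy labels)

2. **`random_select`**

   * Given a sample ("dtype+site+group+pid+clip+idx"), looks up in the neighbour matrix which listeners count as "appropriate"
   * Randomly draws up to 10 neighbours for training; at test time all of them are used, in order

3. **`load_npy`**

   * Builds the path of the neighbour's emotion CSV from the neighbour path (the .csv holds the listener's facial emotion time series)
   * Reads it with `pd.read_csv`, trims it to the same number of frames, and moves it to the GPU

4. **`_parse_data`**

   * Receives one batch from the DataLoader: video features, audio features, targets
   * Moves video/audio to the GPU; the mode ("nearest"/"all") decides between pairwise and fully connected neighbours
   * Returns the 4-tuple `(v_inputs, a_inputs, all_neighbors, lengths)` for `train`/`test`

5. **`train`**

   * A standard training loop:

     * Get inputs + neighbours
     * `model(v_inputs, a_inputs, targets, lengths)` → speaker\_features, listener\_features, …, loss\_det
     * Compute the DTW/MSE-style losses, backpropagate, step the optimizer
     * Log time, data, and loss
     * Break at the `train_iters`-th mini-batch (the script caps how many batches each epoch uses)

6. **`test`**

   * Switches the model to eval and loops over the batches
   * Calls `model.inverse(...)` to generate, then runs `combine_preds` to stitch the output back into the full 750-frame result
   * Saves `result-0.pth`…`result-9.pth` (SAMPLE\_NUMS=10)

7. **Helper functions**

   * `combine_preds`: overlap-adds the sliding-window outputs back to the full video length
   * `modify_outputs`: optionally clamps the first 15 frames to 0/1, for debugging or ablation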
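
As an illustration of the overlap-add idea behind `combine_preds`, the sketch below averages overlapping sliding windows back into one sequence. It is not the repository's implementation: the name `overlap_add`, the uniform averaging, and the fixed stride are assumptions, and plain lists stand in for tensors.

```python
def overlap_add(windows, stride, total_len):
    """Merge sliding-window predictions by averaging wherever windows overlap.

    windows: list of windows; each window is a list of frames,
             each frame a list of floats of the same length.
    """
    dim = len(windows[0][0])
    sums = [[0.0] * dim for _ in range(total_len)]
    counts = [0] * total_len
    for w, window in enumerate(windows):
        start = w * stride
        for offset, frame in enumerate(window):
            t = start + offset
            if t >= total_len:  # drop frames past the end of the clip
                break
            counts[t] += 1
            for d, value in enumerate(frame):
                sums[t][d] += value
    # Frames covered by no window keep a value of 0.0.
    return [[s / c if c else 0.0 for s in row] for row, c in zip(sums, counts)]


# Two 4-frame windows with stride 2 covering a 6-frame clip.
windows = [[[1.0]] * 4, [[3.0]] * 4]
print(overlap_add(windows, stride=2, total_len=6))
```

In the two-window example, frames 2 and 3 are covered by both windows and therefore come out as the average of 1.0 and 3.0.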

---

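### 🚀 Quick summary

* **`train.py`**: read args → build the model → prepare the data and the neighbour matrix → `Trainer.train` / `Trainer.test` → save the model
* **`trainers.py`**: implements the core steps: choosing neighbours for each sample, loading the emotion labels, the forward/backward pass, the losses, and test-time generation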
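
The neighbour selection described for `random_select` can be sketched as follows. This is an assumption-laden illustration: `select_neighbours` is a hypothetical name, and it assumes the neighbour matrix row for a sample holds 1 for every appropriate listener and 0 otherwise.

```python
import random

MAX_NEIGHBOURS = 10  # cap on sampled neighbours during training


def select_neighbours(row, listener_paths, training, rng=random):
    """Return the listener paths flagged as appropriate in one matrix row.

    Training: a random subset of at most MAX_NEIGHBOURS.
    Testing:  every appropriate listener, in the original order.
    """
    candidates = [p for flag, p in zip(row, listener_paths) if flag == 1]
    if training and len(candidates) > MAX_NEIGHBOURS:
        return rng.sample(candidates, MAX_NEIGHBOURS)
    return candidates


paths = [f"listener_{i}" for i in range(15)]  # hypothetical paths
row = [1] * 12 + [0] * 3                      # 12 appropriate listeners
print(len(select_neighbours(row, paths, training=True)))
print(len(select_neighbours(row, paths, training=False)))
```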


python feature_extraction.py --split train --type video --data-dir <data-dir> --save-dir <data-dir>
python feature_extraction.py --split train --type audio --data-dir <data-dir> --save-dir <data-dir>
python feature_extraction.py --split test --type video --data-dir ../dataset/data --save-dir ../dataset/data
python feature_extraction.py --split test --type audio --data-dir ../dataset/data  --save-dir ../dataset/data 

In [None]:
python evaluation.py --data-dir ../data/react_clean --pred-dir ../data/react_clean/outputs/results --split test

-----------------Evaluating Metric-----------------  
Metric: | FRC: 0.19604 | FRD: 84.12176 | S-MSE: 0.00072 | FRVar: 0.00602 | FRDvs: 0.03359 | TLCC: 41.45533  
Latex-friendly --> model_name & 0.20 & 84.12 & 0.0007 & 0.0060 & 0.0336 & - & 41.46 \\


# REGNN Evaluation Results Analysis
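The dataset's fixed-length temporal segmentation introduces artificial boundaries that may not line up with natural conversational units. The 30-second clips used in REACT 2023 and 2024 may cut natural reaction sequences short or merge unrelated interaction segments, which affects both model learning and evaluation.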
## Key Findings

The REGNN evaluation results reveal a **"conservative yet precise"** generation pattern with distinct strengths and weaknesses. The model demonstrates exceptional accuracy and synchronization capabilities while significantly underperforming in diversity metrics.

## Performance Analysis

### **Outstanding Performance**
**S-MSE: 0.00072** indicates extremely high accuracy, with generated reactions closely matching ground truth patterns within local time windows. This excellence stems from REGNN's reversible graph neural network architecture, which ensures generated reactions strictly adhere to learned distributions from real facial behavior. **TLCC: 41.46** demonstrates superior temporal synchronization, reflecting the model's ability to maintain appropriate conversational timing through explicit graph-based modeling of facial feature dependencies and multi-dimensional edge feature learning.

### **Critical Weaknesses**
**FRD: 84.12** reveals severely limited diversity, with multiple predictions from identical inputs showing minimal variation. This stems from overly conservative sampling strategies where the reversible constraints prioritize generation quality at the expense of exploration. **FRVar: 0.00602** confirms low temporal variability within generated sequences, suggesting the model learns overly concentrated distributions with insufficient randomness injection. **FRDvs: 0.03359** indicates poor differentiation between listener reactions and speaker features, potentially reflecting excessive mimicry patterns rather than appropriate responsive behavior.

## Core Issues

REGNN exhibits a **quality-diversity trade-off imbalance**, prioritizing safety over natural human behavioral richness. The reversible architecture creates a double-edged effect: while ensuring anatomically plausible reactions on the real behavior manifold, it overly constrains the exploration space. The explicit graph structure, though beneficial for maintaining facial feature relationships, appears too rigid for generating the natural variability observed in human conversational reactions.

## Implications

Compared to other methods, REGNN achieves superior accuracy metrics but significantly underperforms in diversity measures. This suggests the model successfully addresses the technical challenge of appropriate reaction generation but fails to capture the inherent richness of human behavioral responses. Future improvements should focus on enhancing sampling diversity through relaxed constraints, dynamic graph structures, and explicit diversity regularization while preserving the model's excellent accuracy and synchronization capabilities.

# REGNN Output -- Composition and Significance of 25-Dimensional Facial Expression Vectors

The 25-dimensional facial expression vector provides a comprehensive and scientifically grounded representation of human facial behavior by decomposing complex expressions into quantifiable components. **The first 15 dimensions correspond to Action Units (AUs)** based on the Facial Action Coding System (FACS) developed by Paul Ekman and colleagues. Each AU represents specific facial muscle movements with standardized intensity scales, such as AU12 for lip corner pulling (smiling) and AU4 for brow lowering (frowning). These AUs offer anatomical precision and cross-cultural consistency, enabling objective measurement of facial muscle activations that underlie all human expressions.

**The subsequent 8 dimensions encode facial expression probabilities** for the basic emotions: neutral, happiness, sadness, surprise, fear, disgust, anger, and contempt. These categories are rooted in evolutionary psychology and cross-cultural emotion research, representing universally recognized emotional states. Each dimension provides a confidence score (0-1) indicating the likelihood of that particular expression, allowing for nuanced representation of mixed emotions and transitional states between expressions.

**The final 2 dimensions capture emotional valence and arousal**, based on the circumplex model of affect from emotion psychology. Valence measures the pleasantness-unpleasantness continuum, while arousal quantifies the activation-deactivation dimension. This two-dimensional emotional space provides a continuous representation that complements the discrete expression categories, offering a more complete characterization of affective states that aligns with physiological and psychological research on human emotion.

Together, these 25 dimensions create a structured mathematical space where facial expressions can be objectively measured, compared, and analyzed using standard statistical methods, transforming subjective facial behavior into quantifiable data suitable for computational modeling and evaluation.
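
The 15 + 8 + 2 layout described above can be made explicit with a small helper that splits one frame into its three parts. This is a sketch: `split_frame` is a hypothetical name, and the order of the eight expression classes is assumed to follow the order in which they are listed in the text.

```python
AU_DIMS, EXPR_DIMS, VA_DIMS = 15, 8, 2

# Assumed order, taken from the listing in the paragraph above.
EXPRESSIONS = ["neutral", "happiness", "sadness", "surprise",
               "fear", "disgust", "anger", "contempt"]


def split_frame(frame):
    """Split one 25-dim frame into AUs, expression scores, valence/arousal."""
    if len(frame) != AU_DIMS + EXPR_DIMS + VA_DIMS:
        raise ValueError(f"expected 25 values, got {len(frame)}")
    aus = list(frame[:AU_DIMS])
    expressions = dict(zip(EXPRESSIONS, frame[AU_DIMS:AU_DIMS + EXPR_DIMS]))
    valence, arousal = frame[AU_DIMS + EXPR_DIMS:]
    return {"aus": aus, "expressions": expressions,
            "valence": valence, "arousal": arousal}


frame = [float(i) for i in range(25)]  # dummy frame, not real model output
parts = split_frame(frame)
print(len(parts["aus"]), parts["expressions"]["contempt"], parts["arousal"])
```

For a `[750, 25]` prediction tensor, the same slicing applies along the last axis, e.g. `data[:, :15]`, `data[:, 15:23]`, and `data[:, 23:]`.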

In [4]:
import torch
import numpy as np

# Change this to the path of your .pth file
#pth_path = '../data/react_clean/outputs/results/test/NoXI/002_2016-03-17_Paris/Expert_video/5/result-0.pth'
pth_path = '../data/react_clean/outputs/results/test/RECOLA/group-3/P45/1/result-0.pth'
# Load the .pth file
data = torch.load(pth_path, map_location='cpu')

# Print the type of the loaded object
print(f"Loaded data type: {type(data)}")

if isinstance(data, torch.Tensor):
    # For a Tensor, print shape, dtype, and a few statistics
    print(f"Shape: {data.shape}")
    print(f"Dtype: {data.dtype}")
    print(f"Min value: {data.min().item()}")
    print(f"Max value: {data.max().item()}")
    print(f"Mean (per-dimension, first 5 dims): {data.mean(dim=0)[:5].tolist()}")
    print(f"Std  (per-dimension, first 5 dims): {data.std(dim=0)[:5].tolist()}")
elif isinstance(data, dict):
    # For a dict, list the keys and the type/shape of each value
    print(f"Keys in dict: {list(data.keys())}")
    for k, v in data.items():
        if isinstance(v, torch.Tensor):
            print(f"  {k}: Tensor with shape {v.shape}, dtype {v.dtype}")
        else:
            print(f"  {k}: {type(v)}")
else:
    # Any other type
    print("Data loaded is neither Tensor nor dict. Here is a repr:")
    print(data)


Loaded data type: <class 'torch.Tensor'>
Shape: torch.Size([750, 25])
Dtype: torch.float32
Min value: -0.3284480571746826
Max value: 1.1664934158325195
Mean (per-dimension, first 5 dims): [0.3283752202987671, 0.22007928788661957, 0.5471466779708862, 0.8927489519119263, 0.9049168229103088]
Std  (per-dimension, first 5 dims): [0.10018729418516159, 0.07739455997943878, 0.1337902992963791, 0.06093745678663254, 0.02995392307639122]




# Upgrade to Wav2Vec 2.0  
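Adjust the perceptual fusion module (perceptual.py)
The original cross-modal input projection expects audio features that are typically 128-dimensional VGGish embeddings. With 768-dimensional wav2vec features, the first layer's Conv1d(in_channels=*, out=...) or the linear layer's in_dim has to be changed to 768. For example, in PerceptualProcessor: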

class PerceptualProcessor(nn.Module):
    def __init__(self, video_dim=768, audio_dim=768, fuse_dim=64, …):
        super().__init__()
        self.audio_proj = nn.Conv1d(audio_dim, fuse_dim, kernel_size=1)
        self.video_proj = nn.Conv1d(video_dim, fuse_dim, kernel_size=1)

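I added the following to PercepProcessor.forward:  
Automatic temporal alignment: adaptive_avg_pool1d downsamples audio features of arbitrary length T_a to the video feature length T_v.  
Non-intrusive: if T_a == T_v the features are used as-is, so both the original VGGish input and the newly extracted wav2vec features work.  
MULTModel call: the inputs keep the (1, B, T, D) format when passed to the multimodal fusion.  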

Fuse video and audio features with temporal alignment.  
video_inputs: (B, T_v, C, H, W) or precomputed (B, T_v, D_v)  
audio_inputs: (B, T_a, path) or precomputed (B, T_a, D_a)  
returns: (B, T_out, fused_dim)  
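
The temporal alignment step can be illustrated without any framework. The sketch below averages audio frames into `T_v` bins using a floor/ceil binning rule, which to my understanding is also how adaptive average pooling picks its windows; treat that equivalence as an assumption. The function names are hypothetical and plain lists stand in for tensors.

```python
def adaptive_avg_pool_time(frames, out_len):
    """Average a list of feature frames down to out_len frames."""
    in_len = len(frames)
    dim = len(frames[0])
    pooled = []
    for i in range(out_len):
        start = (i * in_len) // out_len                    # floor
        end = ((i + 1) * in_len + out_len - 1) // out_len  # ceil
        chunk = frames[start:end]
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


def align_audio_to_video(audio, video_len):
    """Leave audio untouched when lengths match, otherwise pool to video_len."""
    if len(audio) == video_len:
        return audio
    return adaptive_avg_pool_time(audio, video_len)


audio = [[float(t)] for t in range(6)]  # T_a = 6, D_a = 1 (dummy values)
print(align_audio_to_video(audio, 3))   # T_v = 3
```

When using `torch.nn.functional.adaptive_avg_pool1d` on a `(B, T_a, D_a)` tensor, note that it pools over the last axis, so the time axis has to be moved there first (for example with `transpose(1, 2)`) and moved back afterwards.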

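1. Feature-level augmentation ("Feature Aug") in datasets.py
The video_features are now read directly as features computed offline (Swin outputs), so online image transforms on the raw frames are not possible. Instead, after loading the features we can add a little Gaussian noise or random dropout to approximate the effect of augmentation:
        # ===== Feature-level augmentation: random noise & dropout on video_features =====
        # v_inputs: Tensor [num_frames, feat_dim]
        if self.aug:  # pass augmentation=True only for training
            # Gaussian Noise
            noise = torch.randn_like(v_inputs) * 0.02  # standard deviation is tunable
            v_inputs = v_inputs + noise
            # Feature Dropout
            v_inputs = F.dropout(v_inputs, p=0.1, training=True)
        a_inputs = self.load_audio_pth(dtype_site_group_pid_clip_idx)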
2. Add a temporal smoothness loss in trainers.py
In the inner loop of Trainer.train(), find where the total loss is computed and insert the `# -- add smooth loss --` block below:
 
-            loss = loss_dtw + (loss_det if self.cal_logdets else 0.) + loss_mid
+            # -- add smooth loss -- #
+            # listener_features: [B, num_windows, num_frames, feat_dim] or [B, T, D]
+            # If it is 4-D, fold the window axis into the batch axis first
+            lf = listener_features
+            if lf.dim() == 4:
+                # e.g. [B, W, T, D] -> [B*W, T, D]
+                # (named T rather than F in case torch.nn.functional is imported as F)
+                B, W, T, D = lf.shape
+                lf = lf.reshape(B * W, T, D)
+            # Mean absolute difference between adjacent frames
+            smooth_loss = torch.mean(torch.abs(lf[:, 1:, :] - lf[:, :-1, :]))
+            lambda_smooth = 0.1  # can be tuned on the validation set
+
+            loss = loss_dtw \
+                 + (loss_det if self.cal_logdets else 0.) \
+                 + loss_mid \
+                 + lambda_smooth * smooth_loss
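
Written out, the smoothness term and the new total loss in the diff above correspond to the following, where $x_{n,t,d}$ denotes entry $d$ of frame $t$ in sequence $n$ of `listener_features` after any reshape, $N$ is the number of sequences, and the indicator mirrors the `self.cal_logdets` switch. The symbols are introduced here for illustration only.

```latex
\mathcal{L}_{\text{smooth}}
  = \frac{1}{N\,(T-1)\,D}
    \sum_{n=1}^{N} \sum_{t=1}^{T-1} \sum_{d=1}^{D}
    \left| x_{n,t+1,d} - x_{n,t,d} \right|

\mathcal{L}
  = \mathcal{L}_{\text{dtw}}
  + \mathbb{1}[\texttt{cal\_logdets}]\;\mathcal{L}_{\text{det}}
  + \mathcal{L}_{\text{mid}}
  + \lambda_{\text{smooth}}\,\mathcal{L}_{\text{smooth}},
  \qquad \lambda_{\text{smooth}} = 0.1
```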

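3. MULTModel.py (cross-modal fusion network)
Input / projection dimensions


- self.orig_d_a, self.orig_d_v = 768, 768
+ self.orig_d_a, self.orig_d_v = 768, 768   # keep wav2vec at 768
- self.d_a, self.d_v = 128, 128
+ self.d_a, self.d_v = 256, 256   # project to a higher dimension to give the Transformer more capacity
Number of attention heads & layers


- self.num_heads = 4
+ self.num_heads = 8       # more heads to capture richer interactions

- self.layers = 5
+ self.layers = 6          # one more cross-modal attention layer
Dropout rates


- self.attn_dropout = 0.1
+ self.attn_dropout = 0.2  # stronger regularisation against overfitting

- self.relu_dropout = 0.1
+ self.relu_dropout = 0.2
Output dimension


- output_dim = 64
+ output_dim = 128        # double the fused feature size to give the downstream cognitive network more to work with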

4. Reversible GNN in the Motor Processor (LipschitzGraph.py)
num_features: 50 → 64

5. trainers.py
Log per-epoch metrics with tensorboardX.SummaryWriter:
        # log epoch metrics
        self.writer.add_scalar('train/loss_dtw', losses_dtw.avg, epoch)
        self.writer.add_scalar('train/loss_mid', losses_mid.avg, epoch)
        self.writer.add_scalar('train/loss_det', losses_det.avg, epoch)
        self.writer.add_scalar('train/smooth_loss', smooth_loss.item(), epoch)

-----------------Evaluating Metric-----------------
Metric: | FRC: 0.0795 | FRD: 137.79676 | S-MSE: 0.00345 | FRVar: 0.02144 | FRDvs: 0.02151 | TLCC: 44.28660
Latex-friendly --> model_name & 0.01 & 375.80 & 0.0035 & 0.0214 & 0.0215 & - & 44.29 \\