Source code for my master thesis "2D-MapFormer: 2D-Map Transformer for Audio-Visual Scene-Aware Dialogue and Reasoning" (currently unpublished).
The Source Code is derived from
- Requirements
  - conda
  - wandb
- Environment setup
  - Run `. ./setup.sh`.
- Download I3D and VGGish pretrained features
  - Run `. ./download_data.sh`.
  - Run `python3 utils/combine_files.py` to combine the feature files into `./data/features/train.pkl` and `./data/features/test.pkl`.
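The combine step merges many per-video feature files into one pickle per split. A minimal sketch of what such a merge might look like (hypothetical — the real `utils/combine_files.py` may differ, and the layout assumed here, one pickle file per video merged into a dict keyed by video id, is an assumption):

```python
import os
import pickle

def combine_files(feature_dir: str, out_path: str) -> None:
    """Merge every per-video pickle in feature_dir into one dict keyed by video id."""
    combined = {}
    for name in sorted(os.listdir(feature_dir)):
        if not name.endswith(".pkl"):
            continue
        video_id = os.path.splitext(name)[0]  # e.g. "vid_a.pkl" -> "vid_a"
        with open(os.path.join(feature_dir, name), "rb") as f:
            combined[video_id] = pickle.load(f)
    with open(out_path, "wb") as f:
        pickle.dump(combined, f)
```

Loading the resulting `train.pkl` then gives one dict lookup per video instead of one file open per video.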
- Train model
  - Specify the `exp_name` in `run.sh`. The trained model and model outputs will be stored in `./log/{exp_name}/`. It will also be the experiment name on wandb.
  - Specify `procedure='train_test'`.
  - Specify other hyperparameters. Please see `run.sh` and `main.py` for more details.
  - Run `. ./run.sh`. It will run training and testing automatically.
- You will see the following progress in the command line:

      train 15, tan:0.125, dig:2.272: 100%|█████| 4787/4787 [21:15<00:00, 3.75it/s]
      train 15, tan:0.112, dig:2.153
      val 15, tan:0.087, dig:1.985: 100%|█████| 1117/1117 [06:12<00:00, 3.00it/s]
      val 15, tan:0.109, dig:2.295
      The best metric was for 0 epochs. Expected early stop @ 19
      train 16, tan:0.094, dig:2.097: 100%|█████| 4787/4787 [21:10<00:00, 3.77it/s]
      train 16, tan:0.112, dig:2.136
      val 16, tan:0.088, dig:2.005: 100%|█████| 1117/1117 [06:11<00:00, 3.01it/s]
      val 16, tan:0.109, dig:2.298
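The "Expected early stop @ 19" line suggests patience-based early stopping on the validation metric. A minimal sketch of that logic (hypothetical names and a patience of 3 epochs are assumptions; the actual implementation lives in `main.py`):

```python
class EarlyStopper:
    """Stop training once the validation metric has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best = float("inf")          # best (lowest) validation metric seen so far
        self.epochs_since_best = 0        # epochs elapsed without improvement

    def step(self, val_metric: float) -> bool:
        """Record this epoch's metric; return True when training should stop."""
        if val_metric < self.best:
            self.best = val_metric
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
        return self.epochs_since_best >= self.patience
```

With this scheme, a best epoch at 16 and a patience of 3 would predict a stop at epoch 19, matching the log line above.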
- You will see the following test result in the command line:

      DSTC10_beam_search result: | Bleu_1: 68.7000 | Bleu_2: 55.5832 | Bleu_3: 45.4938 | Bleu_4: 37.5887 | METEOR: 24.3038 | ROUGE_L: 53.4955 | CIDEr: 86.9928 | IoU-1: 54.7007 | IoU-2: 57.6148
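The IoU scores measure temporal overlap between predicted and ground-truth reasoning spans. A minimal sketch of interval IoU for a single span pair (an illustration only; the DSTC10 evaluation script may aggregate over multiple spans differently):

```python
def interval_iou(pred: tuple, gold: tuple) -> float:
    """IoU of two time spans given as (start, end) pairs in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0
```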
- Model Overview (figures: Audio Visual Encoder | Sentence Cross Attention | Update Gate)