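# 1. Preprocessing: extracting video features

## Algorithm overview

Extracting video features is the first step of video captioning. It consists of two steps:

1. Use ffmpeg to extract frame images from each video.

2. Sample the frames according to the configured number of frames, then use a pretrained CNN model to extract a feature vector from each sampled image.

## Usage

`extract_feats(params, model, load_image_fn)`

## Parameters

dst: folder where the extracted frame images are saved

output_dir: folder where the video feature `.npy` files are saved

video_path: path of the folder containing the videos

model: CNN model used for feature extraction; resnet152 is used in this example

n_frame_steps: number of frames sampled from each video

## Example

This experiment uses a pretrained ResNet152 network to extract image features from the video frames.

The network is trained on ImageNet and has 152 layers; for each input image it outputs a 2048-dimensional feature vector.

On this basis, 40 frames are selected from each video and fed to the network one by one to obtain their image features, so each video ends up with a feature tensor of shape [40, 2048].

To save time and keep the demonstration simple, features are extracted for only 70 videos.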

In [3]:
# Import the required modules
import shutil
import subprocess
import glob
from tqdm import tqdm
import numpy as np
import os
import argparse

import torch
from torch import nn
import torch.nn.functional as F
import pretrainedmodels
from pretrainedmodels import utils

In [4]:
C, H, W = 3, 224, 224 # Dimensions of each video frame expected by the feature extractor; ResNet152 is used, so C, H, W are set to 3, 224, 224

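### Extracting frames from a video

Given an input video, extract its frames; the frame images are stored temporarily in the `dst` folder.

### **Exercise 1: complete the frame-extraction code**  
(1) Required behavior:  
> - Check whether the `dst` folder exists. If it does, delete all files in it; otherwise, create the `dst` folder  
> - Extract the frames of the input video `video` and save the images in the `dst` folder  
    
(2) Hint: use ffmpeg to extract the images from the video.
> Running the following command on the command line extracts the frames of a video:
> > `ffmpeg -y -i ${video path} -vf scale=400:300 -qscale:v 2 ${output image path and naming pattern}`  

> Example:  
    > >`ffmpeg -y -i test.mp4 -vf scale=400:300 -qscale:v 2 frames/%06d.jpg`  
> Running this command extracts the frames of `test.mp4` and saves them in the `frames` folder, named with a 6-digit zero-padded number plus the extension (e.g. `000001.jpg`, `000002.jpg`; ffmpeg numbers image sequences from 1 by default)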
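
The hint above can be turned into Python roughly as follows. This is only a sketch of one possible approach, not the reference solution: the helper names `build_ffmpeg_command`, `reset_folder` and `run_ffmpeg` are hypothetical, and `run_ffmpeg` assumes that ffmpeg is installed and available on the PATH.

```python
import os
import shutil
import subprocess


def build_ffmpeg_command(video, dst):
    """Return the ffmpeg argument list that writes the frames of `video` into `dst`."""
    return [
        "ffmpeg",
        "-y",                      # overwrite existing output files
        "-i", video,               # input video
        "-vf", "scale=400:300",    # resize every frame to 400x300
        "-qscale:v", "2",          # JPEG quality (lower values mean higher quality)
        os.path.join(dst, "%06d.jpg"),
    ]


def reset_folder(dst):
    """Empty `dst` if it already exists, otherwise create it."""
    if os.path.exists(dst):
        shutil.rmtree(dst)
    os.makedirs(dst)


def run_ffmpeg(video, dst):
    """Hypothetical helper: prepare `dst`, then run ffmpeg with its log output discarded."""
    reset_folder(dst)
    subprocess.call(build_ffmpeg_command(video, dst),
                    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
```

Passing the command as a list rather than a single string avoids shell quoting problems when a path contains spaces.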

In [6]:
def extract_frames(video, dst):
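    '''
    :param video: path of the video
    :param dst: folder in which the output frame images are stored
    :return: None
    '''
    ########## Complete the code here ##########

+ Function test   
If `extract_frames` is written correctly, then after running the test below, the file "测试文件.txt" in the "test-folder/题1/frames" folder will have been deleted, and the frames extracted from the video "test-folder/题1/test.mp4" will appear in that folder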

In [8]:
extract_frames('test-folder/题1/test.mp4', 'test-folder/题1/frames')

 cleanup: test-folder/题1/frames/


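### Extracting features

Use resnet152 to extract the video features. The resulting `.npy` feature files are saved in `output/train-video`, and each video yields a feature tensor of shape `[40, 2048]`

### **Exercise 2: complete the video feature extraction code**  
+ The code must do the following:  
    - 1. Call `extract_frames` to extract the frames of `video` and save the images in the `dst` directory;
    - 2. Sample the extracted images uniformly in name order. Exactly `params['n_frame_steps']` images must be sampled and stored in `image_list`; 
    > - **Hints**       
    > (1) **Sampling method**: use `np.linspace()`. For example, to sample 7 images out of 10 images numbered 0-9, `np.linspace(0, 9, 7)` gives `[0, 1.5, 3, 4.5, 6, 7.5, 9]`; rounding with `np.round()` (which rounds halves to the nearest even integer) then gives `[0, 2, 3, 4, 6, 8, 9]`    
    > (2) **Storage format**: for easy loading, `image_list` can hold the paths of the sampled images. For example, if `dst` contains 10 images numbered `0-9`, then after the sampling in (1) `image_list` should contain `["dst/0.jpg", "dst/2.jpg", "dst/3.jpg", "dst/4.jpg", "dst/6.jpg", "dst/8.jpg", "dst/9.jpg"]`  
    - 3. Use `load_image_fn()` to convert every image in `image_list` to a `tensor` and store the result in `images`. After this step, `images` should have shape `(len(image_list), C, H, W)`;
    >  - **How to use `load_image_fn()`**  
    >```python
    img = 'dst/0.jpg'
    image = load_image_fn(img) # load_image_fn converts the image into a tensor of shape (C, H, W); C, H and W were defined earlier
    ```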
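
The uniform sampling described in the hints can be sketched as below. This is an illustration under simple assumptions rather than the reference solution: `sample_indices` is a hypothetical helper, and `frame_names` stands in for the sorted list of frame files found in `dst`.

```python
import numpy as np


def sample_indices(n_images, n_samples):
    """Pick n_samples evenly spaced indices from range(n_images)."""
    positions = np.linspace(0, n_images - 1, n_samples)
    # np.round rounds halves to the nearest even integer (4.5 -> 4, 7.5 -> 8)
    return np.round(positions).astype(int).tolist()


frame_names = ["%d.jpg" % i for i in range(10)]   # stand-in for the sorted frame files
indices = sample_indices(len(frame_names), 7)
image_list = ["dst/" + frame_names[i] for i in indices]
print(indices)      # [0, 2, 3, 4, 6, 8, 9]
print(image_list)
```

If a video has fewer frames than the requested number of samples, this scheme simply repeats some indices, so the output length is still `n_samples`.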

In [33]:
def extract_feats(params, model, load_image_fn):
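    '''
    :param params: dictionary of user-configurable parameters
    :param model: model used to extract the features
    :param load_image_fn: function that loads an image and converts it to the format the model expects
    '''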
    global C, H, W
    model.eval()

    dir_fc = params['output_dir']    
    if not os.path.isdir(dir_fc):
        os.mkdir(dir_fc)
    print("save video feats to %s" % (dir_fc))
    video_list = glob.glob(os.path.join(params['video_path'], '*.mp4'))  # collect all .mp4 files under video_path
    # print(video_list)
    
    for video in tqdm(video_list):  # tqdm: progress bar
        video_id = video.split("/")[-1].split(".")[0]   
        dst = params['model'] + '_' + video_id 
        
        ########## Complete the code here ##########
           
        with torch.no_grad(): 
            fc_feats = model(images).squeeze()
       
        img_feats = fc_feats.cpu().numpy()  # features from the pretrained CNN for this video's frames, shape (n_frame_steps, 2048)
        outfile = os.path.join(dir_fc, video_id + '.npy')
        outfile = outfile.replace('\\','/')
        np.save(outfile, img_feats)  # save the features
        shutil.rmtree(dst)           # remove the temporary frame folder dst

In [34]:
params={"output_dir":"output",'video_path':"train-video",'model':"resnet152",'n_frame_steps':40}

model = pretrainedmodels.resnet152(pretrained='imagenet')
load_image_fn = utils.LoadTransformImage(model)
model.last_linear = utils.Identity()
model = nn.DataParallel(model)
model = model.cuda()

In [35]:
extract_feats(params,model,load_image_fn)

  0%|                                                                                           | 0/74 [00:00<?, ?it/s]

save video feats to output
 cleanup: resnet152_train-video\G_00100/


100%|██████████████████████████████████████████████████████████████████████████████████| 74/74 [01:26<00:00,  1.16s/it]


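# 2. Preprocessing: extracting word vectors

## Algorithm overview

This step has two parts: tokenization with word counting, and word serialization.

Tokenization and counting: convert the video captions into a normalized data format.

- First tokenize each English caption, splitting the full sentence into individual English words;

- Add a `<sos>` token at the start of the sentence to mark its beginning, and an `<eos>` token at the end to mark its end;
    
- Count the frequency of every English word and set a frequency threshold; a word whose frequency does not exceed the threshold is discarded and replaced by the `<UNK>` token;
 
Word serialization: index all words, that is, assign an index to each word so that word text can be mapped to and from integer sequences.

- Build the `ix_to_word` object, which looks up a word from its index;

- Build the `word_to_ix` object, which converts a word to its index;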
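
The steps above can be sketched on toy data as follows. The two captions and the `encode` helper are illustrative assumptions, not taken from the dataset or from the notebook's code; the only convention borrowed from the `main()` function later in this section is that indices 0 and 1 are reserved for `<eos>` and `<sos>`.

```python
from collections import Counter

captions = ["A man wears a tie", "A man rides a horse"]   # toy data, not from the dataset
count_thr = 1

tokens = [cap.split() for cap in captions]
counts = Counter(w for ws in tokens for w in ws)

# Words seen more than count_thr times form the vocabulary
vocab = [w for w, n in counts.items() if n > count_thr]
if len(vocab) < len(counts):          # at least one rare word exists
    vocab.append("<UNK>")

# Indices 0 and 1 are reserved for <eos> and <sos>
word_to_ix = {"<eos>": 0, "<sos>": 1}
word_to_ix.update({w: i + 2 for i, w in enumerate(vocab)})
ix_to_word = {i: w for w, i in word_to_ix.items()}


def encode(ws):
    """Hypothetical helper: map a tokenized caption to a list of indices."""
    words = ["<sos>"] + [w if counts[w] > count_thr else "<UNK>" for w in ws] + ["<eos>"]
    return [word_to_ix[w] for w in words]


print(encode(tokens[0]))   # [1, 2, 3, 5, 4, 5, 0]
```

Here "wears" and "tie" each occur once, so both map to the index of `<UNK>`.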

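## Usage

`main(params)`

## Parameters

input_json: path of the json file containing the video information and caption sentences

caption_json: path of the output caption file

info_json: path of the output vocabulary file

word_count_threshold: frequency threshold; a word is kept in the vocabulary only if it occurs more often than the threshold. In this experiment the threshold is 1 

bad_words: words that occur at most word_count_threshold times

total_words: total number of words

UNKs: unknown words, the token used to replace bad_words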


## Example

In [36]:
import re
import json
import argparse

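### **Exercise 3: complete the `build_vocab()` function below**  
+ **Parameters**  
    `ws` is one caption with special symbols removed, already split into a list of words  
    `counts` is a dictionary holding the number of occurrences of each word  
    
Turn `ws` into a list in which every word whose count is less than or equal to the threshold `count_thr` is replaced by `<UNK>`. Then add the `<sos>` token at the start and the `<eos>` token at the end, and store the result in the variable `caption`.
+ **Example**
```python
# Code
ws = "A man wears a black tie".split()
count_thr = 1
counts = {'A': 3, 'man': 2, 'wears': 2, 'a': 3, 'black': 1, 'tie': 2}
# your code #
print('caption: ', caption)
# Output
# caption:  ['<sos>', 'A', 'man', 'wears', 'a', '<UNK>', 'tie', '<eos>']
```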

In [37]:
def build_vocab(vids, params):
    count_thr = params['word_count_threshold']  

    counts = {}
    for vid, caps in vids.items():  
        for cap in caps['captions']:
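            ws = re.sub(r'[.!,;?]', ' ', cap).split() # strip punctuation, then split into words
            
            for w in ws:
                counts[w] = counts.get(w, 0) + 1      # get() returns the current count of w (0 if unseen); adding 1 accumulates it
    # print(counts)  # number of occurrences of each word
    total_words = sum(counts.values())       # total number of words
    bad_words = [w for w, n in counts.items() if n <= count_thr] # rare ("bad") words
    # print(bad_words)
    vocab = [w for w, n in counts.items() if n > count_thr] # words kept in the vocabulary
    # print(vocab)
    bad_count = sum(counts[w] for w in bad_words)  # total occurrences of the bad words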
    print('number of bad words: %d/%d = %.2f%%' %
          (len(bad_words), len(counts), len(bad_words) * 100.0 / len(counts)))
    print('number of words in vocab would be %d' % (len(vocab), ))
    print('number of UNKs: %d/%d = %.2f%%' %
          (bad_count, total_words, bad_count * 100.0 / total_words))
   
    if bad_count > 0:
        # Add UNK, which stands in for the bad words, e.g. ['<sos>', 'A', 'man', 'wears', 'a', '<UNK>', 'tie', '<eos>']
        print('inserting the special UNK token')
        vocab.append('<UNK>')
    for vid, caps in vids.items():
        caps = caps['captions']
        vids[vid]['final_captions'] = []
        for cap in caps:
            ws = re.sub(r'[.!,;?]', ' ', cap).split()
            
            ########## Complete the code here ##########
            
            #print(caption)
            vids[vid]['final_captions'].append(caption)
    return vocab

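This function analyzes the captions in the input json file, splits the words into vocab and bad_words (words occurring at most word_count_threshold times), replaces the bad_words with UNK, and returns the vocabulary.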

### **Exercise 4: complete the `main()` function below**  
`videos` is a list of entries in the following format:
```
{
  "caption": "A group of people are holding an international conference",
  "sen_id": 0,
  "video_id": "G_00100"
},
{
  "sen_id": 1,
  "video_id": "G_00100",
  "caption": "There is a lot of water and flowers on the table"
},
{
  "caption": "China and Panama Leaders'Meeting",
  "sen_id": 2,
  "video_id": "G_00100"
},
{
  "caption": "A group of people are applauding",
  "sen_id": 0,
  "video_id": "G_00101"
},
{
  "caption": "Xi Jinping Meeting with President of Panama",
  "sen_id": 1,
  "video_id": "G_00101"
},
{
  "caption": "People attend the Economic and Trade Cooperation Forum",
  "sen_id": 2,
  "video_id": "G_00101"
}
```
Convert it to the following format and store it in the dictionary `video_caption`:
```
{
    "G_00100": 
        {"captions": 
            ["A group of people are holding an international conference", 
             "There is a lot of water and flowers on the table", 
             "China and Panama Leaders'Meeting"]
        }, 
    "G_00101": 
        {"captions": 
            ["A group of people are applauding", 
             "Xi Jinping Meeting with President of Panama", 
             "People attend the Economic and Trade Cooperation Forum"]
         }
}
```
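
One possible way to do this regrouping is sketched below on three of the entries shown above. It is an illustration of the idea rather than the reference solution; the `sentences` list is a shortened stand-in for the `videos` variable.

```python
sentences = [
    {"caption": "A group of people are applauding", "sen_id": 0, "video_id": "G_00101"},
    {"sen_id": 0, "video_id": "G_00100",
     "caption": "A group of people are holding an international conference"},
    {"caption": "Xi Jinping Meeting with President of Panama", "sen_id": 1, "video_id": "G_00101"},
]

video_caption = {}
for item in sentences:
    # Create the {"captions": []} entry the first time a video_id is seen
    entry = video_caption.setdefault(item["video_id"], {"captions": []})
    entry["captions"].append(item["caption"])

print(video_caption["G_00101"]["captions"])
```

`dict.setdefault` keeps the loop to a single lookup per sentence and preserves the order in which captions appear in the input.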

In [38]:
def main(params):
    videos = json.load(open(params['input_json'], 'r',encoding='UTF-8'))['sentences']
    video_caption = {}
    
    ########## Complete the code here ##########
    
    #print(video_caption)  # each video has several captions
    # build the vocabulary
    vocab = build_vocab(video_caption, params)
    itow = {i + 2: w for i, w in enumerate(vocab)}
    wtoi = {w: i + 2 for i, w in enumerate(vocab)}  

    wtoi['<eos>'] = 0
    itow[0] = '<eos>'
    wtoi['<sos>'] = 1
    itow[1] = '<sos>'

    out = {}
    out['ix_to_word'] = itow
    out['word_to_ix'] = wtoi
    out['videos'] = {'train': [], 'val': [], 'test': []}
    videos = json.load(open(params['input_json'], 'r',encoding='UTF-8'))['videos']
    for i in videos:
        out['videos'][i['split']].append(int(i['id']))
    json.dump(out, open(params['info_json'], 'w'))
    json.dump(video_caption, open(params['caption_json'], 'w'))  

In [39]:
params={"word_count_threshold":1,"input_json":"data/Video_info.json","info_json":"data/info.json","caption_json":"data/caption.json"}

In [40]:
main(params)

number of bad words: 746/1310 = 56.95%
number of words in vocab would be 564
number of UNKs: 746/4709 = 15.84%
inserting the special UNK token


 The word serialization result is saved in the data folder as info.json; the captions are saved in caption.json

# 3. Training the model

## Algorithm overview

![title](image1.png)


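The model uses two stacked RNN layers, each with 512 hidden units. The first LSTM layer (red in the figure) models the frame sequence, and its output hidden states are fed to the second LSTM layer (green), which models the final output word sequence.

Training:

The top LSTM layer receives the frame sequence and encodes it. The second LSTM layer receives the hidden state h of the first layer, concatenates it with zero padding, and encodes the result; no loss is computed during this stage.

After hidden states have been output for all frames, the start-of-sentence token `<BOS>` is fed to the second LSTM layer, prompting it to start decoding the hidden states it has received into a word sequence.
    
During the decoding stage in training, the log-likelihood of the predicted sentence is computed given the hidden states of the frame sequence and the previously output words. The training objective is to maximize the expression below.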

![title](image6.png)


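Stochastic gradient descent is used to optimize over the entire training set, so that the LSTM learns more suitable hidden states h.

The output z of the second LSTM layer is used to pick the most probable target word y from the vocabulary V: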

![title](image5.png)
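
The formulas in the figures are not reproduced in text here. As a rough illustration, and assuming the usual formulation in which a softmax over the vocabulary turns the decoder scores into word probabilities, the sentence log-likelihood and the most probable word at each step can be computed as in the sketch below. The vocabulary and the scores are made-up toy values.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["<eos>", "<sos>", "a", "man", "<UNK>"]   # toy vocabulary
# Hypothetical decoder scores: one row per time step, one column per word
step_scores = [
    [0.1, 0.0, 2.0, 0.5, 0.2],
    [0.3, 0.0, 0.4, 3.0, 0.1],
    [2.5, 0.0, 0.2, 0.1, 0.3],
]
target = [2, 3, 0]   # indices of "a", "man", "<eos>"

probs = [softmax(row) for row in step_scores]
# Most probable word at each step
greedy = [vocab[max(range(len(p)), key=p.__getitem__)] for p in probs]
# Log-likelihood of the target sentence: sum of log p(y_t) over the time steps
log_likelihood = sum(math.log(p[t]) for p, t in zip(probs, target))

print(greedy)            # ['a', 'man', '<eos>']
print(log_likelihood)
```

Maximizing this log-likelihood is equivalent to minimizing the negative log-likelihood, which is the form a training loss usually takes.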


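## Usage 

## Parameters

### Model settings:

model: which model to use

max_len: maximum length of a caption

bidirectional: whether the encoder/decoder is bidirectional; 0 disables it, 1 enables it

dim_hidden: size of the RNN hidden layer

num_layers: number of RNN layers

input_dropout_p: dropout strength

rnn_type: RNN type, LSTM or GRU

rnn_dropout_p: dropout applied to the RNN

dim_word: embedding size of each token in the vocabulary

### Data input settings:

input_json: path of the json file containing the video information

info_json: path of the json file containing the vocabulary and other information

caption_json: path of the json file with the processed video captions

feats_dir: path of the folder containing the preprocessed fc features

c3d_feats_dir: path of the C3D features

with_c3d: whether to use C3D features

cached_tokens: cached token file used to compute the CIDEr score during training.

### Other settings:

epochs: number of training epochs

batch_size: batch size of the input data

grad_clip: gradient clipping threshold (guards against exploding gradients)

self_crit_after: epoch from which self-critical (reward-based) training is used; -1 disables it, 0 enables it from the first epoch

dim_vid: dimension of the input features

learning_rate: learning rate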


## Example

In [41]:
import json
import os
import numpy as np

import misc.utils as utils
import opts
import torch
import torch.optim as optim
from torch.nn.utils import clip_grad_value_
from dataloader import VideoDataset
from misc.rewards import get_self_critical_reward, init_cider_scorer
from models import DecoderRNN, EncoderRNN, S2VTAttModel, S2VTModel
from torch import nn
from torch.utils.data import DataLoader

In [42]:
opt = {"input_json":"data\\Video_info.json",
       "info_json":"data\\info.json",
       "caption_json":"data\\caption.json",
       "feats_dir":"output\\train-video",
       "c3d_feats_dir":"",
       "with_c3d":0,
       "cached_tokens":"msr-all-idxs",
       "model":"S2VTModel",
       "max_len":28,
       "bidirectional": 0,
       "dim_hidden":512 ,
       "num_layers":1 ,
       "input_dropout_p":0.2 ,
       "rnn_type":"gru",
       "rnn_dropout_p":0.5,
       "dim_word":512,
       "dim_vid":2048,
       "epochs":301,
       "batch_size":1,
       "grad_clip":5,
       "self_crit_after":-1,
       "learning_rate":4e-4,
       "learning_rate_decay_rate":0.8,
       "learning_rate_decay_every":200,
       "optim_alpha":0.9,
       "optim_beta":0.999,
       "optim_epsilon":1e-8,
       "weight_decay":5e-4,
       "save_checkpoint_every":50,
       "checkpoint_path":"save",
       "gpu":0
      }

### (1) Preparing the data:

In [43]:
opt_json = os.path.join(opt["checkpoint_path"], 'opt_info.json')
if not os.path.isdir(opt["checkpoint_path"]):
    os.mkdir(opt["checkpoint_path"])
with open(opt_json, 'w') as f:
    json.dump(opt, f)
print('save opt details to %s' % (opt_json))

save opt details to save\opt_info.json


In [44]:
dataset = VideoDataset(opt, 'train')  
dataloader = DataLoader(dataset, batch_size=opt["batch_size"], shuffle=True) # DataLoader packs the samples into batch_size-sized tensor batches for training; shuffle=True randomizes their order
opt["vocab_size"] = dataset.get_vocab_size()

vocab size is  567
number of train videos:  70
number of val videos:  0
number of test videos:  4
load feats from output\train-video
max sequence length in data is 28


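### (2) Defining the network:

### **Exercise 5: complete the code below that constructs the S2VT model**  
See the [S2VTModel source](models/S2VTModel.py) at `models/S2VTModel.py` and construct the `S2VT` model from the parameters in `opt`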

In [45]:
if opt["model"] == 'S2VTModel':
        model = ########## Complete the code here ##########
elif opt["model"] == "S2VTAttModel":
        encoder = EncoderRNN(
            opt["dim_vid"],
            opt["dim_hidden"],
            bidirectional=opt["bidirectional"],
            input_dropout_p=opt["input_dropout_p"],
            rnn_cell=opt['rnn_type'],
            rnn_dropout_p=opt["rnn_dropout_p"])
        decoder = DecoderRNN(
            opt["vocab_size"],
            opt["max_len"],
            opt["dim_hidden"],
            opt["dim_word"],
            input_dropout_p=opt["input_dropout_p"],
            rnn_cell=opt['rnn_type'],
            rnn_dropout_p=opt["rnn_dropout_p"],
            bidirectional=opt["bidirectional"])       
        model = S2VTAttModel(encoder, decoder)
model = model.cuda()

###  (3) Defining the loss functions 

In [46]:
crit = utils.LanguageModelCriterion() 
rl_crit = utils.RewardCriterion()
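
For intuition, a language-model criterion of this kind is typically a masked negative log-likelihood: padded positions are excluded through the mask. The pure-Python sketch below shows the idea on nested lists. It is not the code in `misc/utils.py`; in particular, `masked_nll` is a hypothetical name and dividing by the batch size is an assumption about the normalization.

```python
import math


def masked_nll(log_probs, targets, mask):
    """Sum of -log p(target) over unmasked positions, divided by the batch size.

    log_probs: [batch][time][vocab] log-probabilities
    targets:   [batch][time] word indices
    mask:      [batch][time] 1 for real words, 0 for padding
    """
    total = 0.0
    for seq_lp, seq_t, seq_m in zip(log_probs, targets, mask):
        for step_lp, t, m in zip(seq_lp, seq_t, seq_m):
            total += -step_lp[t] * m
    return total / len(log_probs)


# One sequence, three time steps, two-word vocabulary (toy values)
log_probs = [[[math.log(0.5), math.log(0.5)],
              [math.log(0.25), math.log(0.75)],
              [math.log(0.9), math.log(0.1)]]]
targets = [[0, 1, 1]]
mask = [[1, 1, 0]]   # the last position is padding and is ignored

loss = masked_nll(log_probs, targets, mask)
print(loss)          # equals -log(0.5 * 0.75)
```

Because the loss is summed over the words of a caption rather than averaged per word, longer captions produce larger values under this normalization.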



### (4) Defining the optimizer: Adam

In [47]:
optimizer = optim.Adam(         
    model.parameters(),    # hand the network parameters to the optimizer    
    lr=opt["learning_rate"],   
    weight_decay=opt["weight_decay"]) 

exp_lr_scheduler = optim.lr_scheduler.StepLR(       
    optimizer, 
    step_size=opt["learning_rate_decay_every"], 
    gamma=opt["learning_rate_decay_rate"]) 
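
As a rough guide to what this scheduler does, step decay multiplies the learning rate by `gamma` once every `step_size` scheduler steps. The sketch below reproduces that rule in plain Python with the values from `opt`; `step_lr` is a hypothetical helper, not a PyTorch function.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` scheduler steps under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# learning_rate=4e-4, learning_rate_decay_rate=0.8, learning_rate_decay_every=200
for epoch in (0, 199, 200, 300):
    print(epoch, step_lr(4e-4, 0.8, 200, epoch))
```

With 301 epochs, the rate therefore drops only once, from 4e-4 to about 3.2e-4.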

### (5) Training loop

In [48]:
model.train()

S2VTModel(
  (rnn1): GRU(2048, 512, batch_first=True, dropout=0.5)
  (rnn2): GRU(1024, 512, batch_first=True, dropout=0.5)
  (embedding): Embedding(567, 512)
  (out): Linear(in_features=512, out_features=567, bias=True)
)

In [49]:
for epoch in range(opt["epochs"]): 
    exp_lr_scheduler.step() 
    iteration = 0

    if opt["self_crit_after"] != -1 and epoch >= opt["self_crit_after"]:  
        sc_flag = True 
        init_cider_scorer(opt["cached_tokens"]) 
    else:
        sc_flag = False
        
    # Prepare the training data:
    
    for data in dataloader: 
        torch.cuda.synchronize() 
        fc_feats = data['fc_feats'].cuda()
        labels = data['labels'].cuda()
        masks = data['masks'].cuda()
            
        optimizer.zero_grad()  # clear the gradients held by the optimizer
        if not sc_flag:
            seq_probs, _ = model(fc_feats, labels, 'train')      # forward pass
            loss = crit(seq_probs, labels[:, 1:], masks[:, 1:])  # compute the loss
        else:
            seq_probs, seq_preds = model(
                fc_feats, mode='inference', opt=opt)
            reward = get_self_critical_reward(model, fc_feats, data,
                                                  seq_preds)
            print(reward.shape)
            loss = rl_crit(seq_probs, seq_preds,
                            torch.from_numpy(reward).float().cuda())
            
        loss.backward()       # backward pass
        clip_grad_value_(model.parameters(), opt['grad_clip'])
            
        optimizer.step()     # update the parameters
        train_loss = loss.item()
        torch.cuda.synchronize()
        iteration += 1
            
        if not sc_flag:
            print("iter %d (epoch %d), train_loss = %.6f" %
                    (iteration, epoch, train_loss))
        else:
            print("iter %d (epoch %d), avg_reward = %.6f" %
                    (iteration, epoch, np.mean(reward[:, 0])))
                
    if epoch % opt["save_checkpoint_every"] == 0:
        model_path = os.path.join(opt["checkpoint_path"],
                                      'model_%d.pth' % (epoch))
        model_info_path = os.path.join(opt["checkpoint_path"],
                                           'model_score.txt')
        torch.save(model.state_dict(), model_path)
        print("model saved to %s" % (model_path))
        with open(model_info_path, 'a') as f:
            f.write("model_%d, loss: %.6f\n" % (epoch, train_loss))



iter 1 (epoch 0), train_loss = 24.901928
iter 2 (epoch 0), train_loss = 23.424854
iter 3 (epoch 0), train_loss = 63.177578
iter 4 (epoch 0), train_loss = 57.602665
iter 5 (epoch 0), train_loss = 63.253719
iter 6 (epoch 0), train_loss = 43.669899
iter 7 (epoch 0), train_loss = 42.917816
iter 8 (epoch 0), train_loss = 38.557159
iter 9 (epoch 0), train_loss = 21.914635
iter 10 (epoch 0), train_loss = 61.325832
iter 11 (epoch 0), train_loss = 58.056610
iter 12 (epoch 0), train_loss = 42.238190
iter 13 (epoch 0), train_loss = 26.240246
iter 14 (epoch 0), train_loss = 38.005875
iter 15 (epoch 0), train_loss = 23.372185
iter 16 (epoch 0), train_loss = 29.681520
iter 17 (epoch 0), train_loss = 27.285995
iter 18 (epoch 0), train_loss = 40.775490
iter 19 (epoch 0), train_loss = 33.239227
iter 20 (epoch 0), train_loss = 29.534119
iter 21 (epoch 0), train_loss = 62.466633
iter 22 (epoch 0), train_loss = 39.408524
iter 23 (epoch 0), train_loss = 19.127930
iter 24 (epoch 0), train_loss = 23.667763
...

iter 59 (epoch 2), train_loss = 27.294922
iter 60 (epoch 2), train_loss = 18.611269
iter 61 (epoch 2), train_loss = 34.032806
iter 62 (epoch 2), train_loss = 48.014523
iter 63 (epoch 2), train_loss = 29.490538
iter 64 (epoch 2), train_loss = 28.841024
iter 65 (epoch 2), train_loss = 35.977760
iter 66 (epoch 2), train_loss = 45.652275
iter 67 (epoch 2), train_loss = 24.517471
iter 68 (epoch 2), train_loss = 29.430014
iter 69 (epoch 2), train_loss = 21.589661
iter 70 (epoch 2), train_loss = 27.370144


 The models and their loss values are saved in the save folder 

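# 4. Testing

## Algorithm overview

Feed the test videos to the trained model.

## Parameters

recover_opt: path of the original options file

saved_model: path of the model to test

batch_size: batch size of the input data

sample_max: whether to pick the highest-probability word as the next word during inference

## Example 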

In [9]:
import json
import os
import argparse
import torch
from torch import nn
from torch.autograd import Variable
from torch.utils.data import DataLoader
from models import EncoderRNN, DecoderRNN, S2VTAttModel, S2VTModel
from dataloader import VideoDataset
import misc.utils as utils
from misc.cocoeval import suppress_stdout_stderr, COCOScorer
from pandas import json_normalize
import nltk.tokenize as tk

### Converting the data format

In [10]:
def convert_data_to_coco_scorer_format(data_frame):
    gts = {}
    for row in zip(data_frame["caption"], data_frame["video_id"]):
        if row[1] in gts:
            gts[row[1]].append(
                {'image_id': row[1], 'cap_id': len(gts[row[1]]), 'caption': row[0]})
        else:
            gts[row[1]] = []
            gts[row[1]].append(
                {'image_id': row[1], 'cap_id': len(gts[row[1]]), 'caption': row[0]})
    return gts

### Testing: generating predicted sentences

In [11]:
def test(model, crit, dataset, vocab, opt):
    model.eval()
    loader = DataLoader(dataset, batch_size=opt["batch_size"], shuffle=True)
    gt_dataframe = json_normalize(
        json.load(open(opt["input_json"],'r', encoding='UTF-8'))['sentences'])
    gts = convert_data_to_coco_scorer_format(gt_dataframe)
    
    results = []
    samples = {}
    
    for data in loader:
        
        fc_feats = data['fc_feats'].cuda()
        labels = data['labels'].cuda()
        masks = data['masks'].cuda()
        video_ids = data['video_ids']
      
        with torch.no_grad():
            seq_probs, seq_preds = model(
                fc_feats, mode='inference', opt=opt)

        sents = utils.decode_sequence(vocab, seq_preds)
       
        for k, sent in enumerate(sents):
            video_id = video_ids[k]
            samples[video_id] = [{'image_id': video_id, 'caption': sent}]
    if not os.path.exists(opt["results_path"]):
        os.makedirs(opt["results_path"])
        
    with open(os.path.join(opt["results_path"],
                           opt["model"].split("/")[-1].split('.')[0] + ".json"), 'w') as prediction_results:
        json.dump({"predictions": samples},
                  prediction_results)
      

In [14]:
def main(opt):
    dataset = VideoDataset(opt, "test")
    opt["vocab_size"] = dataset.get_vocab_size()
    opt["seq_length"] = dataset.max_len
    if opt["model"] == 'S2VTModel':
        model = S2VTModel(opt["vocab_size"], opt["max_len"], opt["dim_hidden"], opt["dim_word"],
                          rnn_dropout_p=opt["rnn_dropout_p"]).cuda()
        
    elif opt["model"] == "S2VTAttModel":
        encoder = EncoderRNN(opt["dim_vid"], opt["dim_hidden"], bidirectional=opt["bidirectional"],
                             input_dropout_p=opt["input_dropout_p"], rnn_dropout_p=opt["rnn_dropout_p"])
        decoder = DecoderRNN(opt["vocab_size"], opt["max_len"], opt["dim_hidden"], opt["dim_word"],
                             input_dropout_p=opt["input_dropout_p"],
                             rnn_dropout_p=opt["rnn_dropout_p"], bidirectional=opt["bidirectional"])
        model = S2VTAttModel(encoder, decoder).cuda()
        
    model.load_state_dict(torch.load(opt["saved_model"]))
    crit = utils.LanguageModelCriterion()

    test(model, crit, dataset, dataset.get_vocab(), opt)

### Parameter settings

In [15]:
args={"recover_opt":"save/opt_info.json","saved_model":"save/model_300.pth","dump_json":1,"results_path":"results","dump_path":0,"gpu":0,"batch_size":1,"sample_max":1,"temperature":1,"beam_size":1}
opt = json.load(open(args["recover_opt"]))
for k, v in args.items():
    opt[k] = v

### Running the function

In [16]:
main(opt)

vocab size is  567
number of train videos:  70
number of val videos:  0
number of test videos:  4
load feats from output\train-video
max sequence length in data is 28




 The predictions are stored as a json file in the `results` folder
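
The saved file can be read back with the standard `json` module. The sketch below first writes a tiny file with the same layout that `test()` produces, using a temporary folder and a placeholder caption, and then extracts one caption per video. The file name `S2VTModel.json` follows from the `opt["model"]` value used in this notebook.

```python
import json
import os
import tempfile

# Write a tiny stand-in file; the caption text is a placeholder, not a real prediction
samples = {"G_00173": [{"image_id": "G_00173", "caption": "a truck on the road"}]}
results_path = tempfile.mkdtemp()
pred_file = os.path.join(results_path, "S2VTModel.json")
with open(pred_file, "w") as f:
    json.dump({"predictions": samples}, f)

# Read the predictions back and keep one caption per video id
with open(pred_file) as f:
    predictions = json.load(f)["predictions"]

captions = {vid: entries[0]["caption"] for vid, entries in predictions.items()}
print(captions)
```

To inspect real results, point `pred_file` at the json file inside the `results` folder instead of the temporary one.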

### Sample results

In [17]:
import subprocess
import shutil
import os
import cv2
import ipywidgets as widgets
from IPython.display import display

def show_video(video, dst):
    with open(os.devnull, "w") as ffmpeg_log:
        if os.path.exists(dst):
            shutil.rmtree(dst)                        
        os.makedirs(dst)       
        video_to_frames_command = ["ffmpeg",
                                   '-y',               
                                   '-i', video,        
                                   '-vf', "scale=400:300",  
                                   '-qscale:v', "2",   
                                   '{0}/%06d.jpg'.format(dst)]
        subprocess.call(video_to_frames_command,
                        stdout=ffmpeg_log, stderr=ffmpeg_log)
        imgbox = widgets.Image(format='jpg', height=300, width=400)
        display(imgbox)
        root = dst
        for file in sorted(os.listdir(root)):  # sorted so the frames are shown in order
            canvas = cv2.imread(os.path.join(root, file))
            imgbox.value = cv2.imencode('.jpg', canvas)[1].tobytes()

In [18]:
show_video("train-video/G_00173.mp4","frame_save/173")

Image(value=b'', format='jpg', height='300', width='400')

### Model prediction: The goods from the overturned truck were scattered all over the place

In [19]:
show_video("train-video/G_00171.mp4","frame_save/171")

Image(value=b'', format='jpg', height='300', width='400')

### Model prediction: A white van pulled into a door

# Note: training the model on more videos should give better results!