In [1]:
!pip install mxnet-cu112
!sudo apt install libnccl2

Collecting mxnet-cu112
  Downloading mxnet_cu112-1.9.1-py3-none-manylinux2014_x86_64.whl (499.4 MB)
Collecting graphviz<0.9.0,>=0.8.1
  Using cached graphviz-0.8.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: graphviz, mxnet-cu112
Successfully installed graphviz-0.8.4 mxnet-cu112-1.9.1
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  libnccl2
0 upgraded, 1 newly installed, 0 to remove and 177 not upgraded.
Need to get 97.5 MB of archives.
After this operation, 268 MB of additional disk space will be used.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  libnccl2 2.15.5-1+cuda11.8 [97.5 MB]
Fetched 97.5 MB in 5s (18.2 MB/s)
debconf: delaying package configuration, since apt-utils is not installed

In [2]:
!pip install --upgrade gluoncv

Collecting gluoncv
  Using cached gluoncv-0.10.5.post0-py2.py3-none-any.whl (1.3 MB)
Collecting opencv-python
  Using cached opencv_python-4.7.0.72-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (61.8 MB)
Collecting portalocker
  Using cached portalocker-2.7.0-py2.py3-none-any.whl (15 kB)
Collecting yacs
  Downloading yacs-0.1.8-py3-none-any.whl (14 kB)
Collecting autocfg
  Using cached autocfg-0.0.8-py3-none-any.whl (13 kB)
Installing collected packages: opencv-python, portalocker, yacs, autocfg, gluoncv
Successfully installed autocfg-0.0.8 gluoncv-0.10.5.post0 opencv-python-4.7.0.72 portalocker-2.7.0 yacs-0.1.8


# 6. Dive Deep into Training SlowFast Models on Kinetics400

This tutorial walks through video action recognition with the GluonCV toolkit, step by step.
Readers should have basic knowledge of deep learning and be familiar with the Gluon API.
New users may first go through `A 60-minute Gluon Crash Course <http://gluon-crash-course.mxnet.io/>`_.
You can `Start Training Now`_ or `Dive into Deep`_.

## Start Training Now

<div class="alert alert-info"><h4>Note</h4><p>Feel free to skip the tutorial because the training script is self-contained and ready to launch.

    :download:`Download Full Python Script: train_recognizer.py<../../../scripts/action-recognition/train_recognizer.py>`

    For more training command options, please run ``python train_recognizer.py -h``.
    Please check out the `model_zoo <../model_zoo/index.html#action_recognition>`_ for the training commands that reproduce the pretrained models.</p></div>


### Network Structure

First, let's import the necessary libraries into Python.




In [3]:
!pip install mxnet-cu101

Collecting mxnet-cu101
  Downloading mxnet_cu101-1.9.1-py3-none-manylinux2014_x86_64.whl (360.0 MB)
Installing collected packages: mxnet-cu101
Successfully installed mxnet-cu101-1.9.1


In [1]:
from __future__ import division

import argparse, time, logging, os, sys, math

import numpy as np
import mxnet as mx
import gluoncv as gcv
from mxnet import gluon, nd, init, context
from mxnet import autograd as ag
from mxnet.gluon import nn
from mxnet.gluon.data.vision import transforms

from gluoncv.data.transforms import video

from gluoncv.data import Kinetics400
from gluoncv.model_zoo import get_model
from gluoncv.utils import makedirs, LRSequential, LRScheduler, split_and_load, TrainingHistory

Here we pick a widely adopted model, ``SlowFast``, for the tutorial.
`SlowFast <https://arxiv.org/abs/1812.03982>`_ is a 3D video
classification model that aims for the best trade-off between accuracy and efficiency.
It proposes two branches, a fast branch and a slow branch, to handle different aspects of a video.
The fast branch captures motion dynamics using many but small video frames.
The slow branch captures fine appearance details using few but large video frames.
Features from the two branches are combined using lateral connections.
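
As a minimal sketch of how the two branches see the same clip, the helper below lists the frame indices each branch would read. The function name `slowfast_frame_indices` is hypothetical. The clip length of 64 and the strides of 2 and 16 are the values this notebook uses when it samples frames for inference, not constants required by the model.

```python
def slowfast_frame_indices(clip_len=64, fast_stride=2, slow_stride=16):
    """Return (fast, slow) frame indices for one clip of `clip_len` frames."""
    # Fast branch: dense temporal sampling to follow motion
    fast = list(range(0, clip_len, fast_stride))
    # Slow branch: sparse temporal sampling for appearance details
    slow = list(range(0, clip_len, slow_stride))
    return fast, slow


fast_ids, slow_ids = slowfast_frame_indices()
print(len(fast_ids), len(slow_ids), len(fast_ids) + len(slow_ids))  # 32 4 36
```

The two lists are concatenated before preprocessing, which is why the network input holds 36 frames per clip.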



In [2]:
# number of GPUs to use
num_gpus = 1
ctx = [mx.gpu(i) for i in range(num_gpus)]

# Get the slowfast_4x16_resnet50_kinetics400 architecture with 4 output classes (nclass=4), without pre-trained weights
net = get_model(name='slowfast_4x16_resnet50_kinetics400', nclass=4)
net.collect_params().reset_ctx(ctx)
print(net)


### Data Augmentation and Data Loader

Data augmentation for video differs from augmentation for images. For example, if you
want to randomly crop a video sequence, you need to make sure all the video
frames in this sequence undergo the same cropping process. We provide a
new set of transformation functions that work on multiple images.
Please check out `video.py <../../../gluoncv/data/transforms/video.py>`_ for more details.
Most video data augmentation strategies used here are introduced in [Wang15]_.
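
The requirement that every frame share the same crop can be illustrated with plain NumPy. This is a sketch of the idea only, not the GluonCV implementation: `random_crop_clip` is a hypothetical helper, and the clip is synthetic.

```python
import numpy as np


def random_crop_clip(frames, size, rng):
    """Crop every frame of a clip with one shared, randomly chosen window."""
    height, width = frames[0].shape[:2]
    crop_h, crop_w = size
    # Draw the crop offsets once per clip, not once per frame
    top = int(rng.integers(0, height - crop_h + 1))
    left = int(rng.integers(0, width - crop_w + 1))
    return [f[top:top + crop_h, left:left + crop_w] for f in frames]


rng = np.random.default_rng(0)
# Synthetic clip: 8 identical 256x340 frames whose pixel values encode position
frame = np.arange(256 * 340).reshape(256, 340)
clip = [frame.copy() for _ in range(8)]
cropped = random_crop_clip(clip, (224, 224), rng)
print(cropped[0].shape, all(np.array_equal(c, cropped[0]) for c in cropped))
```

Because the offsets are drawn once, identical input frames stay identical after cropping. Drawing new offsets per frame would make the content jitter between frames and corrupt the motion signal.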



In [1]:
transform_train = transforms.Compose([
    # Center-crop every frame of the clip to 224 × 224.
    video.VideoCenterCrop(size=(224, 224)),
    # Randomly flip the video frames horizontally
    video.VideoRandomHorizontalFlip(),
    # Transpose the video frames from height*width*num_channels to num_channels*height*width
    # and map values from [0, 255] to [0,1]
    video.VideoToTensor(),
    # Normalize the video frames with mean and standard deviation calculated across all images
    video.VideoNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])


With the transform functions, we can define data loaders for our
training datasets.



In [None]:
# Batch size for each GPU
per_device_batch_size = 10
# Number of data loader workers
num_workers = 0
# Calculate effective total batch size
batch_size = per_device_batch_size * num_gpus


# Set train=True for training the model.
# ``new_length`` indicates the number of frames we will cover.
# For SlowFast network, we evenly sample 32 frames for the fast branch and 4 frames for the slow branch.
# This leads to the actual input length of 36 video frames.
train_dataset = Kinetics400(
    train=True, new_length=64, slowfast=True, transform=transform_train,
    root='/home/irteam/dcloud-global-dir/Ahpuh/swim_dataset/rawframes_train',
    setting='/home/irteam/dcloud-global-dir/Ahpuh/swim_dataset/swim_train_list_rawframes.txt')
print('Load %d training samples.' % len(train_dataset))
train_data = gluon.data.DataLoader(train_dataset, batch_size=batch_size,
                                   shuffle=True, num_workers=num_workers)

Load 185 training samples.
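
As a quick sanity check on these numbers, the sketch below computes how many batches one epoch contains. It assumes the data loader keeps the final partial batch, which is the Gluon `DataLoader` default as far as we know; the variable names mirror the cell above.

```python
import math

num_samples = 185            # from the "Load 185 training samples." output
per_device_batch_size = 10
num_gpus = 1
batch_size = per_device_batch_size * num_gpus

# 18 full batches plus one partial batch of 5 samples
full_batches, remainder = divmod(num_samples, batch_size)
num_batches = math.ceil(num_samples / batch_size)
print(full_batches, remainder, num_batches)  # 18 5 19
```

With 19 batches per epoch, every sample is visited once per epoch, and the per-epoch iteration cap used in the training loop is never reached on this dataset.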


### Optimizer, Loss and Metric



In [6]:
lr_decay = 0.1
warmup_epoch = 34
total_epoch = 196
num_batches = len(train_data)
lr_scheduler = LRSequential([
    LRScheduler('linear', base_lr=0.01, target_lr=0.1,
                nepochs=warmup_epoch, iters_per_epoch=num_batches),
    LRScheduler('cosine', base_lr=0.1, target_lr=0,
                nepochs=total_epoch - warmup_epoch,
                iters_per_epoch=num_batches,
                step_factor=lr_decay, power=2)
])

# Stochastic gradient descent
optimizer = 'sgd'
# Set parameters
optimizer_params = {'learning_rate': 0.01, 'wd': 0.0001, 'momentum': 0.9}
optimizer_params['lr_scheduler'] = lr_scheduler

# Define our trainer for net
trainer = gluon.Trainer(net.collect_params(), optimizer, optimizer_params)
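
To make the shape of this schedule concrete, here is a minimal sketch in plain Python: a linear warm-up from 0.01 to 0.1 over the first 34 epochs, then a half-cosine decay to 0 by epoch 196. The formulas are the standard linear and cosine rules and are an assumption here; GluonCV's `LRScheduler` updates per iteration and may differ in detail. `lr_at_epoch` is a hypothetical helper.

```python
import math


def lr_at_epoch(epoch, warmup_epoch=34, total_epoch=196,
                warmup_start=0.01, peak_lr=0.1, final_lr=0.0):
    """Approximate learning rate at a (possibly fractional) epoch."""
    if epoch < warmup_epoch:
        # Linear warm-up from warmup_start to peak_lr
        frac = epoch / warmup_epoch
        return warmup_start + (peak_lr - warmup_start) * frac
    # Half-cosine decay from peak_lr down to final_lr
    frac = min((epoch - warmup_epoch) / (total_epoch - warmup_epoch), 1.0)
    return final_lr + (peak_lr - final_lr) * (1 + math.cos(math.pi * frac)) / 2


for epoch in (0, 17, 34, 115, 196):
    print(epoch, round(lr_at_epoch(epoch), 4))
```

Note that the schedule spans 196 epochs, so a run that stops earlier ends while the learning rate is still relatively high.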

In order to optimize our model, we need a loss function.
For classification tasks, we usually use softmax cross entropy as the
loss function.
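
For reference, softmax cross entropy can be written in a few lines of NumPy. This is a sketch of the formula, not the Gluon implementation, and the logits below are made up for illustration.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Per-sample cross entropy between softmax(logits) and integer labels."""
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for each sample
    return -log_probs[np.arange(len(labels)), labels]


logits = np.array([[0.0, 0.0, 0.0, 0.0],    # uniform over 4 classes
                   [10.0, 0.0, 0.0, 0.0]])  # confident in class 0
labels = np.array([2, 0])
losses = softmax_cross_entropy(logits, labels)
print(losses)
```

A uniform prediction over 4 classes gives a loss of ln 4 ≈ 1.386, which is a useful baseline when reading the training log: a loss above that value means the model is doing worse than guessing uniformly.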



In [7]:
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

For simplicity, we use accuracy as the metric to monitor our training
process. We also record the metric values and plot them at the
end of training.
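
Accuracy here means the fraction of samples whose highest-scoring class matches the label. A minimal NumPy sketch with made-up scores:

```python
import numpy as np


def accuracy(scores, labels):
    """Fraction of rows whose arg-max class equals the integer label."""
    preds = scores.argmax(axis=1)
    return float((preds == labels).mean())


scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.6, 0.2, 0.1, 0.1],
                   [0.2, 0.2, 0.5, 0.1],
                   [0.4, 0.3, 0.2, 0.1]])
labels = np.array([1, 0, 3, 0])
acc = accuracy(scores, labels)
print(acc)  # 0.75
```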



In [8]:
train_metric = mx.metric.Accuracy()
train_history = TrainingHistory(['training-acc'])

### Training

After all the preparations, we can finally start training!
The training loop is as follows.

<div class="alert alert-info"><h4>Note</h4><p>To keep the run short, this notebook trains on a small dataset of 185 clips
  and stops each epoch after at most 101 iterations. For the full Kinetics400 dataset, we recommend setting ``epochs=100``.</p></div>
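
The training loop that follows reshapes each batch with `X.reshape((-1,) + X.shape[2:])`, which merges the first two axes. The NumPy sketch below shows the effect under an assumed batch layout of (batch, clips per sample, channels, frames, height, width); the concrete shape is hypothetical, with a tiny spatial size to keep it light.

```python
import numpy as np

# Hypothetical batch: 10 samples, 1 clip each, 3 channels, 36 frames, 8x8 pixels
X = np.zeros((10, 1, 3, 36, 8, 8), dtype=np.float32)

# Merge the sample axis and the clip axis into one leading axis
merged = X.reshape((-1,) + X.shape[2:])
print(merged.shape)  # (10, 3, 36, 8, 8)
```

After the merge, the network receives a 5-D tensor in which each clip is treated as an independent example.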



In [9]:
epochs = 100

for epoch in range(epochs):
    tic = time.time()
    train_metric.reset()
    train_loss = 0

    # Loop through each batch of training data
    for i, batch in enumerate(train_data):
        # Extract data and label
        data = split_and_load(batch[0], ctx_list=ctx, batch_axis=0, even_split=False)
        label = split_and_load(batch[1], ctx_list=ctx, batch_axis=0, even_split=False)

        # AutoGrad
        with ag.record():
            output = []
            for X in data:
                # Merge the first two axes so the network receives a 5-D batch
                X = X.reshape((-1,) + X.shape[2:])
                pred = net(X)
                output.append(pred)
            loss = [loss_fn(yhat, y) for yhat, y in zip(output, label)]

        # Backpropagation
        for l in loss:
            l.backward()

        # Optimize
        trainer.step(batch_size)

        # Update metrics
        train_loss += sum([l.mean().asscalar() for l in loss])
        train_metric.update(label, output)

        if i == 100:
            break

    name, acc = train_metric.get()

    # Update history and print metrics
    train_history.update([acc])
    print('[Epoch %d] train=%f loss=%f time: %f' %
        (epoch, acc, train_loss / (i+1), time.time()-tic))

# We can plot the metric scores with:
train_history.plot()

[Epoch 0] train=0.513514 loss=1.230537 time: 104.334068
[Epoch 1] train=0.524324 loss=1.547367 time: 85.579333
[Epoch 2] train=0.513514 loss=3.150768 time: 82.320675
[Epoch 3] train=0.518919 loss=3.554893 time: 82.714502
[Epoch 4] train=0.524324 loss=3.568263 time: 85.405434
[Epoch 5] train=0.529730 loss=1.863569 time: 83.962964
[Epoch 6] train=0.540541 loss=2.164887 time: 82.732981
[Epoch 7] train=0.524324 loss=2.897637 time: 81.822337
[Epoch 8] train=0.572973 loss=2.644553 time: 83.196638
[Epoch 9] train=0.578378 loss=1.702112 time: 81.053108
[Epoch 10] train=0.648649 loss=1.144696 time: 70.741769
[Epoch 11] train=0.637838 loss=1.272942 time: 80.186999
[Epoch 12] train=0.600000 loss=1.379443 time: 76.679555
[Epoch 13] train=0.627027 loss=0.955011 time: 71.060725
[Epoch 14] train=0.648649 loss=1.097545 time: 72.599410
[Epoch 15] train=0.627027 loss=1.369598 time: 78.906116
[Epoch 16] train=0.686486 loss=1.176971 time: 68.645279
[Epoch 17] train=0.600000 loss=1.458098 time: 69.332837

In [1]:
# save model
file_name = "/home/irteam/dcloud-global-dir/Ahpuh/net.params"
net.save_parameters(file_name)


In [43]:
file_name = "/home/irteam/dcloud-global-dir/Ahpuh/net.params"

new_net = get_model(name='slowfast_4x16_resnet50_kinetics400', nclass=4)
new_net.load_parameters(file_name)

In [44]:
from gluoncv.utils.filesystem import try_import_decord
decord = try_import_decord()

vr = decord.VideoReader('/home/irteam/dcloud-global-dir/Ahpuh/swim_dataset/train/else/_WD4C5G7GUE_000103_000113.mp4')
fast_frame_id_list = range(0, 64, 2)
slow_frame_id_list = range(0, 64, 16)
frame_id_list = list(fast_frame_id_list) + list(slow_frame_id_list)
video_data = vr.get_batch(frame_id_list).asnumpy()
clip_input = [video_data[vid, :, :, :] for vid, _ in enumerate(frame_id_list)]


ImportError: Decord is required, you can install by `pip install decord --user`         (note that this is unofficial PYPI package).

In [46]:
import cv2

input_file = '/home/irteam/dcloud-global-dir/Ahpuh/swim_dataset/train/else/_WD4C5G7GUE_000103_000113.mp4'
cap = cv2.VideoCapture(input_file)

result = []
while True:
    ret, frame = cap.read()
    if not ret:
        break

    print(frame.shape)
    # OpenCV decodes frames as BGR; convert to RGB, the channel order the normalization statistics below assume
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    # Resize the frame to the network input size
    resized_frame = cv2.resize(frame, (224, 224))
    result.append(resized_frame)

cap.release()
result = np.array(result)

# From the first 64 frames, sample 32 frames for the fast branch and 4 frames for the slow branch
fast_frame_id_list = range(0, 64, 2)
slow_frame_id_list = range(0, 64, 16)
frame_id_list = list(fast_frame_id_list) + list(slow_frame_id_list)
# Index by frame id (not by list position) so that the sampling strides are actually applied
clip_input = [result[fid] for fid in frame_id_list]

transform_test = transforms.Compose([
    # Transpose the video frames from height*width*num_channels to num_channels*height*width
    # and map values from [0, 255] to [0, 1]
    video.VideoToTensor(),
    # Normalize the video frames with mean and standard deviation calculated across all images
    video.VideoNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

clip_input = transform_test(clip_input)
clip_input = np.stack(clip_input, axis=0)
# Arrange the 36 frames as (batch, channels, frames, height, width)
clip_input = clip_input.reshape((-1,) + (36, 3, 224, 224))
clip_input = np.transpose(clip_input, (0, 2, 1, 3, 4))
print('Video data is loaded and preprocessed.')

(720, 1280, 3)
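
The last three preprocessing steps turn a list of 36 per-frame tensors into the 5-D layout the network expects. The sketch below replays them on synthetic data; the 8x8 spatial size and the fill values are made up so the time axis is easy to follow.

```python
import numpy as np

num_frames, height, width = 36, 8, 8
# Synthetic channel-first frames; frame t is filled with the value t
frames = [np.full((3, height, width), t, dtype=np.float32)
          for t in range(num_frames)]

clip = np.stack(frames, axis=0)                          # (36, 3, H, W)
clip = clip.reshape((-1, num_frames, 3, height, width))  # (1, 36, 3, H, W)
clip = np.transpose(clip, (0, 2, 1, 3, 4))               # (1, 3, 36, H, W)
print(clip.shape)  # (1, 3, 36, 8, 8)
```

After the transpose, axis 2 indexes time, so `clip[0, :, t]` is the full three-channel image of frame `t`.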

In [47]:
pred = new_net(nd.array(clip_input))
nd.topk(pred, k=4)[0]


[2. 0. 1. 3.]
<NDArray 4 @cpu(0)>

In [169]:
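from PIL import Image

# Undo VideoNormalize and the [0, 1] scaling so each frame can be saved as an 8-bit RGB image
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
frames = np.transpose(clip_input, (0, 2, 3, 4, 1))[0]  # (36, 224, 224, 3)
for i in range(36):
    img = Image.fromarray(np.clip((frames[i] * std + mean) * 255, 0, 255).astype(np.uint8), 'RGB')
    img.save('/home/irteam/didwldn3032-dcloud-dir/didwldn3032/2023-1-SCS4031-Ahpuh/test_imgs/img%s.jpg'%i)
    print("finish img saving")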

finish img saving


In [32]:
label = {0:'diving', 1:'drown', 2:'falldown', 3:'else'}
topK = 3
ind = nd.topk(pred, k=topK)[0].astype('int')
print('The input video clip is classified to be')
for i in range(topK):
    print('probability %.3f.'%
          (nd.softmax(pred)[0][ind[i]].asscalar()))

The input video clip is classified to be
probability 1.000.
probability 0.000.
probability 0.000.
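
The output above lists probabilities without class names. As a sketch of how the `label` dictionary could be attached to the ranking, the NumPy snippet below ranks a made-up score vector; `top_k_labels` is a hypothetical helper and the scores are not model output.

```python
import numpy as np

label = {0: 'diving', 1: 'drown', 2: 'falldown', 3: 'else'}


def top_k_labels(logits, k=3):
    """Return the k most probable (class name, probability) pairs."""
    shifted = logits - logits.max()          # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(probs)[::-1][:k]      # indices sorted by descending probability
    return [(label[int(i)], float(probs[i])) for i in order]


logits = np.array([1.0, 0.5, 4.0, -2.0])     # made-up scores for illustration
ranked = top_k_labels(logits)
for name, p in ranked:
    print('[%s], with probability %.3f.' % (name, p))
```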


Because the model is trained on a tiny subset, the accuracy is quite low.
You can `Start Training Now`_ on the full Kinetics400 dataset.

### References

.. [Wang15] Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao. \
    "Towards Good Practices for Very Deep Two-Stream ConvNets." \
    arXiv preprint arXiv:1507.02159 (2015).

