# Video Action Recognition

[*Applied Machine Learning for Health and Fitness*](https://www.apress.com/9781484257715) by Kevin Ashley (Apress, 2020).

[*Video Course*](http://ai-learning.vhx.tv) Need a deep dive? Watch my [*video course*](http://ai-learning.vhx.tv) that complements this book with additional examples and video-walkthroughs. 

[*Web Site*](http://activefitness.ai) for research and supplemental materials.

![](images/ch9/fig_9-1.jpeg)

## Background

Our brain is a super-fast action recognition system that's hard to match. In deep learning terms, it routinely performs many tasks to recognize actions, and it works *fast*! Whether it comes from years of evolution, the need to spot incoming danger, or the need to find food, each of us has a miraculously fast video action recognition engine that just works. Our brain is capable of *normalization* and *transformation*, since we recognize actions regardless of viewpoint. The human brain is great at *classification*, telling us what's moving and how, and it can also *predict* what's coming next. Deep learning for action recognition is getting close, but it's not as good just yet. This becomes especially clear when generalizing movements: the brain is far ahead of neural nets in its ability to generalize. Still, if data science keeps the same pace of evolution as in the last decade, AI may get closer to the human brain.

![](images/ch9/fig_9-2.png)

Recognizing actions from videos is key to many industries: sport, surveillance, robotics, health care and many others. For a practical sport data scientist or a coach, action recognition is part of the daily job: a coach's eye is trained by experience to analyze movement.

In biomechanics, we use models from classical mechanics to describe movement. This approach has worked for many years in sports science, but analytical methods are complicated. As history shows, although movements can be described with classical mechanics, deep learning methods can be more efficient and often demonstrate great precision. To illustrate this point: the Kinetics dataset described in this chapter contains 400 activity classes, more than a hundred of them sport activities, with recognition models readily available. You can classify any of those activities with an average computer, and you really don't need a PhD in biomechanics.

![](images/ch9/fig_9-3.png)

The Kinetics dataset has 650,000 video clips, covering hundreds of human actions and totaling close to a terabyte of data. That looks like a lot, but for each sport it covers only a few basic moves. A human ski coach evaluating a skier can break a run down into dozens of small movements and usually deals with multiple training routines.

Video recognition has traditionally been tough for deep learning because it needs more compute power and storage than most other types of data. In recent years, video recognition methods, models and datasets have made significant progress, to the point that they have become practical for a sport scientist. As this chapter shows, these methods are also relatively easy to use with the tools, frameworks and models available today. In this chapter, we focus on practical use and methods for video action recognition. You don't need advanced hardware, but a GPU-enabled computer is recommended. If you don't have one handy, consider an online service such as Microsoft Azure or Google Colab that offers scalable compute for data science, in some cases with a free tier.

## Video data

> Cinematography is a writing with images in movement and with sounds.
>
> Robert Bresson, Notes on the Cinematographer

Video classification has been an expensive task because video is heavy to store and process. In this chapter you'll learn the data structures for video that are used across most datasets and models for video recognition.

A single image, or video frame, can be represented as a 3D tensor: (height, width, color), with the color dimension holding three channels: RGB. A sequence of frames can be represented as a 4D tensor: (frame, height, width, color). For video classification, you will typically deal with batches of frame sequences, so the input is represented as a 5D tensor: (sample, frame, height, width, color). Models usually expect the color channels ahead of the frames, so the dimensions are reordered before classification, as the normalization code later in this chapter does.
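To make these layouts concrete, here is a minimal sketch with a small synthetic clip. It assumes PyTorch is installed, and the dimensions (8 frames of 72 x 128 pixels) are made up for illustration rather than taken from a real video. It shows the frame-first layout, the channels-first layout that 3D convolutions in PyTorch expect, and the extra batch dimension:

```python
import torch

# Synthetic clip in (frames, height, width, channels) layout: 8 RGB frames of 72 x 128.
clip = torch.zeros(8, 72, 128, 3, dtype=torch.uint8)
frame = clip[0]  # a single frame: 3D tensor (height, width, channels)

# 3D convolutions in PyTorch expect channels first: (channels, frames, height, width).
clip_cthw = clip.permute(3, 0, 1, 2)

# Adding a batch (sample) dimension gives the 5D tensor a model consumes.
batch = clip_cthw.unsqueeze(0)

print(frame.shape)      # torch.Size([72, 128, 3])
print(clip_cthw.shape)  # torch.Size([3, 8, 72, 128])
print(batch.shape)      # torch.Size([1, 3, 8, 72, 128])
```

Note that permute only changes how the data is indexed; no pixel values are modified.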

To read a video into structures ready for deep learning, frameworks provide convenience methods such as PyTorch's torchvision.io.read\_video. In the following code snippet, the video is loaded as a 4D tensor: 255 (frames) x 720 (height) x 1280 (width) x 3 (colors). Note that this method also loads audio, although we are not going to use it for action recognition:

```python
import torchvision.io

video_file = 'media/surfing_cutback.mp4'
video, audio, info = torchvision.io.read_video(video_file, pts_unit='sec')
print(video.shape, audio.shape, info)
```

```
Output:

torch.Size([255, 720, 1280, 3]) torch.Size([2, 407552]) {'video_fps': 29.97002997002997, 'audio_fps': 48000}
```

This original video is obviously too big to feed to a model directly: before using it further, we'll need to normalize it. This chapter provides sample code that normalizes the video.
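A quick back-of-the-envelope calculation shows how big "too big" is. The sketch below uses only the shape printed above and, as an assumption, the 112 x 112 frame size that the normalization later in this chapter produces; the helper num_values is a hypothetical name introduced for illustration:

```python
# Raw clip as returned by read_video: (frames, height, width, channels), one byte per value.
raw_shape = (255, 720, 1280, 3)
# Shape after resizing and cropping to 112 x 112, stored as 32-bit floats (4 bytes per value).
small_shape = (3, 255, 112, 112)

def num_values(shape):
    """Number of elements in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

raw_bytes = num_values(raw_shape) * 1      # uint8
small_bytes = num_values(small_shape) * 4  # float32

print(f"raw:        {raw_bytes / 1e6:.0f} MB")
print(f"normalized: {small_bytes / 1e6:.0f} MB")
```

Even with four bytes per value instead of one, the normalized clip is many times smaller than the decoded original.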

## Datasets

> *The relationship between space and time is a mysterious one.*\
> Carreira and Zisserman, "Quo Vadis, Action Recognition?"

*Quo vadis* is Latin for "where are you going?". Although action recognition is achievable from a still frame, it works best when the model learns from the temporal component as well as the spatial information.

![](images/ch9/fig_9-4.png)

From the still frame it's not easy to tell whether the person is swimming or running. Perhaps this ambiguity in action recognition is what prompted the authors of the research article on the Kinetics video dataset and the Two-Stream Inflated 3D ConvNet (I3D) architecture to choose that title.

Prior to Kinetics, Sports-1M was a breakthrough dataset. In the authors' own words:

> To obtain sufficient amount of data needed to train our CNN architectures, we collected a new Sports-1M dataset, which consists of 1 million YouTube videos belonging to a taxonomy of 487 classes of sports.
>
> Andrej Karpathy et al., "Large-scale Video Classification with Convolutional Neural Networks"

Historically, deep learning for video recognition focused on activities for which video was easily available. What source of video data can a data scientist use without having to store terabytes of videos? YouTube comes in very handy, as does any other online video service! You'll notice that many action recognition datasets use online video services because those videos are typically indexed, can be readily retrieved, and often have additional metadata that helps classify entire videos or even segments. In fact, with massive online video repositories storing billions of movements, combined with deep learning, we are on the verge of a revolution in movement recognition!

Some well-known datasets for human video action sequences include:

-   HMDB 51 is a set of 51 action categories, including facial, body movements and human interaction. This dataset contains some sport activities, but is limited to bike, fencing, baseball and a few others. Included in PyTorch: torchvision.datasets.HMDB51

-   UCF 101 is used in many action recognition scenarios, including human-object interaction, body motion, playing musical instruments and sports. Included in PyTorch: torchvision.datasets.UCF101

-   Kinetics is a large dataset of URL links to video clips that covers human action classes, including sports, human interaction etc. The dataset is available in different sizes: Kinetics 400, 600, 700 and is included in PyTorch: torchvision.datasets.Kinetics400

## Models

Video presents many challenges, such as computational cost and the need to capture both spatial and temporal action over long periods of time, but it also presents unique opportunities for designing models. Over the last few years, researchers have experimented with various approaches to video action recognition modelling. The methods that have proved most effective so far use pretrained networks and fuse several streams of data from the video, for example a motion stream from optical flow and a spatial stream from pretrained context [e.g. https://arxiv.org/pdf/1406.2199.pdf].


![Modern models use fusion of context streams: for example temporal and spatial for action recognition](images/ch9/fig_9-5.png)
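One simple way to fuse two streams is to average their class probabilities. The sketch below is an illustration under stated assumptions, not the implementation of any particular paper: the class names, the scores and the softmax helper are all hypothetical, and it uses plain Python so it runs without a deep learning framework:

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes from two separately trained streams.
classes = ["running", "swimming", "surfing water"]
spatial_logits = [1.2, 1.0, 0.3]    # appearance from a still frame: ambiguous
temporal_logits = [0.2, 2.5, 0.1]   # motion from optical flow: favors swimming

spatial = softmax(spatial_logits)
temporal = softmax(temporal_logits)

# Late fusion: average the per-class probabilities of the two streams.
fused = [(s + t) / 2 for s, t in zip(spatial, temporal)]
best = max(range(len(classes)), key=lambda i: fused[i])
print(classes[best])
```

Here the spatial stream alone slightly prefers running, but the motion stream tips the fused prediction toward swimming, which is the kind of ambiguity a temporal stream is meant to resolve.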

Some earlier methods benchmarked model performance with various context streams, for example by combining a main feature learning stream with a stream at a different resolution, which the authors of one study creatively called a "fovea" stream.

> Fovea -- a small depression in the retina of the eye where visual acuity is the highest.
>
> Oxford Dictionary

This area of deep learning is still under active research and we may still see state of the art models that outperform existing methods.

## Video Classification QuickStart

### Project 9-1. QuickStart Action Recognition

This project provides a quick start for video classification: the goal is to get a practical sport data scientist started quickly on human activity recognition. Before we start on this project, let's take a look at the list of human activities we can classify with minimal effort. I'll be using PyTorch here, because it provides video classification datasets and pretrained models out of the box. The PyTorch computer vision module, torchvision, contains many models and datasets we can use in sports data science, including classification, semantic segmentation, object detection, person keypoint detection and video classification. The video classification models and datasets included with torchvision are what we'll use to get started quickly.

In PyTorch, the video classification models are trained on the Kinetics 400 dataset. Not all of these human activities are sports related, so I put together a helper in utils.kinetics that conveniently provides a list of sport-related activities:

```python
from utils.kinetics import kinetics

categories = kinetics.categories()
classes = kinetics.classes()
sports = kinetics.sport_categories()

# print every sport category with its activity labels and count the labels
count = 0
for key in categories.keys():
    if key in sports:
        print(key)
        for label in categories[key]:
            count += 1
            print("\t{}".format(label))
print(f'Sport activities labels: {count}')
```

```
Output:

athletics - jumping
	high jump
	hurdling
	long jump
	parkour
	pole vault
	triple jump
athletics - throwing + launching
	archery
	catching or throwing frisbee
	disc golfing
	hammer throw
	javelin throw
	shot put
	throwing axe
	throwing ball
	throwing discus
ball sports
	bowling
	catching or throwing baseball
	catching or throwing softball
	dodgeball
	dribbling basketball
	dunking basketball
	golf chipping
	golf driving
	golf putting
	hitting baseball
	hurling (sport)
	juggling soccer ball
	kicking field goal
	kicking soccer ball
	passing American football (in game)
	passing American football (not in game)
	playing basketball
	playing cricket
	playing kickball
	playing squash or racquetball
	playing tennis
	playing volleyball
	shooting basketball
	shooting goal (soccer)
	shot put
golf
	golf chipping
	golf driving
	golf putting
gymnastics
	bouncing on trampoline
	cartwheeling
	gymnastics tumbling
	somersaulting
	vault
heights
	abseiling
	bungee jumping
	climbing a rope
	climbing ladder
	...
```

**Note on activity granularity:** the trend in activity recognition is toward ever more granular classes. For example, for golf, Kinetics 400 distinguishes chipping, driving and putting; for swimming, it distinguishes backstroke, breast stroke, butterfly, etc.

So, with the video classification models available in torchvision and pretrained on the Kinetics 400 dataset, we should be able to classify 130+ sports-related actions. To jump-start our development, we'll use these pretrained models. Let's start by importing the required modules:

```python
import torch
import torchvision
import torchvision.models as models

# check if cuda is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```

```
Output:

cuda
```


Next, we get an appropriate model trained on Kinetics 400. At the time of writing, torchvision supports three video models out of the box: ResNet 3D, ResNet Mixed Convolution and ResNet (2+1)D. I instantiated ResNet (2+1)D (r2plus1d\_18) and commented out the other two models. The important thing, of course, is the pretrained=True flag, which saves us from downloading all the videos of the Kinetics 400 dataset to train the model (newer torchvision releases replace this flag with a weights argument):

```python
#model = models.video.r3d_18(pretrained=True)
#model = models.video.mc3_18(pretrained=True)
model = models.video.r2plus1d_18(pretrained=True)
model.eval()  # switch to inference mode
```

```
Output:

VideoResNet(
  (stem): R2Plus1dStem(
    (0): Conv3d(3, 45, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3), bias=False)
    (1): BatchNorm3d(45, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): Conv3d(45, 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
    (4): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)
  )
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Sequential(
        (0): Conv2Plus1D(
          (0): Conv3d(64, 144, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (1): BatchNorm3d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Conv3d(144, 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        )
        (1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        ...
```

```python
# count the trainable parameters of the model
pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(pytorch_total_params)
```

```
Output:

31505325
```
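The total above is a sum of per-layer counts, and the first two convolutions in the printout are easy to check by hand. The sketch below is plain arithmetic; conv3d_params is a hypothetical helper, and the "full 3D" layer is an assumed comparison point, not a layer of this model:

```python
def conv3d_params(in_channels, out_channels, kernel_size, bias=False):
    """Number of learnable weights in a 3D convolution layer."""
    kt, kh, kw = kernel_size
    weights = out_channels * in_channels * kt * kh * kw
    return weights + (out_channels if bias else 0)

# The two convolutions of the (2+1)D stem shown in the printout above.
spatial = conv3d_params(3, 45, (1, 7, 7))    # 2D convolution over height and width
temporal = conv3d_params(45, 64, (3, 1, 1))  # 1D convolution over time

# For comparison: a single hypothetical full 3D convolution from 3 to 64 channels.
full_3d = conv3d_params(3, 64, (3, 7, 7))

print(spatial, temporal, spatial + temporal, full_3d)
```

Splitting a 3D convolution into a spatial step followed by a temporal step is what the "(2+1)D" in the model name refers to.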


Thanks to PyTorch magic, the pretrained model gets downloaded automatically, a huge time saver! Training on the Kinetics 400 dataset would require downloading a massive number of videos, so having a pretrained video classification model in PyTorch is a great starting point for a practical sport data scientist.

**Practical Tip:** Having a pretrained model for Kinetics in PyTorch torchvision offers a big advantage from the practical standpoint. Downloading datasets and videos to train video classification models takes a lot of space and compute power!

Next, we need to normalize the input. For the Kinetics models, frames are resized and cropped to 112 x 112 pixels, and each color channel is normalized with mean = [0.43216, 0.394666, 0.37645] and std = [0.22803, 0.22145, 0.216989]:

```python
# Normalization: Kinetics 400

mean = [0.43216, 0.394666, 0.37645]
std = [0.22803, 0.22145, 0.216989]

def normalize(video):
    # (frames, height, width, channels) uint8 -> (channels, frames, height, width) float in [0, 1]
    return video.permute(3, 0, 1, 2).to(torch.float32) / 255

def resize(video, size):
    # bilinear resize of the last two dimensions (height, width)
    return torch.nn.functional.interpolate(video, size=size, scale_factor=None, mode='bilinear', align_corners=False)

def crop(video, output_size):
    # center crop
    h, w = video.shape[-2:]
    th, tw = output_size
    i = int(round((h - th) / 2.))
    j = int(round((w - tw) / 2.))
    return video[..., i:(i + th), j:(j + tw)]

def normalize_base(video, mean, std):
    # reshape mean and std so they broadcast over the channel dimension
    shape = (-1,) + (1,) * (video.dim() - 1)
    mean = torch.as_tensor(mean).reshape(shape)
    std = torch.as_tensor(std).reshape(shape)
    return (video - mean) / std
```
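The same steps can be chained into one function and checked on a synthetic clip before touching a real video. This is a sketch under stated assumptions: preprocess is a hypothetical name, the random tensor stands in for a decoded video, and PyTorch is assumed to be installed:

```python
import torch
import torch.nn.functional as F

KINETICS_MEAN = [0.43216, 0.394666, 0.37645]
KINETICS_STD = [0.22803, 0.22145, 0.216989]

def preprocess(video, resize_to=(128, 171), crop_to=(112, 112)):
    """Turn a uint8 (T, H, W, C) clip into a normalized float (C, T, H, W) clip."""
    video = video.permute(3, 0, 1, 2).to(torch.float32) / 255  # scale to [0, 1]
    video = F.interpolate(video, size=resize_to, mode="bilinear", align_corners=False)
    h, w = video.shape[-2:]
    th, tw = crop_to
    top, left = int(round((h - th) / 2.0)), int(round((w - tw) / 2.0))
    video = video[..., top:top + th, left:left + tw]            # center crop
    mean = torch.tensor(KINETICS_MEAN).reshape(-1, 1, 1, 1)     # one value per channel
    std = torch.tensor(KINETICS_STD).reshape(-1, 1, 1, 1)
    return (video - mean) / std

# Synthetic stand-in for a decoded video: 16 random frames of 240 x 320 pixels.
fake_video = torch.randint(0, 256, (16, 240, 320, 3), dtype=torch.uint8)
clip = preprocess(fake_video)
print(clip.shape)  # torch.Size([3, 16, 112, 112])
```

Testing on a small random tensor first makes shape mistakes cheap to find, since decoding a real 720p video takes much longer.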

Next, we use the torchvision.io.read\_video method to read a video file and show its shape and some other useful information about the source video:

```python
import torchvision.io

video_file = 'media/surfing_cutback.mp4'
video, audio, info = torchvision.io.read_video(video_file, pts_unit='sec')
print(video.shape, audio.shape, info)
```

```
Output:

torch.Size([255, 720, 1280, 3]) torch.Size([2, 407552]) {'video_fps': 29.97002997002997, 'audio_fps': 48000}
```


As you can see, the original video has 255 frames at 720p resolution, which is relatively large for our model. We need to normalize the video before giving it to the model:

```python
video = normalize(video)
video = resize(video, (128, 171))
video = crop(video, (112, 112))
video = normalize_base(video, mean=mean, std=std)
# after normalize() the layout is (channels, frames, height, width)
channels, frames, height, width = video.shape
print(f'channels {channels}, frames {frames}, size {height} {width}')
```

```
Output:

channels 3, frames 255, size 112 112
```

Much better! After normalization, each frame is much smaller, only 112x112, and the tensor holds the 3 color channels for all 255 frames. If you have a GPU-enabled device with CUDA, you can accelerate the process by moving both the model and the video tensor to the CUDA-enabled device:

```python
# make use of accelerated CUDA if available
if torch.cuda.is_available():
    model.cuda()
    video = video.cuda()
```

Now comes the magic of applying our pretrained model to the surfer video. The result is an array of scores, one per class (activity). We can print the index of the class with the best score in the list of Kinetics activities. This may take some time if you only have a CPU, depending on your environment, so be patient:

```python
# score the video; no gradients are needed for inference
with torch.no_grad():
    score = model(video.unsqueeze(0))
# get prediction with max score
prediction = score.argmax()
print(prediction)
```

The resulting index is not very meaningful on its own, so let's look up the class name it represents with our utility script. It turns out to be 'surfing water', the class the Kinetics model was trained on to detect surfing. Our predicted result is correct!

```python
from utils.kinetics import kinetics

classes = kinetics.classes()
print(classes[prediction.item()])
```
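Beyond the single best class, it is often useful to look at the top few candidates and how confident the model is in each. The sketch below assumes PyTorch is installed and uses hypothetical scores and labels for five classes, where a real Kinetics 400 model returns 400 scores per clip:

```python
import torch

# Hypothetical raw scores for one clip over five classes (a real model returns 400).
labels = ["surfing water", "water skiing", "windsurfing", "sailing", "kitesurfing"]
score = torch.tensor([[4.0, 2.5, 2.0, 0.5, 1.0]])

# Softmax turns raw scores into probabilities that sum to 1 across classes.
probs = torch.softmax(score, dim=1)

# topk returns the k largest values together with their class indices.
top_probs, top_idx = probs.topk(3, dim=1)
for p, i in zip(top_probs[0].tolist(), top_idx[0].tolist()):
    print(f"{labels[i]}: {p:.2f}")
```

A small gap between the first and second probabilities is a hint that the clip is ambiguous for the model.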

In this example we used a custom video file from a consumer-grade 720p video camera. We used a PyTorch model pretrained on the Kinetics dataset for video classification of 400 activities, of which more than a hundred are sports related. We normalized the video and correctly classified the surfing action in it.

## Loading videos for classification

PyTorch includes a number of modules that simplify video classification. The previous project introduced video classification based on the pretrained models included in torchvision. We also used the torchvision.io.read\_video method to load videos into a convenient structure of tensors that includes video frames, audio and relevant video information. In the following project we'll take it further with some practical video loading and model training, as well as transfer learning for video classification.

### Project 9-2. Loading videos for classifier training

In this project I'll show you how to use video dataset modules, such as Kinetics400 and DataLoader, to visualize videos and prepare them for training. The Kinetics folder structure follows a common convention that includes train/test/validation folders, with videos split into the classes of actions we need to recognize. Since datasets such as Kinetics are based on indexed online videos, there are many scripts out there that simplify downloading videos for training and structuring them in folders. For now, we'll import the modules we need and define the base directory of our dataset:

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.models as models
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets.kinetics import Kinetics400
from torchvision.datasets.samplers import DistributedSampler, UniformClipSampler, RandomClipSampler
import matplotlib.pyplot as plt
from pathlib import Path
from functools import partial
from torchvision.io.video import read_video

# convenience helper: list the names of the entries in a directory
Path.ls = lambda x: [o.name for o in x.iterdir()]
# make read_video use seconds for presentation timestamps by default
read_video = partial(read_video, pts_unit='sec')
torchvision.io.read_video = partial(torchvision.io.read_video, pts_unit='sec')
```

```python
base_dir = Path('data/kinetics400/')
data_dir = base_dir/'dataset'
```

```python
# IPython shell escape: print the folder tree of the training set
!tree {data_dir/'train'}
```

Conveniently, torchvision.datasets includes the Kinetics400 dataset class, which serves as a cookie cutter for our project. Internally, video datasets use a VideoClips object to store video clip data:

```python
data = torchvision.datasets.Kinetics400(
            data_dir/'train',
            frames_per_clip=32,
            step_between_clips=1,
            frame_rate=None,
            extensions=('mp4',),
            num_workers=0
        )
```

```
Output:

100%|███████████████████████████████████████████████████████████████████████| 10/10 [00:34<00:00,  3.49s/it]
```


**Note:** Although you can and should take advantage of the multiprocessing nature of datasets, especially in a production environment, on some systems you may get an error. Setting num_workers = 0 makes sure the dataset is loaded in a single process.

According to the constructor above, each video clip loaded with our dataset should be a 4D tensor with the shape (frames, height, width, channels), in our case 32 frames of RGB video. Note that Kinetics doesn't require all clips to have the same height and width:

```python
print((data[0][0]).shape)
```

```
Output:

torch.Size([32, 480, 272, 3])
```


### Visualizing dataset

Sometimes, it may be handy to visualize the entire dataset catalog as a table, summarizing the number of frames. The helper function to_dataframe loads the entire video catalog into a Pandas DataFrame and displays the content:

```python
import pandas as pd
from utils.video_classification.helpers import to_dataframe

to_dataframe(data)
```

```
Output:

Length: 35271
```

|     | filepath | frames | fps | clips |
|-----|----------|--------|-----|-------|
| 0 | `data\kinetics400\dataset\train\playing_tennis\...` | 300 | 30.00000 | 269 |
| 1 | `data\kinetics400\dataset\train\playing_tennis\...` | 300 | 29.97003 | 269 |
| 2 | `data\kinetics400\dataset\train\playing_tennis\...` | 300 | 29.97003 | 269 |
| 3 | `data\kinetics400\dataset\train\playing_tennis\...` | 300 | 30.00000 | 269 |
| 4 | `data\kinetics400\dataset\train\playing_tennis\...` | 300 | 30.00000 | 269 |
| ... | ... | ... | ... | ... |
| 145 | `data\kinetics400\dataset\train\surfing_water\Y...` | 250 | 25.00000 | 219 |
| 146 | `data\kinetics400\dataset\train\surfing_water\Z...` | 300 | 29.97003 | 269 |
| 147 | `data\kinetics400\dataset\train\surfing_water\_...` | 178 | 29.97003 | 147 |
| 148 | `data\kinetics400\dataset\train\surfing_water\a...` | 119 | 15.00000 | 88 |

Let's say we want to look up the file path of a particular video in the dataset:

```python
VIDEO_NUMBER = 130
video_table = to_dataframe(data)
video_info = video_table['filepath'][VIDEO_NUMBER]
```

```
Output:

Length: 35271
```


With the IPython.display Video helper we can also show the video in the notebook, but keep in mind that setting embed=True while displaying the video may significantly increase the size of your notebook:

```python
from IPython.display import Video
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Video(video_info, width=400, embed=False)
```

So instead of embedding the video, it may be sufficient to just visualize the first and last frames:

```python
def show_clip_start_end(f):
    # show the first and the last frame of a clip
    last = len(f)
    plt.imshow(f[0])
    plt.title('frame: 1')
    plt.axis('off')
    plt.show()
    plt.imshow(f[last-1])
    plt.title(f'frame: {last}')
    plt.axis('off')
    plt.show()

show_clip_start_end(data[0][0])
```

## Video normalization

As with most data, video needs to be normalized before training. For the video classification models included in torchvision, this involves scaling image data to the range \[0,1\] and normalizing with the mean and standard deviation provided with the model:

```python
import utils.video_classification.transforms as T

t = torchvision.transforms.Compose([
        T.ToFloatTensorInZeroOne(),
        T.Resize((128, 171)),
        T.RandomHorizontalFlip(),
        T.Normalize(mean=[0.43216, 0.394666, 0.37645],
                    std=[0.22803, 0.22145, 0.216989]),
        T.RandomCrop((112, 112))
    ])
```

Once we've defined the transform, we can pass it to the Kinetics400 dataset:

```python
train_data = torchvision.datasets.Kinetics400(
            data_dir/'train',
            frames_per_clip=32,
            step_between_clips=1,
            frame_rate=None,
            transform=t,
            extensions=('mp4',),
            num_workers=0
        )
```

```
Output:

100%|███████████████████████████████████████████████████████████████████████| 10/10 [00:39<00:00,  3.97s/it]
```


The DataLoader class in PyTorch provides many useful features that are easy to use from Python, including iterable datasets, automatic batching, memory pinning, sampling, and customization of the data loading order.

## Finding learning rate

> *Never let formal education get in the way of your learning.*\
> Mark Twain

The learning rate is an important hyperparameter for training neural networks: if you make the learning rate too small, the model will likely converge too slowly.

**Mysterious constant:** The so-called Karpathy constant defines the best learning rate for Adam as 3e-4. Andrej Karpathy, the author of this famous data science tweet, said in a reply to his own tweet that it was a joke. Nevertheless, the constant made it to Urban Dictionary and many data science blogs.

We will not take this for granted, of course, and will use sound theory to find the best learning rate. In practice, a model trained with a learning rate that is too large may fail to converge. As an illustration, notice that when the learning rate is too large for gradient descent, the model overshoots and never reaches its minimum.

![](images/ch9/fig_9-7.png)
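The same effect can be reproduced in a few lines of plain Python. This is a toy sketch, assuming the simplest possible loss, f(x) = x², rather than a real network; gradient_descent is a hypothetical helper:

```python
def gradient_descent(lr, steps=20, start=1.0):
    """Minimize f(x) = x**2 with plain gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        grad = 2 * x        # derivative of x**2
        x = x - lr * grad
    return x

small_lr = gradient_descent(lr=0.1)   # each step shrinks x by a factor of 0.8
large_lr = gradient_descent(lr=1.1)   # each step flips the sign and grows x by 1.2

print(f"lr=0.1 -> x = {small_lr:.4f}")
print(f"lr=1.1 -> x = {large_lr:.4f}")
```

With the smaller rate x approaches the minimum at 0, while with the larger rate every step overshoots further than the previous one.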

To deal with this problem, Leslie N. Smith proposed a method for finding good learning rates [https://arxiv.org/pdf/1506.01186.pdf]. As a result, learning rate finders are now available for many frameworks, including fastai and PyTorch. For PyTorch, you can install the torch_lr_finder module with pip install torch-lr-finder and then use it in code:

```python
from torch_lr_finder import LRFinder
```

Next, get the dataset ready for the learning rate finder:

```python
from utils.video_classification.first_clip_sampler import FirstClipSampler
from torch.utils.data.dataloader import default_collate

def collate_fn(batch):
    # remove audio from the batch
    batch = [(d[0], d[2]) for d in batch]
    return default_collate(batch)

train_sampler = FirstClipSampler(train_data.video_clips, 2)
train_dl = DataLoader(train_data, batch_size=4, sampler=train_sampler, collate_fn=collate_fn, pin_memory=True)
x, y = next(iter(train_dl))
x.shape, y.shape
```

```
Output:

(torch.Size([4, 3, 32, 112, 112]), torch.Size([4]))
```
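To see why the collate function drops the audio, here is a self-contained sketch that runs without any video files. It assumes PyTorch is installed; FakeClips is a hypothetical stand-in dataset that returns (video, audio, label) tuples with made-up sizes:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.dataloader import default_collate

class FakeClips(Dataset):
    """Hypothetical stand-in for a video dataset: yields (video, audio, label) tuples."""

    def __len__(self):
        return 8

    def __getitem__(self, index):
        video = torch.zeros(3, 8, 16, 16)     # (channels, frames, height, width)
        audio = torch.zeros(1, 100 + index)   # audio length differs between samples
        label = index % 2
        return video, audio, label

def collate_fn(batch):
    # Keep video and label; drop audio, whose varying length cannot be stacked.
    return default_collate([(video, label) for video, _, label in batch])

loader = DataLoader(FakeClips(), batch_size=4, collate_fn=collate_fn)
x, y = next(iter(loader))
print(x.shape, y.shape)
```

The default collation stacks samples into one tensor per field, which only works when every sample has the same shape, so removing the unused audio track avoids that problem altogether.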

```python
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6, weight_decay=1e-2)
# if you run out of memory here, try reducing the DataLoader batch_size above
lr_finder = LRFinder(model, optimizer, criterion, device=device)
lr_finder.range_test(train_dl, end_lr=10, num_iter=90)
lr_finder.plot()   # plot loss against learning rate
lr_finder.reset()  # restore the model and optimizer to their initial state
```

With video action recognition, where the data is large and training times vary widely, it pays to choose the learning rate carefully. A good choice is often near the middle of the descending part of the loss curve. The module plots the loss curve, and the optimal learning rate in the chart below is somewhere near lr = 1e-2:

![](images/ch9/fig_9-8.png)

Optimal learning rate is found around the middle of descending loss curve, on this figure around 10\^-2.
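The idea behind a range test is to sweep the learning rate over several orders of magnitude while recording the loss. The sketch below shows one way such a sweep could be generated, using the start and end values from this chapter. It is an assumption for illustration (the exact schedule inside torch-lr-finder may differ), and exponential_lr_sweep is a hypothetical helper:

```python
def exponential_lr_sweep(start_lr, end_lr, num_iter):
    """Learning rates that grow by a constant factor from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_iter - 1)) for i in range(num_iter)]

# Values matching the range test in this chapter: 1e-6 up to 10 over 90 iterations.
lrs = exponential_lr_sweep(start_lr=1e-6, end_lr=10, num_iter=90)
print(f"{lrs[0]:.1e} {lrs[45]:.1e} {lrs[-1]:.1e}")
```

Because the rates grow by a constant factor, they are evenly spaced on the logarithmic axis of the loss plot.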

## Training the model

Training a model for video action recognition in PyTorch follows the same principles as training an image classifier, but since video classification functionality is relatively new in PyTorch, it's worth including a small example in this chapter.

### Project 9-3. Video Recognition Model Training

To start, let's create two datasets, for training and validation, based on the built-in Kinetics400 class. The idea here is to take advantage of the built-in objects that PyTorch offers. I use the same normalizing video transformation t that was defined in the previous examples. On some systems you can get a significant speed improvement if you set num_workers > 0, but on my system I had to be conservative, so I keep it at zero (basically, it means don't take advantage of parallelization):

```python
train_data = torchvision.datasets.Kinetics400(
            data_dir/'train',
            frames_per_clip=32,
            step_between_clips=1,
            frame_rate=None,
            transform=t,
            extensions=('mp4',),
            num_workers=0
        )

valid_data = torchvision.datasets.Kinetics400(
            data_dir/'valid',
            frames_per_clip=32,
            step_between_clips=1,
            frame_rate=None,
            transform=t,
            extensions=('mp4',),
            num_workers=0
        )
```

PyTorch allows using familiar DataLoaders with video data. For video data, PyTorch also includes the VideoClips class, which is used for enumerating the clips in a video and for sampling clips while loading. FirstClipSampler in the example below uses the video\_clips property of the dataset to sample a specified number of clips from each video:

```python
train_sampler = FirstClipSampler(train_data.video_clips, 2)
train_dl = DataLoader(train_data,
                      batch_size=4,
                      sampler=train_sampler,
                      collate_fn=collate_fn,
                      pin_memory=True)
valid_sampler = FirstClipSampler(valid_data.video_clips, 2)
valid_dl = DataLoader(valid_data,
                      batch_size=4,
                      sampler=valid_sampler,
                      collate_fn=collate_fn,
                      pin_memory=True)
```

Loading and renormalizing video data can take a really long time, so you may want to save the normalized dataset in a cache directory:

```python
import os

# create the cache directory if needed and list its content
cache_dir = data_dir/'.cache'
os.makedirs(cache_dir, exist_ok=True)
cache_dir.ls()

# save the datasets, then load them back from the cache
torch.save(train_data, f'{cache_dir}/train')
torch.save(valid_data, f'{cache_dir}/valid')
train_data = torch.load(cache_dir/'train')
valid_data = torch.load(cache_dir/'valid')
```

Next, you initialize the model with hyper-parameters, including the learning rate obtained earlier. Note that since we'll be training the model, we instantiate it without weights (pretrained=False or omitted):

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# no pretrained weights: the model is trained from scratch
model = models.video.r2plus1d_18()
model.to(device)  # works with or without a GPU
lr = 1e-2
criterion = nn.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters(), lr)
lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(optim, max_lr=5e-1, steps_per_epoch=len(train_dl), epochs=10)
metrics_dir = cache_dir/'train-metrics'
```
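A one-cycle schedule raises the learning rate to a peak and then anneals it far below the starting value. The sketch below traces that shape with a dummy parameter instead of a video model; it assumes PyTorch is installed, and the max_lr and total_steps values are illustrative, not the ones used for training in this chapter:

```python
import torch

# A single dummy parameter is enough to drive an optimizer and a scheduler.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=1e-3)

total_steps = 100
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-2, total_steps=total_steps)

lrs = []
for step in range(total_steps):
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()                 # no gradients here, so the parameter is unchanged
    if step < total_steps - 1:
        scheduler.step()

print(f"start {lrs[0]:.1e}, peak {max(lrs):.1e}, end {lrs[-1]:.1e}")
```

Note that the first recorded rate is derived from max_lr rather than from the lr passed to the optimizer, so with this scheduler max_lr is the setting that controls the schedule.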

We use CrossEntropyLoss, a common loss for classification problems, and the Adam optimizer (the same one we used to find the learning rate). Next, we can train the model. In the example below I chose 10 epochs:

```python
import sys
import time
import datetime
from utils.video_classification.train import train_one_epoch, evaluate

start_time = time.time()

# train for 10 epochs, evaluating on the validation set after each one
for epoch in range(10):
    train_one_epoch(model,
                    criterion,
                    optim,
                    lr_scheduler,
                    train_dl, device,
                    epoch, print_freq=100)
    evaluate(model,
             criterion,
             valid_dl,
             device)

total_time = time.time() - start_time
total_time_str = str(datetime.timedelta(seconds=int(total_time)))
print('Training time {}'.format(total_time_str))
```

You can also save the model weights once it's trained:

```python
SAVED_MODEL_PATH = './videoresnet_action.pth'
torch.save(model.state_dict(), SAVED_MODEL_PATH)
```
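Saved weights are loaded back by creating the same architecture and calling load_state_dict. The sketch below demonstrates the round trip with a tiny linear layer and a temporary file as stand-ins, assuming PyTorch is installed; for the video model you would recreate r2plus1d_18 and load SAVED_MODEL_PATH instead:

```python
import os
import tempfile

import torch
import torch.nn as nn

# A tiny stand-in model; the same calls apply to a video model's state_dict.
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weights.pth")
    torch.save(model.state_dict(), path)

    # Recreate the architecture, then load the saved weights into it.
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path))

restored.eval()  # switch to inference mode before scoring
same = all(torch.equal(a, b) for a, b in zip(model.state_dict().values(),
                                             restored.state_dict().values()))
print(same)
```

Saving only the state_dict keeps the file independent of the Python class definition, at the cost of having to rebuild the model object before loading.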

## Summary

In this chapter we covered practical methods and tools for video action recognition and classification. We discussed data structures for loading, normalizing and storing videos, datasets for sports action classification, such as Kinetics, and deep learning models. Using readily available pre-trained models, we can classify hundreds of sport actions and train the models to recognize new activities. For a sport data scientist, this chapter provides practical examples for deep learning, movement analysis, action recognition on any video.

Although video action recognition is becoming more usable today and can now classify many kinds of actions, we are still far from the goal of generalized action recognition. That means that, as a sport data scientist, you are still left with a lot of work to apply video recognition in the field. Is this the right time to make video action recognition a part of your toolbox? With the practical examples and notebooks accompanying this chapter, I think that this is the right time for coaches and sport scientists to start using these methods in everyday sport data science.
