<a href="https://colab.research.google.com/github/Guillem96/activity-recognition/blob/master/notebooks/Kinetics400%20Obtaining%20Clip%20Candidates.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reproducing Kinetics: Obtaining candidate clips

In this notebook, we review one of the steps that the authors of the Kinetics dataset followed to gather a large number of *unbiased* human activity clips.

In the Kinetics-400 paper, the authors describe the collection process in two parts:

- how candidate videos were obtained from YouTube,
- and the postprocessing pipeline that was used to select among the candidates.

In the following cells, we are going to focus on the first point (obtaining candidate clips).

If you are interested in the data collection process, I recommend reading Section 3 of the [Kinetics-400 paper](https://arxiv.org/pdf/1705.06950.pdf).

`TL;DR`

Paper citation:

> clips for each class were obtained by first searching on YouTube for candidates, and then using Amazon Mechanical Turkers (AMT) to decide if the clip contains the action or not. Three or more confirmations (out of five) were required before a clip was accepted. The dataset was de-duped, by checking that only one clip is taken from each video, and that clips do not contain common video material. Finally, classes were checked for overlap and denoised.

## Preparing the environment

1. Install the custom activity recognition package for PyTorch.
2. Install `youtube-dl` to download a sample video from YouTube.

In [None]:
!rm -rf activity-recognition
!git clone "https://github.com/Guillem96/activity-recognition"
!cd activity-recognition && pip install .
!wget https://yt-dl.org/downloads/latest/youtube-dl -O /usr/local/bin/youtube-dl
!chmod a+rx /usr/local/bin/youtube-dl

In [None]:
#@markdown `import *`
import tqdm.auto as tqdm
from pathlib import Path
from base64 import b64encode

import torch
import torchvision
import torchvision.transforms as T

import ar
import ar.transforms as VT

import matplotlib.pyplot as plt
from IPython.display import HTML

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
#@markdown Display Video Function
def display_video(video, title=''):
    from matplotlib import animation
    from IPython.display import HTML, display

    fig = plt.figure()
    plt.axis('off')
    im = plt.imshow(video[0,:,:,:])

    plt.close() # this is required to not display the generated image

    def init():
        im.set_data(video[0,:,:,:])

    def animate(i):
        im.set_data(video[i,:,:,:])
        return im

    anim = animation.FuncAnimation(
        fig, animate, init_func=init, frames=video.shape[0], interval=50)
    display(HTML(f"<h1>{title}</h1>" + anim.to_html5_video()))


## Obtaining a clip candidate

To obtain a clip candidate, we perform two steps:

### 1. Download a video

Download a video that contains an activity belonging to the previously assembled list of actions.

In this case, we download a `making pizza` video.

In [None]:
!youtube-dl -f160 http://youtube.com/watch?v=-8Appls4ZFg

[youtube] -8Appls4ZFg: Downloading webpage
[download] Making pizza--8Appls4ZFg.mp4 has already been downloaded
[download] 100% of 3.26MiB


We can read the video with the `torchvision.io` package. Its `read_video` function returns the video frames, the audio frames, and a metadata dictionary. The video frames are a tensor with layout `[T, H, W, C]`, where:

- T is the total number of frames
- H, W are the height and width
- C is the number of channels used to encode the color. Usually RGB


In [None]:
mp4 = Path('/content/Making pizza--8Appls4ZFg.mp4').read_bytes()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=400 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

In [None]:
video, _, info = torchvision.io.read_video(
    '/content/Making pizza--8Appls4ZFg.mp4', pts_unit='sec')
print('Video shape:', video.size())
print('Video info:', 'Video(' + ', '.join(f'{k}={v}' for k, v in info.items()) + ')')

Video shape: torch.Size([3729, 144, 82, 3])
Video info: Video(video_fps=14.952991532868545)


### 2. Temporal positioning within a video

Image classifiers are fast and simple to train. Moreover, pretrained classifiers are readily available for a large number of human actions. These classifiers are obtained by tracking user actions on Google Image Search.

In our example, we use a classifier pretrained on images returned by Google searches for the Kinetics-400 action list.

This classifier is run at the frame level over the `making pizza` video from the previous step. Starting from the frame with the highest response, we generate the candidate clip by going $\frac{s}{2}$ seconds backward and $\frac{s}{2}$ seconds forward, where $s$ is the desired duration of the candidate clip in seconds.
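
As a rough sketch of this windowing, the peak frame index and the frame rate are enough to compute the clip boundaries. The helper `clip_window` and its argument names are hypothetical and not part of the `ar` package. Clamping the window to the video bounds is an added assumption.

```python
def clip_window(peak_frame, fps, clip_len_s, num_frames):
    """Return [start, end) frame indices of a clip centred on peak_frame.

    Hypothetical helper for illustration: it goes clip_len_s / 2 seconds
    backward and forward from the peak, then clamps to the video bounds.
    """
    half_window = int(clip_len_s / 2 * fps)  # s / 2 seconds, in frames
    start = max(0, peak_frame - half_window)
    end = min(num_frames, peak_frame + half_window)
    return start, end


# Values of the same order as the video in this notebook (~15 fps, 3729 frames)
print(clip_window(peak_frame=1000, fps=15.0, clip_len_s=5, num_frames=3729))
# A peak close to the start of the video is clamped at frame 0
print(clip_window(peak_frame=10, fps=15.0, clip_len_s=5, num_frames=3729))
```

Without the clamp, a peak near the beginning of the video would produce a negative start index, which Python slicing interprets as counting from the end.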

First, we load a pretrained image classifier from the `ar` package.


In [None]:
classifier = ar.image.ImageClassifier.from_pretrained('sf-densenet-kinetics-400')
classifier.eval()
classifier.to(device);

Load the Kinetics action list to map the model output to names.

In [None]:
classes = Path('/content/activity-recognition/data/kinetics400.names').read_text().splitlines()
classes[:10]

['abseiling',
 'air drumming',
 'answering questions',
 'applauding',
 'applying cream',
 'archery',
 'arm wrestling',
 'arranging flowers',
 'assembling computer',
 'auctioning']

In [None]:
#@markdown To make the video processing faster, we skip some frames. Since consecutive frames are highly correlated, skipping frames loses little information. <br>
#@markdown Number of frames to skip for each prediction
SKIP_FRAMES = 2 #@param {type: "slider", min:1, max:10 }

#@markdown Length in seconds of the generated candidate clip
CLIP_LEN = 5 #@param {type: "slider", min:2, max:10 }
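
To illustrate the effect of skipping frames, the sketch below shows how many frames survive a strided slice and how an index in the subsampled video maps back to the full video. Both helper names are hypothetical. The frame count 3729 is the one reported for the downloaded video.

```python
def subsampled_length(num_frames, skip):
    """Number of frames kept by a slice such as video[::skip]."""
    return len(range(0, num_frames, skip))


def original_index(subsampled_idx, skip):
    """Map an index in video[::skip] back to an index in the full video."""
    return subsampled_idx * skip


print(subsampled_length(3729, 2))  # frames left with a skip of 2
print(original_index(500, 2))      # subsampled frame 500 in the full video
```

The second mapping matters later: the frame with the highest response is found in the subsampled video, so its index has to be multiplied by the skip factor before slicing the original video.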

In [None]:
# Video tensor after skipping frames
video_t = video[::SKIP_FRAMES]

# Resize and normalize the video
tfms = T.Compose([
    VT.VideoToTensor(),
    VT.VideoResize((128, 128)),
    VT.VideoNormalize(**VT.imagenet_stats)
])

To save memory, we split the frames into batches and feed them to the classifier one batch at a time.

In [None]:
video_t = tfms(video_t)
video_t = video_t.permute(1, 0, 2, 3)

batch = 64
predictions = []
with torch.no_grad():
    for i in tqdm.trange(0, video_t.size(0), batch):
        inp = video_t[i: i + batch]
        preds = classifier(inp.to(device))
        predictions.append(preds.cpu())

predictions = torch.cat(predictions, dim=0)
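
The batch boundaries used by a loop like the one above can be checked with plain Python. This is only an illustrative sketch: `batch_slices` is a hypothetical helper, and 1865 is the number of subsampled frames in this notebook.

```python
import math


def batch_slices(num_items, batch_size):
    """Yield (start, end) index pairs that cover num_items in order."""
    for start in range(0, num_items, batch_size):
        # The last batch may be shorter than batch_size
        yield start, min(start + batch_size, num_items)


slices = list(batch_slices(1865, 64))
print(len(slices), slices[0], slices[-1])
# The number of batches is the ceiling of num_items / batch_size
assert len(slices) == math.ceil(1865 / 64)
```

Slicing a tensor past its end simply returns the remaining elements, which is why the notebook loop does not need the explicit `min`.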


For each sampled frame, we obtain a probability distribution over the 400 actions.

In [None]:
predictions.size()

torch.Size([1865, 400])

In [None]:
max_probs, labels = predictions.max(dim=-1)
max_response_frame = max_probs.argmax()

print('Highest response action:', classes[labels[max_response_frame]])

Highest response action: making pizza
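
The selection above can be read as two nested maxima: first the best class per frame, then the frame whose best score is the largest. The toy sketch below reproduces that logic on plain Python lists. The function `peak_frame` and the scores are made up for illustration.

```python
def peak_frame(scores):
    """Return (frame_index, class_index) of the single highest score.

    scores is a list with one list of class scores per frame.
    """
    # Best (class_index, score) pair for every frame
    best_per_frame = [max(enumerate(frame), key=lambda p: p[1])
                      for frame in scores]
    # Frame whose best score is the largest overall
    frame_idx = max(range(len(scores)), key=lambda i: best_per_frame[i][1])
    return frame_idx, best_per_frame[frame_idx][0]


toy_scores = [
    [0.2, 0.5, 0.3],  # frame 0
    [0.1, 0.1, 0.8],  # frame 1
    [0.6, 0.3, 0.1],  # frame 2
]
print(peak_frame(toy_scores))
```

Here frame 1 wins because its top score, 0.8 for class 2, is the highest value in the whole table.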


Once we know the frame with the highest response, we can simply generate the clip by going $\frac{s}{2}$ seconds backward and $\frac{s}{2}$ seconds forward.


In [None]:
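# Map the index in the subsampled video back to the original video
candidate_frame = int(max_response_frame) * SKIP_FRAMES
time_window_seconds = CLIP_LEN / 2
time_window_frames = int(time_window_seconds * info['video_fps'])
# Clamp the start so a peak near the beginning does not give a negative index
start_frame = max(0, candidate_frame - time_window_frames)
candidate_clip = video[start_frame: candidate_frame + time_window_frames]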

In [None]:
display_video(candidate_clip, title='Candidate')

## Conclusion

That's all for the "Obtaining clip candidates" section. In the next steps, human annotators validate that the candidate clips actually contain the expected action.