# Fine-tuning for Video Classification with ViViT
### Abstract
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic/tree/main/scenic/projects/vivit

https://arxiv.org/pdf/2103.15691

![image.png](./static/vivit.png)


## Embeddings
### Uniform frame sampling 
A straightforward method of tokenising the input video is to uniformly sample n_t frames from the input video clip, embed each 2D frame independently using the same method as ViT, and concatenate all these tokens together. Concretely, if n_h · n_w non-overlapping image patches are extracted from each frame, then a total of n_t · n_h · n_w tokens will be forwarded through the transformer encoder. Intuitively, this process may be seen as simply constructing a large 2D image to be tokenised following ViT.
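
As a quick check of the token count described above, here is a minimal sketch. The helper name and the example values (8 frames, 224x224 pixels, 16x16 patches) are assumptions chosen for illustration and are not taken from the paper.

```python
def num_tokens_uniform(n_t: int, height: int, width: int, patch: int) -> int:
    """Tokens produced when n_t frames are each split into ViT-style patches."""
    n_h = height // patch  # patches along the frame height
    n_w = width // patch   # patches along the frame width
    return n_t * n_h * n_w


# Assumed example values: 8 sampled frames of 224x224 pixels, 16x16 patches.
tokens = num_tokens_uniform(n_t=8, height=224, width=224, patch=16)
print(tokens)  # 8 * 14 * 14 = 1568
```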

### Tubelet embedding
An alternative method is to extract non-overlapping, spatio-temporal “tubes” from the input volume and to linearly project each of them to R^d. This method is an extension of ViT’s embedding to 3D, and corresponds to a 3D convolution.
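
The following NumPy sketch illustrates the idea under stated assumptions: it cuts a video into non-overlapping tubes with a reshape and applies one shared linear projection, which should match a 3D convolution whose kernel and stride both equal the tubelet size. The function name, the random weights, and the small embedding size `d = 32` are placeholders, not the ViViT implementation.

```python
import numpy as np


def tubelet_embed(video, t, h, w, weight):
    """Split `video` (T, H, W, C) into t x h x w tubes and project each to d dims."""
    T, H, W, C = video.shape
    n_t, n_h, n_w = T // t, H // h, W // w
    # Crop any remainder so the volume divides evenly into tubes.
    video = video[: n_t * t, : n_h * h, : n_w * w]
    # (n_t, t, n_h, h, n_w, w, C) -> (n_t, n_h, n_w, t, h, w, C)
    tubes = video.reshape(n_t, t, n_h, h, n_w, w, C).transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten every tube into one vector, then apply the shared projection.
    flat = tubes.reshape(n_t * n_h * n_w, t * h * w * C)
    return flat @ weight


rng = np.random.default_rng(0)
video = rng.random((10, 224, 224, 3), dtype=np.float32)
d = 32  # assumed embedding size, kept small for the sketch
weight = rng.standard_normal((2 * 16 * 16 * 3, d)).astype(np.float32)
tokens = tubelet_embed(video, t=2, h=16, w=16, weight=weight)
print(tokens.shape)  # (980, 32): 5 * 14 * 14 tubes, each projected to d dims
```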

### Hugging Face ViViT
https://huggingface.co/docs/transformers/main/model_doc/vivit

# Dataset
https://paperswithcode.com/dataset/kinetics-400-1

# Download Dataset sayakpaul/ucf101-subset
#### Complete UCF101
UCF101 is an action recognition data set of realistic action videos, collected from YouTube, having 101 action categories. This data set is an extension of UCF50 data set which has 50 action categories.

With 13320 videos from 101 action categories, UCF101 gives the largest diversity in terms of actions and with the presence of large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc, it is the most challenging data set to date. As most of the available action recognition data sets are not realistic and are staged by actors, UCF101 aims to encourage further research into action recognition by learning and exploring new realistic action categories.

https://www.crcv.ucf.edu/research/data-sets/ucf101/

In [1]:
from huggingface_hub import hf_hub_download
import os
hf_dataset_identifier = "sayakpaul/ucf101-subset"
filename = "UCF101_subset.tar.gz"
file_path = hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset", local_dir="./data")
file_path



'data/UCF101_subset.tar.gz'

In [2]:
os.getcwd()

'/Users/layhenghok/Desktop/SUSTech/Year3Semester2/CS326-Group-Projects-II/Code/ViViT/ViViT-Driving-Scene'

In [3]:
import tarfile
import os
with tarfile.open("./data/UCF101_subset.tar.gz") as t:
     t.extractall("./data")

In [2]:
import numpy as np
import torch
import wandb
from transformers import Trainer, TrainingArguments

from data_handling import frames_convert_and_create_dataset_dictionary
from model_configuration import *
from model_configuration import initialise_model
from preprocessing import create_dataset


In [5]:
from dotenv import load_dotenv
import os
env_path = ".env"
# load_dotenv returns False when it loads no variables (for example, when the file is missing).
load_dotenv(env_path)

False

# Base Model

https://github.com/google-research/scenic/tree/main/scenic/projects/vivit

### google/vivit-b-16x2-kinetics400

![image.png](./static/models.png)


https://huggingface.co/docs/transformers/main/model_doc/vivit

In [6]:
import model_configuration
from model_configuration import compute_metrics
import cv2
import av
from data_handling import sample_frame_indices, read_video_pyav

In [7]:
container = av.open("./data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g03_c01.avi")

In [8]:
container.streams.video[0].frames

209

In [23]:
import moviepy.editor

In [10]:
container = av.open("./data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g03_c01.avi")
indices = sample_frame_indices(clip_len=50, frame_sample_rate=2,seg_len=container.streams.video[0].frames)
video = read_video_pyav(container=container, indices=indices)

In [11]:
indices

array([ 27,  29,  31,  33,  35,  37,  39,  41,  43,  45,  47,  49,  51,
        53,  55,  57,  59,  61,  63,  65,  67,  69,  71,  73,  75,  78,
        80,  82,  84,  86,  88,  90,  92,  94,  96,  98, 100, 102, 104,
       106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126])

In [13]:
video.shape

(50, 224, 224, 3)

In [14]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

In [15]:
path_files = "./data/UCF101_subset"
video_dict, class_labels = frames_convert_and_create_dataset_dictionary(path_files)

Processing file ./data/UCF101_subset/test/BalanceBeam/v_BalanceBeam_g11_c02.avi number of Frames: 68
Processing file ./data/UCF101_subset/test/BalanceBeam/v_BalanceBeam_g11_c04.avi number of Frames: 116
Processing file ./data/UCF101_subset/test/BalanceBeam/v_BalanceBeam_g20_c01.avi number of Frames: 84
Processing file ./data/UCF101_subset/test/BalanceBeam/v_BalanceBeam_g20_c03.avi number of Frames: 100
Processing file ./data/UCF101_subset/test/BaseballPitch/v_BaseballPitch_g11_c02.avi number of Frames: 74
Processing file ./data/UCF101_subset/test/BaseballPitch/v_BaseballPitch_g11_c04.avi number of Frames: 58
Processing file ./data/UCF101_subset/test/BaseballPitch/v_BaseballPitch_g24_c02.avi number of Frames: 119
Processing file ./data/UCF101_subset/test/BaseballPitch/v_BaseballPitch_g24_c06.avi number of Frames: 120
Processing file ./data/UCF101_subset/test/BaseballPitch/v_BaseballPitch_g24_c04.avi number of Frames: 116
Processing file ./data/UCF101_subset/test/BaseballPitch/v_Baseball

In [16]:
len(video_dict)

405

In [17]:
video_dict[0].keys()

dict_keys(['video', 'labels'])

In [18]:
video_dict[0]['video'].shape

(10, 224, 224, 3)

In [19]:
video_dict[0]['labels']

'BalanceBeam'

In [20]:
num_frames, height, width, channels =  video_dict[0]['video'].shape
num_frames, height, width, channels 

(10, 224, 224, 3)

# Display Video sample

In [21]:
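filename = "./tmp/saved.mp4"
codec_id = "mp4v"  # FourCC code of the video codec
fourcc = cv2.VideoWriter_fourcc(*codec_id)
out = cv2.VideoWriter(filename, fourcc=fourcc, fps=2, frameSize=(width, height))

# Iterating over the array yields one (height, width, 3) frame at a time.
for frame in video_dict[0]['video']:
    # PyAV decodes to RGB while OpenCV writes BGR; this assumes uint8 frames.
    out.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
out.release()  # finalise the file so it can be played back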



In [22]:
container2 = av.open("./static/sd.mp4")
moviepy.editor.ipython_display(container2.name)

In [23]:
class_labels = sorted(class_labels)
label2id = {label: i for i, label in enumerate(class_labels)}
id2label = {i: label for label, i in label2id.items()}

print(f"Unique classes: {list(label2id.keys())}.")

Unique classes: ['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress'].


In [24]:
shuffled_dataset = create_dataset(video_dict)

Casting to class labels: 100%|██████████| 405/405 [00:00<00:00, 1204.29 examples/s]
Map: 100%|██████████| 405/405 [11:12<00:00,  1.66s/ examples]
Map: 100%|██████████| 405/405 [02:41<00:00,  2.51 examples/s]


In [25]:
shuffled_dataset['train'].features

{'labels': ClassLabel(names=['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress'], id=None),
 'pixel_values': Sequence(feature=Sequence(feature=Sequence(feature=Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None), length=-1, id=None), length=-1, id=None), length=-1, id=None)}

In [26]:
model = model_configuration.initialise_model(shuffled_dataset, device)

Some weights of VivitForVideoClassification were not initialized from the model checkpoint at google/vivit-b-16x2-kinetics400 and are newly initialized because the shapes did not match:
- vivit.embeddings.position_embeddings: found shape torch.Size([1, 3137, 768]) in the checkpoint and torch.Size([1, 981, 768]) in the model instantiated
- classifier.weight: found shape torch.Size([400, 768]) in the checkpoint and torch.Size([10, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([400]) in the checkpoint and torch.Size([10]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
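
The two `position_embeddings` shapes in the warning above can be reproduced with a little arithmetic. The sketch below assumes a tubelet size of 2x16x16 (as the "16x2" in the checkpoint name suggests), a checkpoint pretrained on 32 frames, and one extra position for the classification token. The helper name is hypothetical.

```python
def vivit_sequence_length(num_frames, image_size, tubelet=(2, 16, 16)):
    """Number of positions: one per tubelet plus one classification token."""
    t, h, w = tubelet
    return (num_frames // t) * (image_size // h) * (image_size // w) + 1


print(vivit_sequence_length(32, 224))  # 3137, the shape found in the checkpoint
print(vivit_sequence_length(10, 224))  # 981, the shape of the model instantiated here
```

Because the sequence length changes with the number of frames, the position embeddings are re-initialised along with the 10-class classifier head.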


In [27]:
training_output_dir = "/tmp/results"
training_args = TrainingArguments(
    output_dir=training_output_dir,         
    num_train_epochs=3,             
    per_device_train_batch_size=2,   
    per_device_eval_batch_size=2,    
    learning_rate=5e-05,            
    weight_decay=0.01,              
    logging_dir="./logs",           
    logging_steps=10,                
    seed=42,                       
    evaluation_strategy="steps",    
    eval_steps=10,                   
    warmup_steps=int(0.1 * 20),      
    optim="adamw_torch",          
    lr_scheduler_type="linear",      
    # fp16=True,  
    report_to="wandb"
)
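
The cell above passes `warmup_steps=int(0.1 * 20)`, which evaluates to 2. If a warmup of roughly 10% of the run was intended, the step count can be derived from the dataset size instead. The sketch below is only an illustration: the training-set size of 364 is an assumption inferred from the 546 global steps this run reports, and it ignores gradient accumulation and multi-device training.

```python
import math


def total_training_steps(num_train_samples, batch_size, num_epochs):
    """Optimizer steps for a run without gradient accumulation on one device."""
    return math.ceil(num_train_samples / batch_size) * num_epochs


num_train_samples = 364  # assumption: inferred from the 546 steps logged by this run
total = total_training_steps(num_train_samples, batch_size=2, num_epochs=3)
print(total)             # 546
print(int(0.1 * total))  # 54 warmup steps for a 10% warmup
print(int(0.1 * 20))     # 2, the value passed above
```

`TrainingArguments` also accepts a `warmup_ratio` argument, which avoids computing the step count by hand.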

In [28]:
wandb_key =  os.getenv("WANDB_API_KEY")
wandb.login(key=wandb_key)

PROJECT = "ViViT"
MODEL_NAME = "google/vivit-b-16x2-kinetics400"
DATASET = "sayakpaul/ucf101-subset"

wandb.init(project=PROJECT, # the project I am working on
           tags=[MODEL_NAME, DATASET],
           notes ="Fine tuning ViViT with ucf101-subset")

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mhoklayheng[0m ([33mhoklayheng-southern-university-of-science-technology[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [29]:
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-05, betas=(0.9, 0.999), eps=1e-08)
# Define the trainer
trainer = Trainer(
    model=model,                      
    args=training_args,              
    train_dataset=shuffled_dataset["train"],      
    eval_dataset=shuffled_dataset["test"],       
    optimizers=(optimizer, None),  
    compute_metrics = compute_metrics
)

In [30]:
with wandb.init(project=PROJECT, job_type="train",
                tags=[MODEL_NAME, DATASET],
                notes=f"Fine tuning {MODEL_NAME} with {DATASET}."):
    train_results = trainer.train()



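| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 10 | 2.3493 | 2.183971 | 0.268293 |
| 20 | 2.0869 | 1.818844 | 0.512195 |
| 30 | 1.6602 | 1.61266 | 0.512195 |
| 40 | 1.5172 | 1.348984 | 0.731707 |
| 50 | 1.2714 | 1.301148 | 0.682927 |
| 60 | 1.1721 | 1.455876 | 0.634146 |
| 70 | 1.3161 | 1.115609 | 0.658537 |
| 80 | 0.9991 | 0.859106 | 0.829268 |
| 90 | 0.701 | 0.746046 | 0.731707 |
| 100 | 0.6524 | 0.754186 | 0.780488 |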


Run history:
eval/accuracy,▁▃▆▅▅▇▆▆▆▇█▇▇██▇██▇▇▇███▇███████████████
eval/loss,█▇▆▅▅▄▃▃▃▂▂▂▂▂▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
eval/runtime,▁▁▄▃▃▅▆▃▆▅▆▄▄▇▄▃▆▅▃▅▇▄▁▅▅▄█▄▇▂▄▃█▆█▇▄▅▅▂
eval/samples_per_second,█▅▆▆▆▃▅▃▄▁▄▅▂▅▅▅▃▆▄▅▅█▄▄▆▄▂▅▇▄▆▁▅▅▁▂▅▄▄▇
eval/steps_per_second,▅▇▆▆▄▆▃▄▁▃▂▅▅▆▃▆▅▂▅█▄▆▅▄▇▅▇▄▅▆▅▅▃▁▁▅▄▄▄▇
train/epoch,▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇██
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇█████
train/grad_norm,█▅▅▃▄▃▃▃▃▃▂▅▆▅▁▂▁▄▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/learning_rate,████▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁
train/loss,█▇▆▅▄▃▃▂▃▃▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
eval/accuracy,0.92683
eval/loss,0.12538
eval/runtime,21.4256
eval/samples_per_second,1.914
eval/steps_per_second,0.98
total_flos,8.580287827825459e+17
train/epoch,3.0
train/global_step,546.0
train/grad_norm,0.07833
train/learning_rate,0.0


In [31]:
trainer.save_model("model")
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

***** train metrics *****
  epoch                    =         3.0
  total_flos               = 799101575GF
  train_loss               =      0.3613
  train_runtime            =  0:34:22.73
  train_samples_per_second =       0.529
  train_steps_per_second   =       0.265


In [32]:
custom_path = "./model"

In [61]:
with wandb.init(project=PROJECT, job_type="models"):
  artifact = wandb.Artifact("ViViT-Fine-Tuned", type="model")
  artifact.add_dir(custom_path)
  wandb.save(custom_path)
  wandb.log_artifact(artifact)


[34m[1mwandb[0m: Adding directory to artifact (./model)... Done. 0.7s


# Inference

In [4]:
path_files_val = "./data/UCF_101_subset"
video_dict_val, class_labels_val = frames_convert_and_create_dataset_dictionary(path_files_val)

Processing file ./data/UCF_101_subset/test/BalanceBeam/v_BalanceBeam_g11_c02.avi number of Frames: 68
Processing file ./data/UCF_101_subset/test/BalanceBeam/v_BalanceBeam_g11_c04.avi number of Frames: 116
Processing file ./data/UCF_101_subset/test/BalanceBeam/v_BalanceBeam_g20_c01.avi number of Frames: 84
Processing file ./data/UCF_101_subset/test/BalanceBeam/v_BalanceBeam_g20_c03.avi number of Frames: 100
Processing file ./data/UCF_101_subset/test/BaseballPitch/v_BaseballPitch_g11_c02.avi number of Frames: 74
Processing file ./data/UCF_101_subset/test/BaseballPitch/v_BaseballPitch_g11_c04.avi number of Frames: 58
Processing file ./data/UCF_101_subset/test/BaseballPitch/v_BaseballPitch_g24_c02.avi number of Frames: 119
Processing file ./data/UCF_101_subset/test/BaseballPitch/v_BaseballPitch_g24_c06.avi number of Frames: 120
Processing file ./data/UCF_101_subset/test/BaseballPitch/v_BaseballPitch_g24_c04.avi number of Frames: 116
Processing file ./data/UCF_101_subset/test/BaseballPitch/

In [5]:
val_dataset = create_dataset(video_dict_val)

Casting to class labels: 100%|██████████| 75/75 [00:00<00:00, 2977.39 examples/s]
Map: 100%|██████████| 75/75 [01:24<00:00,  1.13s/ examples]
Map: 100%|██████████| 75/75 [00:30<00:00,  2.48 examples/s]


In [6]:
import wandb
run = wandb.init()
artifact = run.use_artifact('hoklayheng-southern-university-of-science-technology/uncategorized/ViViT-Fine-Tuned:v0', type='model')
artifact_dir = artifact.download()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mhoklayheng[0m ([33mhoklayheng-southern-university-of-science-technology[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: Downloading large artifact ViViT-Fine-Tuned:v0, 331.90MB. 3 files... 
[34m[1mwandb[0m:   3 of 3 files downloaded.  
Done. 0:0:1.4


In [7]:
artifact_dir

'/Users/layhenghok/Desktop/SUSTech/Year3Semester2/CS326-Group-Projects-II/Code/ViViT/ViViT-Driving-Scene/artifacts/ViViT-Fine-Tuned:v0'

In [8]:
val_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'pixel_values'],
        num_rows: 67
    })
    test: Dataset({
        features: ['labels', 'pixel_values'],
        num_rows: 8
    })
})

In [9]:
from data_handling import generate_all_files
import os
import numpy as np
import av
from pathlib import Path
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    '''
    Sample a given number of frame indices from the video.
    Args:
        clip_len (`int`): Total number of frames to sample.
        frame_sample_rate (`int`): Sample every n-th frame.
        seg_len (`int`): Maximum allowed index of sample's last frame.
    Returns:
        indices (`List[int]`): List of sampled frame indices
    '''
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices
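
One edge case is worth noting: `np.random.randint(converted_len, seg_len)` raises a `ValueError` when the video has no more than `clip_len * frame_sample_rate` frames (for example, a 58-frame clip with `clip_len=50` and `frame_sample_rate=2`). The variant below is a hedged sketch of one possible fallback, not part of the original helpers; the function name and the optional `rng` parameter are assumptions.

```python
import numpy as np


def sample_frame_indices_safe(clip_len, frame_sample_rate, seg_len, rng=None):
    """Like sample_frame_indices, but spreads indices evenly over short videos."""
    if rng is None:
        rng = np.random.default_rng()
    converted_len = int(clip_len * frame_sample_rate)
    if converted_len >= seg_len:
        # Video too short for a random window: cover all frames evenly instead.
        return np.linspace(0, seg_len - 1, num=clip_len).astype(np.int64)
    end_idx = int(rng.integers(converted_len, seg_len))
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    return np.clip(indices, start_idx, end_idx - 1).astype(np.int64)


short = sample_frame_indices_safe(clip_len=50, frame_sample_rate=2, seg_len=58)
print(len(short), short.min(), short.max())  # 50 0 57
```

If a video has fewer frames than `clip_len`, the fallback repeats indices, and `read_video_pyav` as written would then return fewer than `clip_len` frames.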

In [10]:
from transformers import VivitConfig

labels = val_dataset['train'].features['labels'].names
config = VivitConfig.from_pretrained(artifact_dir)
config.num_labels = len(labels)  # `num_labels` is the attribute the classification head reads
config.id2label = {str(i): c for i, c in enumerate(labels)}
config.label2id = {c: str(i) for i, c in enumerate(labels)}
config.num_frames = 10
config.video_size = [10, 224, 224]

In [11]:
import gc
import torch
gc.collect()
torch.cuda.empty_cache()

In [13]:
from transformers import VivitImageProcessor, VivitForVideoClassification

In [14]:
image_processor = VivitImageProcessor.from_pretrained("google/vivit-b-16x2-kinetics400")
fine_tune_model = VivitForVideoClassification.from_pretrained(artifact_dir,config=config)

In [15]:
directory =  "./data/UCF_101_subset"

In [16]:
class_labels = []
true_labels = []
predictions = []
predictions_labels = []
for p in generate_all_files(Path(directory), only_files=True):
    # Paths look like <directory>/<split>/<class>/<file>.
    set_files, cls, file = Path(p).parts[-3:]  # split (train or test), class, file name
    file_name = os.path.join(directory, set_files, cls, file)
    true_labels.append(cls)
    # Record each class the first time it is seen.
    if cls not in class_labels:
        class_labels.append(cls)
    # Sample 10 frames from a random window of the video.
    container = av.open(file_name)
    indices = sample_frame_indices(clip_len=10, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
    video = read_video_pyav(container=container, indices=indices)
    inputs = image_processor(list(video), return_tensors="pt")
    with torch.no_grad():
        outputs = fine_tune_model(**inputs)
        logits = outputs.logits

    # The fine-tuned head predicts one of the UCF101-subset classes.
    predicted_label = logits.argmax(-1).item()
    prediction = fine_tune_model.config.id2label[str(predicted_label)]
    predictions.append(prediction)
    predictions_labels.append(predicted_label)
    print(f"file {file_name} True Label {cls}, predicted label {prediction}")

file ./data/UCF_101_subset/test/BalanceBeam/v_BalanceBeam_g11_c02.avi True Label BalanceBeam, predicted label BalanceBeam
file ./data/UCF_101_subset/test/BalanceBeam/v_BalanceBeam_g11_c04.avi True Label BalanceBeam, predicted label BalanceBeam
file ./data/UCF_101_subset/test/BalanceBeam/v_BalanceBeam_g20_c01.avi True Label BalanceBeam, predicted label BalanceBeam
file ./data/UCF_101_subset/test/BalanceBeam/v_BalanceBeam_g20_c03.avi True Label BalanceBeam, predicted label BalanceBeam
file ./data/UCF_101_subset/test/BaseballPitch/v_BaseballPitch_g11_c02.avi True Label BaseballPitch, predicted label BaseballPitch
file ./data/UCF_101_subset/test/BaseballPitch/v_BaseballPitch_g11_c04.avi True Label BaseballPitch, predicted label BaseballPitch
file ./data/UCF_101_subset/test/BaseballPitch/v_BaseballPitch_g24_c02.avi True Label BaseballPitch, predicted label BaseballPitch
file ./data/UCF_101_subset/test/BaseballPitch/v_BaseballPitch_g24_c06.avi True Label BaseballPitch, predicted label Baseba

In [17]:
from sklearn.metrics import classification_report

In [18]:
report = classification_report(true_labels, predictions)
print(report)

                precision    recall  f1-score   support

ApplyEyeMakeup       1.00      1.00      1.00         6
 ApplyLipstick       1.00      1.00      1.00         4
       Archery       1.00      1.00      1.00         7
  BabyCrawling       1.00      1.00      1.00         9
   BalanceBeam       1.00      1.00      1.00         4
  BandMarching       1.00      1.00      1.00         9
 BaseballPitch       1.00      1.00      1.00        10
    Basketball       1.00      1.00      1.00        11
BasketballDunk       1.00      1.00      1.00         6
    BenchPress       1.00      1.00      1.00         9

      accuracy                           1.00        75
     macro avg       1.00      1.00      1.00        75
  weighted avg       1.00      1.00      1.00        75
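
To make the recall column of the report concrete, here is a small standard-library sketch that computes per-class recall from two label lists. The function name is hypothetical and the labels are toy values, not the notebook's results.

```python
from collections import Counter


def per_class_recall(true_labels, predictions):
    """Fraction of each class's samples that were predicted correctly."""
    support = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predictions) if t == p)
    return {label: correct[label] / n for label, n in support.items()}


# Toy labels for illustration only.
y_true = ["Archery", "Archery", "Basketball", "Basketball"]
y_pred = ["Archery", "Basketball", "Basketball", "Basketball"]
print(per_class_recall(y_true, y_pred))  # {'Archery': 0.5, 'Basketball': 1.0}
```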



In [19]:
file_name = "./static/v_BasketballDunk_g14_c04.mp4"
container = av.open(file_name)

In [24]:
moviepy.editor.ipython_display(container.name)

In [25]:
indices = sample_frame_indices(clip_len=10, frame_sample_rate=3,seg_len=container.streams.video[0].frames)
print(f"Processing file {file_name} number of Frames: {container.streams.video[0].frames}")  
video = read_video_pyav(container=container, indices=indices)
inputs = image_processor(list(video), return_tensors="pt")

Processing file ./static/v_BasketballDunk_g14_c04.mp4 number of Frames: 88


In [26]:
with torch.no_grad():
    outputs = fine_tune_model(**inputs)
    logits = outputs.logits

In [27]:
predicted_label = logits.argmax(-1).item()
prediction = fine_tune_model.config.id2label[str(predicted_label)]
prediction

'BasketballDunk'
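
`argmax` only returns the winning class. To see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below uses plain Python with made-up logits and a three-class `id2label` mapping; both are assumptions for illustration and not values produced by the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[str(i)], probs[i]) for i in ranked]


# Made-up logits for three classes, for illustration only.
example_logits = [0.2, 3.1, -1.0]
example_id2label = {"0": "Archery", "1": "BasketballDunk", "2": "BenchPress"}
for label, prob in top_k(example_logits, example_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the tensors from this notebook, `logits[0].tolist()` and `fine_tune_model.config.id2label` could be passed in place of the example values.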