# Classifying Violent/Non-Violent Actions
We will be using the [`Real Life Violence Situations`](https://paperswithcode.com/dataset/real-life-violence-situations-dataset) dataset found on Papers With Code. The dataset contains 1000 videos of violent actions and 1000 videos of non-violent actions. We will use it to train a video classification model that predicts whether a video is violent or non-violent.

If a video contains a violent action, the **goal** is to return the snippet of the video that was classified as violent.

## TODO
1. Finish the initial model by the end of <mark>Spring Break</mark> (**everyone**)
    a. Resolve training problem
2. ~~Finish uploading 394 videos to dataset (**Gueren**)~~

## Exploring the data
We will be using the RLVS (Real Life Violence Situations) dataset. The anomaly detection model will use the Avenue dataset.

In [1]:
!ls /datasets/anomaly-detection/Real\ Life\ Violence\ Dataset -R

ls: cannot access '/datasets/anomaly-detection/Real Life Violence Dataset': No such file or directory


The dataset contains 1000 violent and 1000 non-violent videos. We can view any of them with IPython's `display` function and `HTML` class. The function below displays a video so we can inspect any file that causes issues later on.

In [2]:
from IPython.display import display, HTML 
from base64 import b64encode

def display_video(path):
    # read the raw bytes and close the file handle right away
    with open(path, "rb") as video_file:
        mp4 = video_file.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    display(
        HTML(
        """
          <video width=400 controls>
                <source src="%s" type="video/mp4">
          </video>
        """ % data_url
        )
    )
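
The data-URL encoding that `display_video` relies on can be checked without a real video file. The sketch below is only an illustration: `to_data_url` and the placeholder bytes are hypothetical names and values introduced here, not part of the notebook.

```python
from base64 import b64decode, b64encode


def to_data_url(raw_bytes, mime="video/mp4"):
    """Embed raw bytes in a base64 data URL, as display_video does."""
    return f"data:{mime};base64," + b64encode(raw_bytes).decode("ascii")


# Placeholder bytes standing in for the contents of an .mp4 file.
fake_video_bytes = b"\x00\x00\x00\x18ftypmp42"
data_url = to_data_url(fake_video_bytes)

# Split the URL into its header and payload and decode the payload again.
header, payload = data_url.split(",", 1)
round_trip = b64decode(payload)
print(header)
print(round_trip == fake_video_bytes)
```

Because the whole file is embedded in the page, this approach is best suited to short clips like the ones in this dataset.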

**Example of a Non-Violent Action**

In [3]:
display_video('/datasets/anomaly-detection/Real Life Violence Dataset/Testing/NonViolence/NV_999.mp4')

FileNotFoundError: [Errno 2] No such file or directory: '/datasets/anomaly-detection/Real Life Violence Dataset/Testing/NonViolence/NV_999.mp4'

**Example of a Violent Action**

In [0]:
display_video('/datasets/anomaly-detection/Real Life Violence Dataset/Training/Violence/V_369.mp4')

## Begin Combining the Hugging Face Tutorial and Surveillance Videos App Here
Follow along with the tutorial [here](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/video_classification.ipynb). Just make sure you switch out the model for the one we are using.

## Model
We will fine-tune the [VideoMAE](https://huggingface.co/docs/transformers/main/model_doc/videomae) model, pretrained on the [Kinetics 400 dataset](https://www.deepmind.com/open-source/kinetics). The benefit of using a pretrained model is that it has already learned general-purpose representations of video content.

### How does VideoMAE work?
VideoMAE stands for Video Masked Autoencoder. A masked autoencoder is a scalable self-supervised learner for computer vision: it removes chunks of an image and, during pre-training, the model must reconstruct the raw pixel values of the removed chunks. VideoMAE applies the same idea to video and centers around the following pipeline:
1. Temporal downsampling
    - A video clip is randomly sampled from the original video and its frames are subsampled in time (only every few frames are kept), each frame containing $$W \times H \times 3$$ pixels.
2. Cube embedding
    - Each cube has size $$2 \times 16 \times 16$$, i.e. $$time\ (frames) \times height \times width$$. Thus, the cube embedding layer produces $${T \over 2} \times {H \over 16} \times {W \over 16}$$ 3D tokens, reducing both the spatial and the temporal dimensions of the input.
3. Tube masking with high ratios
    - Uses masking ratios between 90% and 95%, meaning that most of the cubes are hidden from the encoder. This helps mitigate information leakage and forces the model to learn more.
    - Tube masking means that a cube masked at one spatial position in one time step stays masked at that position in every other time step. This method reduces information leakage compared to the other masking methods. (View image below for example of tube masking)
4. Backbone
    - Uses a vision transformer (ViT) as its backbone.
        - ViT is a model originally designed for image classification. It splits an image into patches (tokens), maps each patch to a linear embedding, and uses self-attention to extract features from them. A multi-layer perceptron (MLP) head then uses these features to classify the image.
    - The ViT is applied to the unmasked cubes and extracts features from them. In pre-training, the model must reconstruct the pixels of the masked cubes from these extracted features. In testing, the extracted features are used to predict which class the video belongs to.
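
The token arithmetic in steps 2 and 3 can be sketched in plain Python. This is a minimal illustration, not VideoMAE's actual implementation: the helper names are hypothetical, and the 16-frame, 224 x 224 input size and the 0.9 masking ratio are assumed values chosen for the example.

```python
import random


def cube_token_grid(num_frames, height, width, tubelet=2, patch=16):
    """Token grid (time, height, width) produced by 2 x 16 x 16 cube embedding."""
    return (num_frames // tubelet, height // patch, width // patch)


def tube_mask(grid, mask_ratio=0.9, seed=0):
    """Mask the same spatial positions in every temporal slice (tube masking)."""
    t, h, w = grid
    num_spatial = h * w
    num_masked = int(mask_ratio * num_spatial)
    rng = random.Random(seed)
    masked_positions = set(rng.sample(range(num_spatial), num_masked))
    frame_mask = [pos in masked_positions for pos in range(num_spatial)]
    # Repeat the identical spatial mask along the time axis.
    return [list(frame_mask) for _ in range(t)]


grid = cube_token_grid(num_frames=16, height=224, width=224)
mask = tube_mask(grid, mask_ratio=0.9)

total_tokens = grid[0] * grid[1] * grid[2]
visible_tokens = sum(not masked for frame in mask for masked in frame)
print(f"token grid: {grid}, total: {total_tokens}, visible: {visible_tokens}")
```

Under these assumptions only about a tenth of the tokens reach the encoder, which is why pre-training with such a high masking ratio is comparatively cheap.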

<div style="text-align: center;"><b>Visual Demonstration</b></div>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/videomae_architecture.jpeg"/>


Another model we could use is the [X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip#transformers.XCLIPVisionModel) model.

- **Epoch**: number of complete passes through the dataset
- **Batch size**: number of samples processed before the model updates its weights

In [0]:
model_ckpt = "MCG-NJU/videomae-base" # pre-trained model from which to fine-tune
batch_size = 8 # batch size for training and evaluation

## Loading the dataset
We initialize the root path for the dataset with `pathlib`.

In [0]:
import pathlib

dataset_root_path = "/datasets/anomaly-detection/Real Life Violence Dataset"
dataset_root_path = pathlib.Path(dataset_root_path)

Next we set up `all_video_file_paths` to contain the location of all the videos.

In [0]:
all_video_file_paths = (
    list(dataset_root_path.glob("Training/*/*.mp4")) +
    list(dataset_root_path.glob("Validation/*/*.mp4")) +
    list(dataset_root_path.glob("Testing/*/*.mp4"))
)

all_video_file_paths[:5]

Then, we establish two dictionaries that will be helpful when initializing the model.

- `label2id`: maps the class names to integers.
- `id2label`: maps the integers to class names. 

In [0]:
# the class name is the folder that directly contains each video
class_label = sorted({path.parent.name for path in all_video_file_paths})
label2id = {label: i for i, label in enumerate(class_label)}
id2label = {i: label for label, i in label2id.items()}

print(f"Unique classes: {list(label2id.keys())}")

As we can see we have 2 unique classes, each class having 800 training videos.
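
The label mappings and the per-class counts can be checked on a handful of made-up paths. The sketch below assumes the `<split>/<class>/<file>.mp4` layout used above; the `/data/...` paths are hypothetical and only stand in for the real dataset.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical paths that mimic the <split>/<class>/<file>.mp4 layout.
example_paths = [
    PurePosixPath("/data/Training/Violence/V_1.mp4"),
    PurePosixPath("/data/Training/Violence/V_2.mp4"),
    PurePosixPath("/data/Training/NonViolence/NV_1.mp4"),
    PurePosixPath("/data/Testing/NonViolence/NV_999.mp4"),
]

# The class name is the folder that directly contains the video.
class_labels = sorted({p.parent.name for p in example_paths})
label2id = {label: i for i, label in enumerate(class_labels)}
id2label = {i: label for label, i in label2id.items()}

# Count videos per (split, class) pair to verify the balance of each split.
counts = Counter((p.parent.parent.name, p.parent.name) for p in example_paths)

print(label2id)
print(dict(counts))
```

Reading the class from `parent.name` keeps working if the dataset is moved to a directory at a different depth.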

## Loading the model
In the next cell, we initialize the `VideoMAE` model, where the encoder is initialized with the pre-trained parameters and the classification head is randomly initialized. We also initialize the feature extractor for the VideoMAE model, which will be used later.

In [0]:
from transformers import VideoMAEFeatureExtractor, VideoMAEForVideoClassification

feature_extractor = VideoMAEFeatureExtractor.from_pretrained(model_ckpt)
model = VideoMAEForVideoClassification.from_pretrained(
    model_ckpt,
    label2id = label2id,
    id2label = id2label,
    ignore_mismatched_sizes = True, # this is necessary for fine-tuning models
)

The warnings are telling us that we are throwing away the weights and bias of the `classifier` layer, which are used to classify videos in the pre-trained model. Since we are using different classes, we must discard those weights and train new ones.

## Constructing the datasets for training
For preprocessing we'll leverage the [PyTorch Video library](https://pytorchvideo.org/). The block below initializes the dependencies we need. 

In [0]:
import pytorchvideo.data

from pytorchvideo.transforms import (
    ApplyTransformToKey,
    Normalize,
    RandomShortSideScale,
    RemoveKey,
    ShortSideScale,
    UniformTemporalSubsample,
)

from torchvision.transforms import (
    Compose,
    Lambda,
    RandomCrop,
    RandomHorizontalFlip,
    Resize,
)

For the training dataset transformations, we use a combination of uniform temporal subsampling, pixel normalization, random cropping, and random horizontal flipping. 

For the validation and evaluation dataset transformations, we keep the transformation chain the same except for random cropping and horizontal flipping.

To learn more about the details of these transformations, check out the [official documentation of PyTorch Video](https://pytorchvideo.org). The following block follows the [official PyTorch Video example](https://pytorchvideo.org/docs/tutorial_classification#dataset).

In [0]:
import os

mean = feature_extractor.image_mean
std = feature_extractor.image_std
resize_to = feature_extractor.size['shortest_edge']

num_frames_to_sample = model.config.num_frames
sample_rate = 4
fps = 30
clip_duration = num_frames_to_sample * sample_rate / fps


# Training dataset transformations.
train_transform = Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=Compose(
                [
                    UniformTemporalSubsample(num_frames_to_sample),
                    Lambda(lambda x: x / 255.0),
                    Normalize(mean, std),
                    RandomShortSideScale(min_size=256, max_size=320),
                    RandomCrop(resize_to),
                    RandomHorizontalFlip(p=0.5),
                ]
            ),
        ),
    ]
)

# Validation and evaluation datasets' transformations.
val_transform = Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=Compose(
                [
                    UniformTemporalSubsample(num_frames_to_sample),
                    Lambda(lambda x: x / 255.0),
                    Normalize(mean, std),
                    Resize((resize_to, resize_to)),
                ]
            ),
        ),
    ]
)
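
The timing values in the cell above can be worked through by hand. The sketch below is a pure-Python approximation of evenly spaced frame selection, not PyTorch Video's own code; the helper names are hypothetical, and the value of 16 frames is an assumption standing in for `model.config.num_frames`.

```python
def clip_duration_seconds(num_frames, sample_rate, fps):
    """Length in seconds of the raw clip that is read before subsampling."""
    return num_frames * sample_rate / fps


def uniform_subsample_indices(total_frames, num_samples):
    """Evenly spaced frame indices from the first to the last frame."""
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    return [int(i * last / (num_samples - 1)) for i in range(num_samples)]


# 16 frames is an assumed value; sample_rate and fps come from the cell above.
duration = clip_duration_seconds(num_frames=16, sample_rate=4, fps=30)
frames_in_clip = round(duration * 30)  # raw frames covered by one clip
indices = uniform_subsample_indices(frames_in_clip, 16)

print(f"clip duration: {duration:.3f} s, raw frames: {frames_in_clip}")
print("kept frame indices:", indices)
```

With these numbers each training sample covers a little over two seconds of video, of which roughly every fourth frame is kept.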

We have to set up our own dataset declaration with [`pytorchvideo.data.labeled_video_dataset.LabeledVideoDataset`](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html#pytorchvideo.data.LabeledVideoDataset). This applies the transformations above to the datasets we provide it.

In [0]:
# Training dataset.
train_dataset = pytorchvideo.data.labeled_video_dataset(
    data_path=os.path.join(dataset_root_path, "Training"),
    clip_sampler=pytorchvideo.data.make_clip_sampler("random", clip_duration),
    decode_audio=False,
    transform=train_transform,
)

# Validation and evaluation datasets.
val_dataset = pytorchvideo.data.labeled_video_dataset(
    data_path=os.path.join(dataset_root_path, "Validation"),
    clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
    decode_audio=False,
    transform=val_transform,
)

test_dataset = pytorchvideo.data.labeled_video_dataset(
    data_path=os.path.join(dataset_root_path, "Testing"),
    clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
    decode_audio=False,
    transform=val_transform,
)

We can access the `num_videos` attribute to know the number of videos we have in each dataset.

In [0]:
train_dataset.num_videos, val_dataset.num_videos, test_dataset.num_videos

As you can see, our current structure for the dataset is:
- 80% training
- 10% validation
- 10% testing

Now let's take a look at one of the preprocessed videos.

In [0]:
sample_video = next(iter(train_dataset))
sample_video.keys()

Below, you can see that the shape of the video sample starts with 3. The shape always starts with 3 because each pixel is represented by three RGB channel values. Each channel value ranges from 0 to 255, which is why the transformations above divide by 255. The shape follows the format:
> (RGB, frames, height, width)

In [0]:
def investigate_video(sample_video):
    for k in sample_video:
        if k == "video":
            print(k, sample_video["video"].shape)
        else:
            print(k, sample_video[k])
    # look the label up by key instead of relying on the last loop variable
    print(f"Video label: {id2label[sample_video['label']]}")

investigate_video(sample_video)

Now we will create a way to visualize the specific frames in `sample_video`. The difference between the functions below and `display_video` is that `display_video` shows the whole video, while `create_gif` only shows the frames that are actually fed to the model.

In [0]:
import imageio
import numpy as np
from IPython.display import Image

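def unnormalize_img(img):
    # undo Normalize(mean, std), then rescale from [0, 1] back to [0, 255]
    img = (img * std) + mean
    # clip before casting so out-of-range values do not wrap around in uint8
    return (img * 255).clip(0, 255).astype("uint8")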
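
The normalize and un-normalize round trip can be verified on a single pixel. This is a minimal sketch: the helper names are hypothetical, the mean and std values are the common ImageNet statistics and are only assumed to match `feature_extractor.image_mean` and `feature_extractor.image_std`, and the sketch rounds where `unnormalize_img` truncates.

```python
# Assumed ImageNet statistics; the notebook reads these from the feature extractor.
example_mean = [0.485, 0.456, 0.406]
example_std = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean, std):
    """Scale a 0-255 RGB pixel to [0, 1], then standardize each channel."""
    return [((c / 255.0) - m) / s for c, m, s in zip(rgb, mean, std)]


def unnormalize_pixel(values, mean, std):
    """Invert normalize_pixel and clamp the result to the 0-255 range."""
    restored = []
    for v, m, s in zip(values, mean, std):
        channel = round((v * s + m) * 255)
        restored.append(min(255, max(0, channel)))
    return restored


original_pixel = [0, 128, 255]
normalized = normalize_pixel(original_pixel, example_mean, example_std)
recovered = unnormalize_pixel(normalized, example_mean, example_std)
print(normalized)
print(recovered)
```

Clamping before converting to an 8-bit value matters because small floating-point overshoots would otherwise wrap around.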

The function `unnormalize_img` reverses the transformations applied by
```
Compose([
    Lambda(lambda x: x / 255.0),
    Normalize(mean, std),
])
```

The function `create_gif` prepares a GIF from a video tensor. The video tensor is expected to have the following shape:
> (num_frames, num_channels, height, width).

In [0]:
def create_gif(video_tensor, filename="sample.gif"):
    frames = []
    for video_frame in video_tensor:
        frame_unnormalized = unnormalize_img(video_frame.permute(1,2,0).numpy())
        frames.append(frame_unnormalized)
    # passed to imageio as the display time of each frame, not of the whole GIF
    kargs = {"duration": (len(video_tensor)/60)}
    imageio.mimsave(filename, frames, "GIF", **kargs)
    return filename

The function `display_gif` prepares and displays a GIF from a video tensor.

In [0]:
def display_gif(video_tensor, gif_name="sample.gif"):
    video_tensor = video_tensor.permute(1, 0, 2, 3)
    gif_filename = create_gif(video_tensor, gif_name)
    return Image(filename=gif_filename)

In [0]:
video_tensor = sample_video["video"]
display_gif(video_tensor)

## Training the Model

Before training the model, we need to set up the arguments that will be passed to the `Trainer`.

In [0]:
!pip show transformers
from transformers import TrainingArguments, Trainer

# set up model name
model_name = model_ckpt.split("/")[-1]
new_model_name = f"{model_name}-finetuned-RealLifeViolenceSituations-subset"
num_epochs = 4

# set up a subset of arguments
args = TrainingArguments(
    new_model_name,
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
    max_steps=(train_dataset.num_videos // batch_size) * num_epochs,
)
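
The `max_steps` value can be sanity-checked by hand. It is set explicitly because the datasets built with PyTorch Video are iterable and do not report a length, so the `Trainer` cannot derive the number of steps from an epoch count on its own. The sketch below uses a hypothetical helper and assumes 1600 training videos (800 per class, as stated earlier).

```python
def compute_max_steps(num_videos, batch_size, num_epochs):
    """Total optimizer steps when each epoch sees every video once."""
    steps_per_epoch = num_videos // batch_size
    return steps_per_epoch * num_epochs


# 1600 training videos is an assumption based on 800 videos per class.
steps_per_epoch = 1600 // 8
max_steps = compute_max_steps(num_videos=1600, batch_size=8, num_epochs=4)
print(f"steps per epoch: {steps_per_epoch}, max_steps: {max_steps}")
```

Halving the batch size to save memory doubles the number of steps needed for the same number of epochs.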

Load the metric we will use to evaluate the model: accuracy.

In [0]:
import evaluate

metric = evaluate.load("accuracy")

Define a function to compute accuracy. This function will later be passed to the Trainer as an argument. 

In [0]:
# the compute_metrics function takes a Named Tuple as input:
# predictions, which are the logits of the model as Numpy arrays,
# and label_ids, which are the ground-truth labels as Numpy arrays.
def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions."""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
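
What `compute_metrics` calculates can be reproduced without any libraries. The sketch below is a plain-Python stand-in for the argmax and accuracy steps; the helper names and the example logits and labels are made up for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose largest logit matches the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four clips (column 0 and column 1 are the two classes).
example_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.5, 0.4]]
example_labels = [0, 1, 0, 0]

result = accuracy_from_logits(example_logits, example_labels)
print(result)
```

Three of the four made-up predictions match their labels, so the reported accuracy is 0.75.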

Define a function that batches examples together. It will be passed to the `Trainer` as an argument.
Each batch has two keys: `pixel_values` and `labels`.

In [0]:
import torch


def collate_fn(examples):
    """The collation function to be used by `Trainer` to prepare data batches."""
    # permute to (num_frames, num_channels, height, width)
    pixel_values = torch.stack(
        [example["video"].permute(1, 0, 2, 3) for example in examples]
    )
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}
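
The effect of the permute and stack steps is easiest to see on shapes alone. The sketch below tracks only shape tuples, with no tensors involved; the helper names are hypothetical and the 16-frame, 224 x 224 sample shape is an assumed example.

```python
def permute_shape(shape, order):
    """Shape that results from reordering the axes of a tensor."""
    return tuple(shape[i] for i in order)


def batched_shape(example_shapes):
    """Shape after stacking equally shaped examples along a new first axis."""
    first = example_shapes[0]
    if any(shape != first for shape in example_shapes):
        raise ValueError("all examples must share one shape to be stacked")
    return (len(example_shapes),) + first


# Assumed sample shape in (channels, frames, height, width) order.
sample_shape = (3, 16, 224, 224)
permuted = permute_shape(sample_shape, (1, 0, 2, 3))
batch = batched_shape([permuted] * 8)
print(permuted, batch)
```

The resulting batch is laid out as (batch, frames, channels, height, width), which is the order VideoMAE expects for `pixel_values`.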

Create the `Trainer` object, which wraps the model and runs training and evaluation.
The block below needs an access token from Hugging Face to execute. Use your own token and keep it out of the notebook.

Some packages must be installed in order to log in to Hugging Face and provide the token.

In [0]:

!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!apt-get install git-lfs
!git-lfs install
!pip install huggingface_hub
!pip install ipywidgets
# log in to Hugging Face with your own access token (placeholder below)
!python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('<YOUR_HF_TOKEN>')"

from huggingface_hub import notebook_login
notebook_login()

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
)

Train the model.

**Note:** I got this error while training:

_A kernel interruption error usually occurs for one of the following reasons:
The kernel process runs out of RAM. In this case, click here to upgrade your machine.
There is a bug in one of the libraries (e.g., version conflicts, missing binary dependency, etc_

I think it is more likely that we ran out of RAM, because I noticed RAM usage climbing all the way to 5 GB right before the error appeared. Our basic plan has only 5 GB.

In [0]:
train_results = trainer.train()

***The code from here on could not be executed because the training step above fails.***

Evaluate the model with the test dataset


In [0]:
trainer.evaluate(test_dataset)

Save the model from the trainer, then save the evaluation results for the test dataset.


In [0]:
trainer.save_model()
test_results = trainer.evaluate(test_dataset)
trainer.log_metrics("test", test_results)
trainer.save_metrics("test", test_results)
trainer.save_state()

# upload to hub
trainer.push_to_hub()

## Inference

After saving our trained model, we load it again to classify videos from our dataset.

In [0]:
trained_model = VideoMAEForVideoClassification.from_pretrained(new_model_name)

Display all features of the first video in the dataset

In [0]:
sample_test_video = next(iter(test_dataset))
investigate_video(sample_test_video)

Define a function that accepts a model and a video and uses the model to classify the video.

The logit values for both violence and non-violence will be returned. Later, we will only use the highest one.

In [0]:
def run_inference(model, video):
    """Utility to run inference given a model and test video.
    
    The video is assumed to be preprocessed already.
    """
    # (num_frames, num_channels, height, width)
    permuted_video = video.permute(1, 0, 2, 3)

    # add a batch dimension; labels are omitted because only the logits are needed
    inputs = {"pixel_values": permuted_video.unsqueeze(0)}
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model = model.to(device)

    # forward pass
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    return logits

Call the function above, passing the trained model and the sample video obtained earlier.

In [0]:
logits = run_inference(trained_model, sample_test_video["video"])

Display a GIF of the sample test video.

In [0]:
display_gif(sample_test_video["video"])

The `run_inference` function returns one logit per class. We do not use the logit value itself but the index of the largest logit along axis -1 (the last dimension). This index is converted into a label and displayed.

In [0]:
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", trained_model.config.id2label[predicted_class_idx])
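
If a confidence value is wanted alongside the predicted label, the logits can be turned into probabilities with a softmax. The sketch below is a plain-Python illustration; the logit values are made up, and the `example_id2label` mapping is an assumption that mirrors the two class folders of the dataset.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed mapping and made-up logits for a single clip.
example_id2label = {0: "NonViolence", 1: "Violence"}
example_logits = [-0.8, 1.4]

probs = softmax(example_logits)
best = max(range(len(probs)), key=lambda i: probs[i])
print(example_id2label[best], round(probs[best], 3))
```

The softmax does not change which class wins; it only rescales the logits so that a threshold can be applied to them.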
