### Tech Excellence Advanced Data Science - Generative AI Video Classification Project

##### Project Authors: Tim Tieng, Afia Owusu-Forfie

**Objective**: Develop a model to classify video content into categories such as sports, news, movies, etc., and enhance this classification by generating descriptive captions or summaries that provide additional context about the content. This can be particularly useful for content curation platforms, accessibility applications (e.g., providing descriptions for the hearing impaired), or educational tools where supplementary information enhances learning.

**Data**: The public Kinetics-400 (K400) dataset, a large collection of labeled video clips suitable for training video classification models.

In [None]:
# Import Packages for project

# Standard Libraries
import csv
import cv2
import imageio
from IPython.display import Image
import glob
import numpy as np
import os
import pandas as pd
import re
import utils

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Algorithms, Modeling and Data Pre-processing
import feature_engine
from feature_engine.encoding import OrdinalEncoder
from feature_engine.transformation import YeoJohnsonTransformer
from scipy.stats import anderson, chi2_contingency
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,f1_score,precision_score, roc_auc_score,recall_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Deep Learning
import keras
from keras import layers
from keras.layers import RandomFlip, RandomRotation, Rescaling, BatchNormalization, Conv2D, MaxPooling2D, Dense, Input
from keras.models import Model, Sequential
from keras.optimizers import Adam, SGD
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import callbacks
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Model Optimization and Hyperparameter Tuning
import hyperopt
from hyperopt import STATUS_OK, Trials, fmin, tpe, hp
import mlflow

import tensorboard

In [None]:
# Obtain the Data
filepath = '../data/K400/video_annotations.csv'
# pd.read_csv already returns a DataFrame
k400_df = pd.read_csv(filepath)

k400_df.info(memory_usage='deep')

In [None]:
# Initial inspection of the first rows of the dataframe
k400_df.head()

In [None]:
# Check for null values/percentage of null values:

k400_df.isna().mean()

In [None]:
k400_df.isna().sum()

### Observations 

1. No null values
2. Over 240k Observations
3. 6 Attributes of string/int datatypes

In [None]:
# Count unique values per column (a quick check for duplicates)
num_unique = k400_df.nunique()
num_unique

### Observations

1. There are 400 unique labels
2. There are about 20K youtube_id values, but only about 850 video files
3. Each video is only 10 seconds long, as given by the difference between the time_start and time_end values

**Next Steps**: To reduce the dataset to the videos we actually have, I need to create a function that maps each video file to a youtube_id value in the video_annotations.csv file, and build a new dataframe containing only the matches. The names of the video files also need cleaning.

In [None]:
youtube_id_values = k400_df['youtube_id']
print(f"Total Youtube ID Values in Dataset: {youtube_id_values.count()}")

In [None]:
# Get the amount of unique youtube_id
number_unique_id = youtube_id_values.nunique()
print(f"Unique Youtube ID Values: {number_unique_id}")

In [None]:
# Check for unique values
unique_youtube_id = youtube_id_values.unique()
unique_youtube_id

In [None]:
# Get the number of video files we are working with
def count_video_files(directory):
    """
    Purpose - to get a video file count within a given directory
    Arguments - directory variable that holds the filepath to a video directory
    Returns - video_count of type integer
    
    """
    # Set the allowed video file extensions
    video_extensions = ['.mp4', '.avi', '.mov', '.mkv', '.wmv', '.flv']

    # Initialize the count
    video_count = 0

    # Iterate through all files in the directory
    for file_path in glob.glob(os.path.join(directory, '*')):
        # Check if the file has a video file extension
        if os.path.isfile(file_path) and any(file_path.lower().endswith(ext) for ext in video_extensions):
            video_count += 1

    return video_count


In [None]:
# Test functionality and return video count

# Provide the directory path to count video files
directory_path = '../data/K400/videos'

# Call the function to count video files
num_videos = count_video_files(directory_path)
print(f'Total number of video files: {num_videos} videos present')

### Video Observations

1. The youtube_id values in the video_annotations.csv file appear to match the leading part of the video file names.
2. The video file names carry a timestamp suffix that records which 10-second segment of the source video was captured.

**Next Steps**: To load the local video data correctly, I need to use regular expressions (or similar string processing) to rename the video files so that the timestamps are excluded.
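
One way to separate the ID from the timestamps is sketched below. It assumes the files follow a `<youtube_id>_<start>_<end>.<ext>` pattern with six-digit, zero-padded timestamps. That pattern, the helper name `split_video_filename`, and the sample filenames are illustrative assumptions, not something confirmed by the files on disk.

```python
import re

# Assumed pattern: <youtube_id>_<6-digit start>_<6-digit end>.<extension>
FILENAME_PATTERN = re.compile(
    r"^(?P<youtube_id>.+)_(?P<start>\d{6})_(?P<end>\d{6})(?P<ext>\.\w+)$"
)


def split_video_filename(filename):
    """Return (youtube_id, start, end, ext), or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return (
        match.group("youtube_id"),
        int(match.group("start")),
        int(match.group("end")),
        match.group("ext"),
    )


# Hypothetical filenames, including an ID that itself contains an underscore
print(split_video_filename("a1B2_c3D4e5_000010_000020.mp4"))
print(split_video_filename("notes.txt"))
```

Anchoring on the two fixed-width timestamps at the end of the name keeps underscores and digits inside the ID itself intact.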

## Exploratory Data Analysis: Labels

For this project, we will explore which labels are present in the Kinetics-400 dataset.

In [None]:
k400_labels = k400_df['label']
k400_labels.tail(10)

In [None]:
unique_labels = k400_labels.unique()
unique_labels

### Visualize the top 20 labels in the k400 dataset

In [None]:
# Count occurrences of each label
label_counts = k400_labels.value_counts()

# Select the top 20 labels
top_20_labels = label_counts.head(20)

# Create a bar plot for the top 20 labels using Seaborn
plt.figure(figsize=(12, 8))
sns.barplot(y=top_20_labels.index, x=top_20_labels.values, palette='viridis')

# Customize plot
plt.title('Top 20 Most Frequent Labels')
plt.xlabel('Count')
plt.ylabel('Label')

# Show plot
plt.show()

### Observations

The most frequent label in the dataset is 'abseiling', with a count close to 50. With 400 labels in total, even this most frequent class accounts for only a small share of the rows.

The activities can be categorized into several groups:

**Musical activities**: These include playing instruments like the violin, ukulele, trumpet, trombone, saxophone, recorder, piano, and organ. We can see that musical activities feature prominently in the dataset, indicating a possible focus on musical performance videos.

**Sports and physical activities**: This group includes pole vault, playing tennis, squash or racquetball, and kickball, among others. These activities are likely to involve dynamic movement, which can be useful for training algorithms to recognize physical actions.

**Recreational games**: Playing Monopoly is included, which is an indoor recreational game. This suggests that there may be other labels in the dataset that represent indoor activities where movement is minimal.

### Data Preprocessing - Rename video file names for easier loading

Purpose - This step is required in order to extract features and data from the raw video files from the Kinetic Dataset. This will help later on when we split our data to feed into our future model.

In [None]:
def remove_timestamp(filename):
    """
    Purpose: Remove the timestamp parts from the name of a local video file
    Arguments: filename
    Returns: Cleaned filename
    """
    # Split the filename by underscores
    parts = filename.split('_')

    # Filter out parts that consist only of digits. The last part still carries
    # the file extension (e.g. '<end_time>.mp4'), so it is not filtered out.
    cleaned_parts = [part for part in parts if not part.isdigit()]

    # Join the cleaned parts with underscores to form the new filename
    cleaned_filename = '_'.join(cleaned_parts)

    return cleaned_filename


In [None]:
def rename_files(directory):
    """
    Purpose: To rename all the local video files in our directory for future loading 
    Arguments: Filepath to the video directory
    Returns: None

    Other Functions: Calls the remove_timestamp()
    """
    # Iterate through all files in the directory
    for filename in os.listdir(directory):
        # Check if the file is a regular file (not a directory)
        if os.path.isfile(os.path.join(directory, filename)):
            # Remove timestamp from the filename
            new_filename = remove_timestamp(filename)
            # Rename the file if the filename has changed
            if new_filename != filename:
                os.rename(os.path.join(directory, filename),
                          os.path.join(directory, new_filename))

In [None]:
# Test
video_directory = "../data/K400/videos"

rename_files(video_directory)

In [None]:
video_directory = '../data/K400/videos'

# Iterate through each YouTube ID
for youtube_id in youtube_id_values:
    # Find the corresponding video file in the directory
    for filename in os.listdir(video_directory):
        if youtube_id in filename:
            # Extract the file extension
            file_extension = os.path.splitext(filename)[1]

            # Construct the new file name without the timestamp
            new_filename = youtube_id + file_extension

            # Construct the full paths for old and new files
            old_filepath = os.path.join(video_directory, filename)
            new_filepath = os.path.join(video_directory, new_filename)

            # Rename the file
            os.rename(old_filepath, new_filepath)
            print(f'Renamed {filename} to {new_filename}')
            break

### Observations

Removed the start_time portion of the timestamp, but left the end_time timestamp in the video file name
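
This behaviour can be reproduced on a made-up filename. The sketch below copies the splitting logic from `remove_timestamp()` and contrasts it with a variant that sets the extension aside first. The variant `remove_timestamp_keep_ext` and the sample filename are hypothetical and only illustrate one possible fix.

```python
import os


def remove_timestamp(filename):
    # Same logic as the notebook's function: drop purely numeric parts
    parts = filename.split("_")
    return "_".join(part for part in parts if not part.isdigit())


def remove_timestamp_keep_ext(filename):
    # Hypothetical variant: split off the extension first so that the last
    # timestamp becomes a purely numeric part and is filtered as well
    stem, ext = os.path.splitext(filename)
    parts = stem.split("_")
    return "_".join(part for part in parts if not part.isdigit()) + ext


name = "abcXYZ_000010_000020.mp4"  # made-up example
print(remove_timestamp(name))           # end timestamp stays: "000020.mp4" is not all digits
print(remove_timestamp_keep_ext(name))  # both timestamps are removed
```

Either variant would still strip an all-digit segment that happens to be part of a YouTube ID, so matching on the fixed timestamp positions is the safer route.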

### Define Hyperparameters

In [None]:
IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 5

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048

### Data Preparation

In [None]:
split_percent = .30

# Split the k400_df into test and train df

# Split the dataframe into train and test using pd.sample()
test_df = k400_df.sample(frac=split_percent, random_state=42)
train_df = k400_df.drop(test_df.index)

# Reset the index of the new dataframes
test_df.reset_index(drop=True, inplace=True)
train_df.reset_index(drop=True, inplace=True)

In [None]:
len(train_df)

In [None]:
len(test_df)

### Notes

train_df has 13934 Rows with 6 Attributes

test_df has 5972 rows
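
These row counts are consistent with the 30% split. A quick check, assuming `DataFrame.sample(frac=...)` rounds `frac * len(df)` to the nearest integer and that the two counts above add up to the full dataframe:

```python
train_rows = 13934
test_rows = 5972
split_percent = 0.30

# Total rows implied by the two counts
total_rows = train_rows + test_rows

# Expected test size under the rounding assumption stated above
expected_test_rows = round(split_percent * total_rows)

print(total_rows)                       # 19906
print(expected_test_rows)               # 5972
print(total_rows - expected_test_rows)  # 13934
```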

### Data Preprocessing - Video Data

To help with video preprocessing, I will create several functions, each handling a specific step in the process.

**crop_center_square()**: I begin by cropping each video frame to a centered square using my crop_center_square(frame) function to ensure uniformity in size and aspect ratio.

**load_video()**: I then load and preprocess the videos with my load_video(path) function, which resizes the frames to a consistent dimension and assembles them into a numpy array, capping the number at max_frames if needed.

**prepare_all_videos()**: Finally, I extract features from the frames using a pre-trained InceptionV3 model in my prepare_all_videos(df, root_dir) function and create masks to accommodate sequences of varying lengths, which gives me a set of arrays ready to feed into my transformer model.
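
The cropping step comes down to simple index arithmetic. The sketch below mirrors the start/end calculation used by `crop_center_square()` without needing an actual frame. The helper name `center_crop_bounds` and the frame sizes are made up for illustration.

```python
def center_crop_bounds(height, width):
    """Return (start_y, end_y, start_x, end_x) of the largest centered square."""
    min_dim = min(height, width)
    start_x = (width // 2) - (min_dim // 2)
    start_y = (height // 2) - (min_dim // 2)
    return start_y, start_y + min_dim, start_x, start_x + min_dim


# Hypothetical 720x1280 (landscape) frame: columns are trimmed on both sides
print(center_crop_bounds(720, 1280))  # (0, 720, 280, 1000)

# Hypothetical 1280x720 (portrait) frame: rows are trimmed top and bottom
print(center_crop_bounds(1280, 720))  # (280, 1000, 0, 720)
```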

In [None]:
def crop_center_square(frame):
    """
    Purpose: Crop a given video frame to a centered square. This creates uniformity across all the video frames available.
    Arguments:
        - frame: a numpy array representing a video frame. The array should be at least 2-D
    Returns: a numpy array holding the cropped square of the input frame
    """
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]

In [None]:
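def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    """
    Purpose: Read a video file and return its frames as a numpy array.
    Arguments:
        - path: filepath to the video
        - max_frames: maximum number of frames to read (0 reads all frames)
        - resize: (width, height) that each cropped frame is resized to
    Returns: a numpy array of RGB frames
    """
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            # OpenCV reads frames as BGR; reorder the channels to RGB
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)

            # Stop early once max_frames frames have been collected
            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)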

In [None]:
def build_feature_extractor():
    """
    Purpose: Build a feature extractor model using InceptionV3 design.
    Arguments: None
    Returns: a Keras model that takes an image shape of our hyperparameters defined earlier as input and outputs a flattened feature vector


    Notes: This function initializes the InceptionV3 model with pre-trained ImageNet weights,
    excluding the top (final fully connected) layers. It modifies the network to use
    average pooling at the end and sets the expected input shape for the images. The
    output is a keras Model that takes an image input and outputs the corresponding
    feature vector. This model can be used to extract features from frames of a video
    for further analysis or processing.
    """
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()


In [None]:
label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["label"])
)
print(label_processor.get_vocabulary())


In [None]:
def prepare_all_videos(df, root_dir):
    """
    Purpose: to prepare local video data and labels for training in a sequence model (Transformers)
    Parameters:
        1. df: a pandas dataframe containing the columns youtube_id and label (From the schema of K400 Dataset)
        2. root_dir: The filepath to where the k400 datasets are stored locally
    Returns: A tuple containing two elements:  The first element is another tuple with
      two numpy arrays: frame_features of shape (num_samples, MAX_SEQ_LENGTH, NUM_FEATURES)
      and frame_masks of shape (num_samples, MAX_SEQ_LENGTH), both of which are prepared
      for the sequence model. The second element is a numpy array of processed labels.
    """
    num_samples = len(df)
    video_paths = df["youtube_id"].values.tolist()
    labels = df["label"].values
    labels = keras.ops.convert_to_numpy(label_processor(labels[..., None]))

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain a bunch of booleans denoting if a timestep is
    # masked with padding or not.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(
            shape=(
                1,
                MAX_SEQ_LENGTH,
            ),
            dtype="bool",
        )
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :], verbose=0,
                )
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

    return (frame_features, frame_masks), labels

In [None]:
# Test functionality
train_data, train_labels = prepare_all_videos(train_df, "train")
test_data, test_labels = prepare_all_videos(test_df, "test")

In [None]:
print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")

### Output Explanation:

**Frame Features (13934, 20, 2048)**:

1. 13934 represents the number of video samples in the training set.
2. 20 is the fixed number of frames kept per video, which is the maximum sequence length (MAX_SEQ_LENGTH). This is the temporal dimension representing time steps in the video.
3. 2048 is the number of features extracted from each frame by the feature extraction model (InceptionV3). These features are a compressed representation of the frame's visual content.

**Frame Masks (13934, 20)**:

1. This array is two-dimensional.
2. 13934 corresponds to the number of video samples, aligning with the frame_features dimension.
3. 20 represents the same temporal sequence length, with each element being a boolean (True/False or 1/0). The mask indicates which time steps in the sequence are actual data and which are padding. A value of 1 (True) denotes actual data (not masked), and a value of 0 (False) denotes padding (masked).
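
The relationship between padding and masks can be illustrated on a single made-up video. This is a minimal sketch, assuming the per-frame features have already been extracted. The helper `pad_with_mask` and the 12-frame example are hypothetical and not part of the notebook's pipeline.

```python
import numpy as np

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048


def pad_with_mask(video_features):
    """Pad or truncate (num_frames, NUM_FEATURES) features to MAX_SEQ_LENGTH."""
    length = min(MAX_SEQ_LENGTH, video_features.shape[0])
    padded = np.zeros((MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")
    mask = np.zeros((MAX_SEQ_LENGTH,), dtype="bool")
    padded[:length] = video_features[:length]
    mask[:length] = True  # True = real frame, False = padding
    return padded, mask


# Made-up feature matrix for a 12-frame video
short_video = np.ones((12, NUM_FEATURES), dtype="float32")
padded, mask = pad_with_mask(short_video)
print(padded.shape, mask.shape)  # (20, 2048) (20,)
print(int(mask.sum()))           # 12 real frames, 8 padded positions
```

Summing the mask along the time axis gives the number of real frames per video, which is a cheap way to spot videos that failed to load (an all-False mask).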


**Next Steps**:

1. The frame features will be fed into a custom video classification model built with Keras, with the goal of learning from the visual content.

### Designing Custom Transformer Model for Video Classification

In this step, we will create a custom transformer model designed for video classification. To accomplish this, we will experiment with the architecture of the layers to arrive at a working model.

### Create callbacks for our custom model:

For this model, we will include:

1. Tensorboard callback for experiment tracking.
2. EarlyStopping callback to stop training when the val_loss metric has stopped improving
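
The patience mechanism behind early stopping can be sketched in plain Python. This is a simplified imitation of the idea and not Keras's implementation. The function `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember the new best and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: stop once `patience` epochs have passed
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses
print(early_stopping_epoch([1.0, 0.8, 0.85, 0.9, 0.82]))  # 4
print(early_stopping_epoch([1.0, 0.9, 0.8]))               # None
```

Under this logic, with `EPOCHS = 5` and a patience of 3, training can be cut short after the fourth epoch at the earliest.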

In [None]:
# Create the log directory for tensorboard logs for our tensorboard callback
import datetime as dt

log_dir = "../Logs/fit/" + dt.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, write_images=True)

In [None]:
earlystopping_callback = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, verbose=1)

In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim
        )
        self.sequence_length = sequence_length
        self.output_dim = output_dim

    def build(self, input_shape):
        self.position_embeddings.build(input_shape)

    def call(self, inputs):
        # The inputs are of shape: `(batch_size, frames, num_features)`
        inputs = keras.ops.cast(inputs, self.compute_dtype)
        length = keras.ops.shape(inputs)[1]
        positions = keras.ops.arange(start=0, stop=length, step=1)
        embedded_positions = self.position_embeddings(positions)
        return inputs + embedded_positions

### PositionalEmbedding Class Explained

This code defines a custom Keras layer. Its purpose is to add positional information to the input, which is essential for Transformers: as the Keras documentation notes, the self-attention layers that form the basic blocks of a Transformer are order-agnostic. Since videos are ordered sequences of frames, our Transformer model needs to take order information into account. We do this via positional encoding: the positions of the frames within a video are embedded with an embedding layer, and the resulting vectors are added to the frame features.
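
The effect of adding position vectors can be shown with plain NumPy. In this sketch a random table stands in for the learned embedding weights, and the sizes are deliberately small. Both are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_frames, num_features = 2, 20, 8  # small made-up sizes

# Stand-in for the learned embedding table: one vector per frame position
position_table = rng.normal(size=(num_frames, num_features))

# The same frame feature vector repeated at every time step
frame = rng.normal(size=(num_features,))
inputs = np.tile(frame, (batch_size, num_frames, 1))

# Broadcasting adds the (num_frames, num_features) table to every sample
outputs = inputs + position_table

print(outputs.shape)                              # (2, 20, 8)
print(np.allclose(inputs[0, 0], inputs[0, 1]))    # True: identical before
print(np.allclose(outputs[0, 0], outputs[0, 1]))  # False: distinguishable after
```

Two identical frames at different time steps become distinguishable once their position vectors are added, which is what gives the order-agnostic attention layers access to frame order.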

In [None]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim, dropout=0.3
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(dense_dim, activation=keras.activations.gelu),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

### Notes

The above classes will be instantiated within our get_compiled_model() below

In [None]:
def get_compiled_model(shape):
    """
    Purpose: Builds and compiles a Transformer-based model for video classification.

    This function constructs a Keras model with a Transformer encoder structure, 
    including positional embeddings for sequence data, a Transformer encoder layer, 
    and classification layers. The model takes an input of a specified shape and 
    outputs class probabilities for video classification. The model is compiled with 
    the Adam optimizer, sparse categorical cross-entropy loss, and tracks accuracy.

    Parameters:
    - shape (tuple): The shape of the input data, excluding the batch size.
      Currently unused: the input shape is fixed to (MAX_SEQ_LENGTH, NUM_FEATURES)
      so that it matches the extracted frame features.

    Returns:
    - keras.Model: The compiled Keras model ready for training.
    """
    sequence_length = MAX_SEQ_LENGTH
    embed_dim = NUM_FEATURES
    dense_dim = 4
    num_heads = 1
    classes = len(label_processor.get_vocabulary())

    inputs = keras.Input(shape=(MAX_SEQ_LENGTH, NUM_FEATURES))
    # Debugging aid
    print("Shape before Positional Embedding:", inputs.shape)
    x = PositionalEmbedding(
        sequence_length, embed_dim, name="frame_position_embedding"
    )(inputs)
    x = TransformerEncoder(embed_dim, dense_dim, num_heads, name="transformer_layer")(x)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)

    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

In [None]:
def run_experiment():
    """
    Purpose: Execute the training experiment for the video classification model.

    This function handles the workflow of training a video classification model.
    It sets up a checkpointing system to save the best model weights, compiles the model,
    fits the model on the training data with validation, and evaluates the model on test data.
    It prints out the test accuracy and returns the trained model.

    Returns:
    - keras.Model: The trained Keras model after loading the best weights.

    Notes:
    - This function assumes that `get_compiled_model`, `train_data`, `train_labels`,
      `test_data`, and `test_labels` are available in the scope. `train_data` and
      `test_data` are the `(frame_features, frame_masks)` tuples returned by
      `prepare_all_videos`; only the frame features are fed to the model.
    - `EPOCHS` is a constant that defines how many epochs the model will be trained for.
    - The model's weights are saved to a temporary file path which is hardcoded in the function.

    """

    filepath = "/tmp/video_classifier.weights.h5" #Todo Update to video path
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath, save_weights_only=True, save_best_only=True, verbose=1
    )

    # Unpack the frame features; the model takes a single input tensor
    train_features = train_data[0]
    test_features = test_data[0]

    model = get_compiled_model(train_features.shape[1:])
    history = model.fit(
        train_features,
        train_labels,
        validation_split=0.15,
        epochs=EPOCHS,
        callbacks=[checkpoint, tensorboard_callback, earlystopping_callback],
    )

    model.load_weights(filepath)
    _, accuracy = model.evaluate(test_features, test_labels)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return model

In [None]:
# Hyperparameters
IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 5

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048

In [None]:
# Build the model and inspect its architecture

model = get_compiled_model((MAX_SEQ_LENGTH, NUM_FEATURES))
model.summary()

### Model Summary Notes:

**Input Layer**: The model accepts input of shape (None, 20, 2048), where "None" can be any batch size, "20" is the sequence length, and "2048" is the feature dimension per sequence element.

**Positional Embedding Layer**: This layer adds positional information to the input, outputs the same shape (None, 20, 2048), and has 40,960 parameters (20 positions × 2048 features), so the model learns a unique embedding for each position.

**Transformer Layer**: The Transformer encoder layer maintains the shape (None, 20, 2048), with no change in sequence length or feature dimensionality, and has 16,812,036 parameters. This layer captures interactions between different positions in the sequence.

**Global Max Pooling Layer**: Following the Transformer layer, a GlobalMaxPooling1D layer condenses the information across the sequence length from (None, 20, 2048) to (None, 2048), taking the maximum value over the sequence for each feature.

**Dropout Layer**: A dropout layer with no change in shape (None, 2048) is used to *prevent overfitting* during training by randomly setting a portion of input units to 0 at each update during training time.

**Dense Layer**: The final output layer is a dense layer with 400 units and a softmax activation, which allows the model to classify into 400 categories. It contains 819,600 parameters (2048 × 400 weights plus 400 biases).

**Total Parameters**: The model has a total of 17,672,596 parameters, all of which are trainable. This is a large model with significant capacity for learning complex patterns in the data.

**No Non-trainable Parameters**: There are 0 non-trainable parameters, which means all the parameters in the model are being updated during training.
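
The parameter counts can be reproduced by hand from the hyperparameters. The sketch below assumes the default Keras configuration for each layer (biases enabled, one gamma and one beta vector per LayerNormalization). The variable names are chosen here for illustration.

```python
seq_len, embed_dim, dense_dim, num_heads, classes = 20, 2048, 4, 1, 400
key_dim = embed_dim  # as configured in TransformerEncoder

# Positional embedding: one embed_dim vector per position
pos_embedding = seq_len * embed_dim

# Multi-head attention: query, key and value projections plus output projection
qkv = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
attn_out = num_heads * key_dim * embed_dim + embed_dim

# Feed-forward block: Dense(dense_dim) followed by Dense(embed_dim)
feed_forward = (embed_dim * dense_dim + dense_dim) + (dense_dim * embed_dim + embed_dim)

# Two LayerNormalization layers, each with gamma and beta
layer_norms = 2 * (2 * embed_dim)

transformer = qkv + attn_out + feed_forward + layer_norms

# Output layer: weights plus biases
dense_out = embed_dim * classes + classes

total = pos_embedding + transformer + dense_out
print(pos_embedding, transformer, dense_out, total)
```

The sum reproduces the reported total of 17,672,596, with 40,960 parameters in the positional embedding, 16,812,036 in the Transformer layer, and 819,600 in the output layer.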

In [None]:
trained_model = run_experiment()