<a href="https://colab.research.google.com/github/SaranBo/SELAB/blob/main/CSM_Video_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, LSTM, GRU
from keras.applications import ResNet50
import cv2
import numpy as np


We shall use the function extract_frames to take video files as input, extract frames from the video at a specific interval (e.g., one frame per second).
and finally resize frames to a uniform dimension suitable for the neural network.

In [2]:
def extract_frames(video_path, interval=1):
    """
    Extracts frames from a video at a specified interval.

    Args:
    - video_path (str): The path to the video file.
    - interval (int): Interval in seconds at which to extract frames.

    Returns:
    - List of frames (as numpy arrays).
    """

    #input
    cap = cv2.VideoCapture(video_path)

    if not cap.isOpened():
        print("Error: Could not open video.")
        return []

    frames = []
    fps = int(cap.get(cv2.CAP_PROP_FPS))  # Fps of video
    frame_count = 0

    while True:
        ret, frame = cap.read()

        # if no frame break loop
        if not ret:
            break

        # extract frame if at interval specified
        if frame_count % (fps * interval) == 0:
            frames.append(frame)

        frame_count += 1

    # release video capture object
    cap.release()

    return frames


Now we shall define a function called preprocess_frames which shall

Iterates through each frame in the given list of frames.

Resizes each frame to the specified target_size. Here, cv2.INTER_AREA interpolation is used, which is effective for image shrinking.

Normalizes the pixel values of each frame to the range [0, 1] by dividing by 255.0. This normalization is a common practice when working with neural
networks, as it helps to stabilize the training process.

Appends the processed frame to a new list, which is then converted to a numpy array before being returned.


In [4]:
import cv2
import numpy as np

def preprocess_frames(frames, target_size=(224, 224)):
    """
    Preprocesses video frames by resizing and normalizing them.

    Args:
    - frames (list): A list of frames (as numpy arrays).
    - target_size (tuple): The target size for resizing frames, default is (224, 224).

    Returns:
    - Preprocessed frames as a numpy array.
    """

    preprocessed_frames = []

    for frame in frames:
        # Resize frame
        resized_frame = cv2.resize(frame, target_size, interpolation=cv2.INTER_AREA)

        # Normalize pixel values to be between 0 and 1
        normalized_frame = resized_frame / 255.0

        preprocessed_frames.append(normalized_frame)

    # Convert list of frames to a numpy array
    preprocessed_frames = np.array(preprocessed_frames)

    return preprocessed_frames


The build_rcnn_feature_extractor function constructs a Recurrent Convolutional Neural Network (RCNN) for video processing by combining a pre-trained Convolutional Neural Network (CNN) with Recurrent Neural Network (RNN) layers. It starts with ResNet50, a widely-used CNN model pre-trained on ImageNet, for extracting spatial features from video frames. These features are then fed into RNN layers, encapsulated within a TimeDistributed wrapper, allowing the model to process a sequence of frames and capture temporal dynamics. This hybrid approach of spatial and temporal feature extraction makes it ideal for tasks like video classification or summarization, and it can be seamlessly integrated and executed in a Google Colab environment, leveraging TensorFlow and Keras libraries

In [7]:
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.models import Model

def build_feature_extractor():
    """
    Builds a feature extractor using a pre-trained CNN (ResNet50).

    Returns:
    - A Keras model that takes an image as input and outputs feature vectors.
    """

    # Load the pre-trained ResNet50 model
    base_model = ResNet50(weights='imagenet', include_top=False)

    # Create a custom model that outputs features from the pre-trained model
    # We use the output of the last convolutional layer
    output = base_model.layers[-1].output

    # Construct the new model
    feature_extractor = Model(inputs=base_model.input, outputs=output)

    return feature_extractor


In [9]:
from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_sequence_model(input_shape):
    """
    Builds a sequence model using RNN (LSTM).

    Args:
    - input_shape (tuple): The shape of the input features (output of CNN).

    Returns:
    - A Keras model designed for sequence processing.
    """

    model = Sequential()

    # Adding an LSTM layer
    # You can tweak the number of units and other parameters as needed
    model.add(LSTM(256, input_shape=input_shape, return_sequences=False))

    # Adding a Dense layer for some additional processing
    # You can modify the number of units or add more layers as required
    model.add(Dense(128, activation='relu'))

    # Output layer
    # The output dimension depends on the subsequent use-case (e.g., text generation)
    model.add(Dense(64, activation='relu'))

    return model


In [10]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def build_text_generator(input_shape):
    """
    Builds a text generator model.

    Args:
    - input_shape (tuple): The shape of the input vector (output of sequence model).

    Returns:
    - A Keras model designed for text generation.
    """

    model = Sequential()

    # Input layer
    model.add(Dense(256, activation='relu', input_shape=input_shape))
    model.add(Dropout(0.3))

    # Hidden layers
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.3))

    # Output layer
    # The number of units in the output layer should correspond to the size of your text vocabulary
    # For simplicity, let's assume a hypothetical vocabulary size of 1000
    model.add(Dense(1000, activation='softmax'))

    return model


In [12]:
def generate_summary(video_path):
    """
    Generates a textual summary for a given video using advanced methods with integrated sequence modeling.

    Args:
    - video_path (str): The path to the video file.

    Returns:
    - The generated textual summary as a string.
    """

    # Step 1: Extract and preprocess frames from the video
    frames = extract_frames(video_path)
    preprocessed_frames = preprocess_frames(frames)

    # Step 2: Feature Extraction
    feature_extractor = build_feature_extractor()
    features = feature_extractor.predict(preprocessed_frames)

    # Flatten the features to simplify the example
    flattened_features = features.reshape((features.shape[0], -1))

    # Step 3: Sequence Modeling
    # Utilizing the build_sequence_model function for sequence modeling
    sequence_model = build_sequence_model(flattened_features.shape[1:])
    video_representation = sequence_model.predict(flattened_features[np.newaxis, ...])

    # Step 4: Text Generation
    # Use a pre-trained language model for text generation
    # For demonstration, let's use GPT-2 model via Hugging Face's transformers library
    summarizer = pipeline("text-generation", model="gpt2")
    summary = summarizer(video_representation.flatten().tolist(), max_length=50, min_length=25, do_sample=False)[0]['generated_text']

    return summary

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [15]:
from google.colab import files
uploaded = files.upload()

Saving Wildlife Windows 7 Sample Video.mp4 to Wildlife Windows 7 Sample Video.mp4


In [16]:
video_path = '/content/Wildlife Windows 7 Sample Video.mp4'  # Path to the uploaded video file
summary = generate_summary(video_path)
print(summary)




ValueError: ignored