# Improving Long-Range Multimodal Video Storytelling Using Transformer-Based Architectures

## Introduction

This project generates stories from videos using both visual and textual input. A common approach extracts per-frame image features with a CNN and models the temporal sequence with an LSTM.

LSTM-based models work well for short videos but struggle with long ones: information from earlier frames can be lost, which leads to incomplete or unclear captions.

To address this, the project uses a Transformer-based model. Transformers use attention to relate every frame to every other frame directly, which helps the model handle long videos and produce more coherent stories.

## Dataset

This project uses the StoryReasoning Dataset. The dataset contains sequences of images along with their corresponding text descriptions, which together form a visual story.

Each sequence is treated as a video, and the task is to predict the next part of the story from the preceding images and text. To avoid data leakage, the data is split at the video level, so all images from one sequence fall in the same subset:

- Training data: 80%
- Validation data: 10%
- Test data: 10%
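
As a rough illustration of a video-level split, the sketch below shuffles the unique sequence identifiers and assigns each one to exactly one subset. The function name `split_by_video`, the `story_###` identifiers, and the fixed seed are assumptions made for this example; they are not taken from the StoryReasoning dataset or from the project code.

```python
import random


def split_by_video(video_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Assign each unique video id to train, val, or test (hypothetical helper)."""
    unique_ids = sorted(set(video_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique_ids)

    n_train = int(len(unique_ids) * train_frac)
    n_val = int(len(unique_ids) * val_frac)

    return {
        "train": set(unique_ids[:n_train]),
        "val": set(unique_ids[n_train:n_train + n_val]),
        # Whatever remains goes to the test set.
        "test": set(unique_ids[n_train + n_val:]),
    }


# Made-up example: 50 stories with 3 frames each, one id entry per frame.
example_ids = [f"story_{i:03d}" for i in range(50) for _ in range(3)]
splits = split_by_video(example_ids)
```

Because the split is made over identifiers rather than over individual frames, no sequence can contribute frames to more than one subset.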

## Baseline Model

The baseline model follows a traditional approach for video storytelling. It uses a CNN model to extract features from each image and an LSTM model to process the sequence over time.

The CNN encodes the visual content of each frame, while the LSTM captures the order of frames in the video. However, this model has difficulty with long video sequences: the LSTM compresses everything it has seen into a fixed-size state, so important information from earlier frames may be forgotten.
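
To make the sequential bottleneck concrete, here is a minimal NumPy sketch of a single-layer LSTM run over per-frame feature vectors. It is not the project's baseline implementation: the weights are random, the "CNN features" are random stand-ins, and the names `lstm_encode`, `W`, `U`, and `b` are assumptions chosen for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_encode(frame_features, W, U, b):
    """Run one LSTM layer over the frames and return the final hidden state.

    frame_features: (T, D) array, one feature vector per frame.
    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias.
    """
    hidden_size = U.shape[1]
    h = np.zeros(hidden_size)  # hidden state
    c = np.zeros(hidden_size)  # cell state
    for x in frame_features:  # frames are processed strictly in order
        gates = W @ x + U @ h + b
        i, f, g, o = np.split(gates, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g  # forget gate scales down what was stored earlier
        h = o * np.tanh(c)
    return h


rng = np.random.default_rng(0)
T, D, H = 12, 16, 8  # frames, feature size, hidden size (arbitrary values)
features = rng.normal(size=(T, D))  # stand-in for CNN outputs
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
summary = lstm_encode(features, W, U, b)
```

Whatever the number of frames `T`, the whole sequence ends up summarized in a vector of size `H`, and each early frame reaches the end only through repeated multiplication by the forget gate.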

## Proposed Transformer-Based Model

To improve performance on long video sequences, this project uses a Transformer-based model instead of an LSTM-based model.

The Transformer uses an attention mechanism that lets it attend to all frames in the video at once, rather than passing information step by step. This helps the model retain important information from earlier frames and capture long-range relationships in the story.

By using this approach, the model is expected to generate more complete and meaningful captions for long videos.
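
The attention mechanism described above can be sketched as single-head scaled dot-product self-attention over frame features. This is a simplified illustration, not the project's model: there is no positional encoding, multi-head projection, or text input, and the names `self_attention`, `Wq`, `Wk`, and `Wv` as well as all sizes are assumptions.

```python
import numpy as np


def self_attention(frames, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    frames: (T, D) array of per-frame features.
    Wq, Wk, Wv: (D, d_k) projection matrices.
    Returns the attended features (T, d_k) and the attention weights (T, T).
    """
    Q = frames @ Wq
    K = frames @ Wk
    V = frames @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every frame scored against every frame
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights


rng = np.random.default_rng(0)
T, D, d_k = 12, 16, 8  # frames, feature size, projection size (arbitrary values)
frames = rng.normal(size=(T, D))  # stand-in for per-frame features
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(D, d_k)) for _ in range(3))
attended, attn = self_attention(frames, Wq, Wk, Wv)
```

Row `t` of `attn` holds the weights frame `t` places on every frame in the sequence, so the last frame can draw on the first one in a single step instead of through a chain of recurrent updates.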

## Experimental Setup

The performance of the proposed Transformer-based model is compared with the baseline LSTM-based model.

The dataset is divided into training, validation, and test sets. The training set is used to learn model parameters, the validation set is used to monitor performance during training, and the test set is used for final evaluation.

To evaluate the models, standard captioning metrics are used along with qualitative analysis to check how well the generated stories make sense over long video sequences.
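
The text does not name the specific captioning metrics. As one illustration of how n-gram overlap metrics work, the sketch below computes clipped n-gram precision, the building block of BLEU, using whitespace tokenization. The function name `clipped_ngram_precision` and the example sentences are assumptions, and a full BLEU score additionally combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram count is clipped to its count in the reference,
    so repeating a correct word does not inflate the score.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    cand_ngrams = Counter(
        tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)
    )
    ref_ngrams = Counter(
        tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
    )
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return overlap / total


# Made-up example sentences.
unigram_p = clipped_ngram_precision("the the the cat", "the cat sat", n=1)
bigram_p = clipped_ngram_precision("the the the cat", "the cat sat", n=2)
```

Overlap scores of this kind reward matching wording but say little about whether a story stays coherent across many frames, which is why a qualitative check is still useful alongside them.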

In [1]:
import numpy as np
import matplotlib.pyplot as plt