# Project 2: First Approaches to Multimodal Transformers: Bridging Text with Vision, Audio, and Video

# Objective: 

Instead of treating text, audio, and video as separate streams of information, you will design a **Transformer-based model that intelligently fuses two modalities**—text with images, text with audio, or text with video. Your challenge is to harness the power of deep learning to create a system where each modality enhances the other, unlocking richer, more meaningful insights.

This is more than just training a model—it’s about innovation. How will you design a fusion strategy that truly captures cross-modal relationships? Will your model generate creative text from images, answer questions from audio, or retrieve videos based on descriptions? The decisions are yours to make.

Even if you and your peers work with similar datasets, your approach must be unique. Whether through data choices, architectural modifications, or fusion techniques, your model should push the boundaries of multimodal AI. Experiment boldly, optimize strategically, and most importantly—create something exciting.



# Deliverables:

- A working model (hybrid architecture)
- A structured report (including visuals & reflections) **Required: Include details about your hybrid architecture!!!!**
- A GitHub repository with clean, documented code

---

# Step 1: Select your own adventure

Below is a concise, high-level breakdown of three main multimodal “adventures,” each with several task options and brief notes about potential datasets and implementation tips. This structure makes it easy to pick a project that best fits your interests and available resources—whether you prefer images, audio, or video combined with text. **Required in your reports regardless of the choice picked: Include details about your hybrid architecture!!!!**

## **Choice 1: Images + Text**

### 1. **Image Captioning**
- **Goal:** Automatically generate textual descriptions (captions) for given images.  
- **Potential Datasets:**  
  - **MS COCO** – Large-scale, ~330k images with multiple captions per image.  
  - **Flickr8k/30k** – Smaller datasets; useful for quick iteration.  
- **Implementation Tips:**  
  - Use a **CNN or Vision Transformer** to encode images, then a Transformer decoder for generating text.  
  - Evaluate output with **BLEU, METEOR, or CIDEr**.

### 2. **Visual Question Answering (VQA)**
- **Goal:** Answer open-ended questions about image content (e.g., “How many dogs are in this picture?”).  
- **Potential Datasets:**  
  - **VQA v2** – ~204k images and ~1.1 million questions, each with multiple human answers.  
  - **GQA** – Emphasizes compositional reasoning.  
- **Implementation Tips:**  
  - Fuse **image features** (from a CNN/ViT) with **question embeddings** (Transformer for text).  
  - Evaluate with **accuracy** for classification-based answers or **language metrics** for open-ended answers.


### 3. **Image-Text Retrieval**
- **Goal:** Retrieve the most relevant images given a text query, or vice versa.  
- **Potential Datasets:**  
  - **MS COCO** – Commonly used for both captioning and retrieval.  
  - **Flickr30k** – ~31k images with five captions each; a standard benchmark for retrieval tasks.  
- **Implementation Tips:**  
  - Use **dual encoders** for image and text, trained with a **contrastive loss** to align modalities.  
  - Evaluate with **Recall@K** or **mean rank** metrics.
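A minimal sketch of the contrastive objective mentioned above, assuming PyTorch and a batch in which row `i` of the image embeddings matches row `i` of the text embeddings. The function name `contrastive_loss` and the temperature of 0.07 are illustrative choices, not requirements of the assignment.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss; row i of both inputs is a matched pair."""
    # L2-normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # logits[i, j] = similarity between image i and text j, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature
    # The correct "class" for row i is column i (the diagonal).
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2


# Toy check with random stand-in embeddings (hypothetical data).
torch.manual_seed(0)
emb = torch.randn(4, 16)
aligned = contrastive_loss(emb, emb)                    # pairs match
shuffled = contrastive_loss(emb, emb.roll(1, dims=0))   # pairs deliberately mismatched
```

In this toy check the aligned batch should score a much lower loss than the shuffled one, which is the behavior the dual encoders are trained toward.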

---

## **Choice 2: Audio + Text**

### 1. **Speech Recognition**
- **Goal:** Convert spoken language (waveforms) into written text transcripts.  
- **Potential Datasets:**  
  - **LibriSpeech** – ~1,000 hours of English audiobook recordings.  
  - **Mozilla Common Voice** – Crowd-sourced, multilingual speech data.  
- **Implementation Tips:**  
  - Convert waveforms into **Mel spectrograms**, or use **wav2vec2** (pretrained).  
  - Evaluate with **Word Error Rate (WER)**.
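WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The following is a small, dependency-free sketch; the function name `word_error_rate` and the whitespace tokenization are simplifying assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


perfect = word_error_rate("turn on the light", "turn on the light")
one_deletion = word_error_rate("turn on the light", "turn the light")
```

Real evaluations usually normalize casing and punctuation before scoring; that step is omitted here.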

### 2. **Audio-Text Alignment**
- **Goal:** Match spoken words or segments in an audio file to their written transcripts (often down to timestamps).  
- **Potential Datasets:**  
  - **TED-LIUM** – TED talks with aligned transcripts.  
  - **YouTube** auto-transcripts (though noisier).  
- **Implementation Tips:**  
  - Segment audio frames; align with text tokens.  
  - Use **CTC-based** approaches or techniques like Dynamic Time Warping (DTW).  
  - Applications: **karaoke-style** subtitles, real-time captioning.
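To illustrate the DTW option, here is a plain-Python sketch that finds the cheapest monotonic alignment between audio frames and text tokens. It assumes you already have a cost matrix `cost[i][j]` measuring how poorly frame `i` matches token `j`; how you compute that cost (and the helper name `dtw_path`) is left to your design.

```python
def dtw_path(cost):
    """Return (total_cost, path) for the cheapest monotonic alignment.

    cost[i][j] is the mismatch between audio frame i and text token j.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]  # accumulated cost table
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,                 # frame advances
                acc[i][j - 1] if j > 0 else inf,                 # token advances
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,   # both advance
            )
            acc[i][j] = cost[i][j] + best_prev
    # Backtrack from the end to recover which frame aligns with which token.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path


# Toy example: 4 frames with 1-D features, 2 tokens with 1-D prototypes (made up).
frames = [0.0, 0.0, 1.0, 1.0]
tokens = [0.0, 1.0]
toy_cost = [[abs(f - t) for t in tokens] for f in frames]
total, path = dtw_path(toy_cost)
```

In the toy example the first two frames align with the first token and the last two with the second, which is how token-level timestamps can be read off the path.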


### 3. **Spoken Command Classification**
- **Goal:** Identify short, predefined voice commands like “Turn on the light.”  
- **Potential Datasets:**  
  - **Google Speech Commands** – Tens of thousands of short utterances for specific commands.  
- **Implementation Tips:**  
  - Treat this as a **classification task**: label each audio clip with the intended command.  
  - Evaluate with **accuracy** or **F1 score**.
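As a quick reference for the two metrics above, the sketch below computes accuracy and macro-averaged F1 (the unweighted mean of per-class F1 scores) in plain Python. The label strings are made-up examples, and macro averaging is one assumption among several valid ways to average F1.

```python
def accuracy(y_true, y_pred):
    """Fraction of clips whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["yes", "no", "up", "up"]
y_pred = ["yes", "up", "up", "up"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro F1 penalizes the missed "no" class more visibly than accuracy does, which matters when command classes are imbalanced.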

---

## **Choice 3: Video + Text**

### 1. **Video Captioning**
- **Goal:** Generate textual descriptions for short videos (e.g., “A person cooking pasta”).  
- **Potential Datasets:**  
  - **MSR-VTT** – ~10k short video clips, each with multiple captions.  
  - **YouCook2** – Cooking videos with detailed instructions.  
- **Implementation Tips:**  
  - Sample frames (e.g., 1 fps) for each video.  
  - Encode frames (CNN/ViT) and use a Transformer decoder for text.  
  - Evaluate with **BLEU, METEOR, or CIDEr**.
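The frame-sampling tip can be sketched as an index computation that is independent of any video library. The helper name `sample_frame_indices` is hypothetical; you would pass the resulting indices to whatever decoder you use.

```python
from typing import List


def sample_frame_indices(num_frames: int, native_fps: float,
                         target_fps: float = 1.0) -> List[int]:
    """Indices of the frames to keep when downsampling to `target_fps`."""
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames and both frame rates must be positive")
    # Number of native frames between two kept frames (never below 1,
    # so asking for more than the native rate simply keeps every frame).
    step = max(native_fps / target_fps, 1.0)
    indices = []
    k = 0
    while round(k * step) < num_frames:
        indices.append(round(k * step))
        k += 1
    return indices


# A 3-second clip at 30 fps, sampled at 1 fps.
kept = sample_frame_indices(num_frames=90, native_fps=30.0)
```

Sampling at a fixed rate keeps the sequence length proportional to clip duration; sampling a fixed number of frames per clip is an equally reasonable alternative.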


### 2. **Video Question Answering (Video QA)**
- **Goal:** Answer questions based on video content (objects, actions, context).  
- **Potential Datasets:**  
  - **TVQA** – TV show clips plus questions about dialogue and visuals.  
  - **LSMDC** – Movie clips with descriptions/questions.  
- **Implementation Tips:**  
  - Extract **visual features** from sampled frames; optionally include **subtitles/transcripts**.  
  - Fuse them with question embeddings in a multimodal Transformer.  
  - Evaluate with **accuracy** or open-ended **language metrics**.


### 3. **Text-Based Video Retrieval**
- **Goal:** Find relevant video clips from a database based on a text query (e.g., “Videos of someone playing guitar”).  
- **Potential Datasets:**  
  - **MSR-VTT** – Clips paired with captions that can serve as text queries.  
  - **ActivityNet Captions** – Videos with temporal captions.  
- **Implementation Tips:**  
  - Use **dual encoders** or a **joint embedding** space.  
  - Evaluate with **Recall@K**, **MRR** (Mean Reciprocal Rank), or similar retrieval metrics.
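Both retrieval metrics can be derived from the rank of the correct item for each query. The sketch below assumes a square similarity matrix in which the correct candidate for query `q` sits at index `q`; the function names and the toy scores are illustrative.

```python
def ranks_of_correct(similarity):
    """1-based rank of the correct candidate for each query.

    similarity[q][c] is the score of candidate c for query q, and the
    correct candidate for query q is assumed to be index q.
    Ties are resolved optimistically (only strictly higher scores count).
    """
    ranks = []
    for q, scores in enumerate(similarity):
        target = scores[q]
        ranks.append(1 + sum(1 for s in scores if s > target))
    return ranks


def recall_at_k(ranks, k):
    """Fraction of queries whose correct item appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy text-to-video scores for 3 queries and 3 clips (made up).
similarity = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.5],
    [0.1, 0.2, 0.7],
]
ranks = ranks_of_correct(similarity)
r_at_1 = recall_at_k(ranks, 1)
mrr = mean_reciprocal_rank(ranks)
```

Here the second query ranks its correct clip last, so Recall@1 is 2/3 while MRR still gives partial credit for that query.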

---


## **General Recommendations**


#### 1. **Transformer Architecture**

- **Separate Encoders:** Build one encoder for text and another for your chosen modality. Fuse the resulting embeddings either through cross-attention or by concatenating them and feeding the result into further layers.

- **Learned Modality Embeddings:** Introduce special learned tokens (e.g., [IMAGE], [AUDIO], [VIDEO]) to flag which modality a token or embedding belongs to. This can help the Transformer distinguish between, say, a text token vs. an image patch embedding.

- **Cross-Attention:** If you’re using an encoder–decoder structure (common for generation tasks like captioning), the decoder can attend to both text representations and other modality representations. This is especially potent if your final output is text (e.g., describing an image or transcribing an audio snippet).

- **Positional or Spatial Embeddings:**
  - **Images/Videos:** 2D positional embeddings to capture spatial layout.
  - **Audio:** Time–frequency positional embeddings to reflect temporal progression.
  - **Text:** Standard 1D positional embeddings or relative positioning can suffice.
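One fixed (non-learned) way to realize the positional signals above is the sinusoidal encoding from the original Transformer, extended to 2D by encoding row and column separately and concatenating the halves. This is only one option; learned embeddings are equally valid. The function names are illustrative.

```python
import math
from typing import List


def sinusoidal_1d(position: int, dim: int) -> List[float]:
    """Sine/cosine encoding of one position into `dim` values (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding = []
    for i in range(0, dim, 2):
        # Each sin/cos pair uses a different wavelength.
        angle = position / (10000 ** (i / dim))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding


def sinusoidal_2d(row: int, col: int, dim: int) -> List[float]:
    """Half of the channels encode the row, the other half the column."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    return sinusoidal_1d(row, dim // 2) + sinusoidal_1d(col, dim // 2)


text_pos = sinusoidal_1d(position=0, dim=4)      # token at index 0
patch_pos = sinusoidal_2d(row=1, col=2, dim=8)   # image patch at grid cell (1, 2)
```

For a spectrogram, the same 2D function could take a (time frame, frequency bin) pair in place of (row, column).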

#### 2. **Fusion Strategy for Multimodality**

- **Concatenation:** The simplest method—just stack text embeddings and modality embeddings along the sequence dimension. Make sure each chunk has a clear positional signal.

- **Attention-based Fusion:**
  - Let each modality have its own encoder.
  - Combine them via cross-attention in later layers, where the text representation attends to the image/audio/video representation or vice versa.
  - You might even try mutual cross-attention for an even richer representation.

- **Late Fusion:** Encode each modality separately, then merge the final embeddings (e.g., by averaging, concatenation, or a learnable projection) to feed into a classification or decoding head.
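Two of the fusion strategies above can be sketched in a few lines of PyTorch. This is an illustrative starting point, not the required architecture: the class name `CrossAttentionFusion`, the helper `concat_fusion`, the embedding size of 64, and the random tensors standing in for encoder outputs are all assumptions.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Text tokens (queries) attend to features of a second modality (keys/values)."""

    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, other, other_padding_mask=None):
        # other_padding_mask: (batch, other_len) bool, True marks padded
        # positions that attention should ignore.
        attended, _ = self.attn(query=text, key=other, value=other,
                                key_padding_mask=other_padding_mask)
        return self.norm(text + attended)  # residual connection + layer norm


def concat_fusion(text, other, modality_emb: nn.Embedding):
    """Tag each chunk with a learned modality embedding, then stack along the sequence."""
    text = text + modality_emb.weight[0]    # index 0 = text
    other = other + modality_emb.weight[1]  # index 1 = second modality
    return torch.cat([text, other], dim=1)


torch.manual_seed(0)
text = torch.randn(2, 5, 64)    # (batch, text tokens, dim), stand-in encoder output
image = torch.randn(2, 9, 64)   # (batch, image patches, dim), stand-in encoder output
image_pad = torch.zeros(2, 9, dtype=torch.bool)
image_pad[1, 6:] = True         # pretend the last 3 patches of sample 1 are padding

fused_cross = CrossAttentionFusion()(text, image, image_pad)     # shape (2, 5, 64)
fused_concat = concat_fusion(text, image, nn.Embedding(2, 64))   # shape (2, 14, 64)
```

Cross-attention keeps the text sequence length unchanged, while concatenation produces one longer sequence that a shared Transformer encoder can process; which works better is something to test on your own task.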

#### 3. **Training Loop and Objective**

- **Loss Functions**
  - **Text generation** (e.g., captioning): Cross-entropy on the predicted tokens.
  - **Classification** (VQA, spoken command classification): Cross-entropy or binary cross-entropy.
  - **Retrieval** (matching text to images/audio/video): Contrastive or triplet loss.

- **Masking**
  - Carefully handle [PAD] tokens so the attention mechanism ignores those placeholders. Use key padding masks in PyTorch for both the source and target (e.g., the `src_key_padding_mask` and `tgt_key_padding_mask` arguments of `nn.Transformer`).

- **Training Details**
  - Use AdamW or a similar optimizer with a suitable learning rate scheduler (e.g., warmup + decay).
  - Watch your GPU memory usage. If your model or data is large, consider gradient checkpointing or reducing the batch size.
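A "warmup + decay" schedule can be expressed as a multiplier on the base learning rate. The sketch below uses linear warmup followed by cosine decay, which is one common choice among several; the function name and step counts are illustrative. A function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`.

```python
import math


def warmup_cosine_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier for the base LR: linear warmup 0 -> 1, then cosine decay 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# Multipliers at a few points of a 100-step run with 10 warmup steps.
schedule = [warmup_cosine_factor(s, warmup_steps=10, total_steps=100)
            for s in (0, 5, 10, 55, 100)]
```

The multiplier rises to 1.0 at the end of warmup, passes through roughly 0.5 halfway through the decay phase, and reaches 0.0 at the final step.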

---

## **Summary**
Each **Choice** (Images + Text, Audio + Text, or Video + Text) comes with **three distinct tasks** of escalating complexity. Select the modality and task that excite you most and that fit your available computing resources. Focus on building a solid **data pipeline**, leveraging **pretrained models**, and performing **continuous evaluation** to ensure tangible progress over your project timeline. 

### **You will need to research some of the approaches recommended here, but, believe me, that is how the real world works! Frustration is always allowed!**


---

## Clarification

You **don't** need to develop an interactive application for this project. The demo will serve as a platform to communicate your results.

# Step 2: Submit Your Work

Your submission package should include:

1. **GitHub Repository** (Well-documented code). ``add`` and ``commit`` the final version of your work, and ``push`` your code to your GitHub repository. You can have multiple notebooks. It is up to you.
2. **Project Report** – IEEE-format paper of no more than 4 pages addressing the architecture, task outcomes, and reflections. When writing this report, consider a business-oriented person as your reader (e.g., your PhD advisor, your internship manager, etc.). Tell the story of each dataset's goal and the tasks covered. **Required: Include details about your hybrid architecture!!!!** Also, include insights about:
   - Significance of your implementation
   - Accuracy, loss curves, feature importance
   - What worked, what didn’t, what’s next?
   - Where could this be applied?

3. **Demo Link or Video** (Showcasing your model’s workflow)
4. **README.md file.** Edit the README.md file in your repository to explain how to use your code. Ensure reproducibility (environment requirements, `requirements.txt`, or `environment.yml`).


**``Submit the URL of your GitHub Repository as your assignment submission on Canvas.``**

