<a href="https://colab.research.google.com/github/Tarandeep97/Video-Chatbot/blob/main/videochat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip -q install git+https://github.com/huggingface/transformers accelerate flash_attn
!pip -q install qwen_vl_utils av
!pip -q install git+https://github.com/juanbindez/pytubefix.git
!pip -q install moviepy
!pip -q install datasets

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


In [3]:
from pytubefix import YouTube

# Function to download Youtube video
def download_video(youtube_url, output_path='videos/', filename=None):
    try:
        yt = YouTube(youtube_url)
        stream = yt.streams.filter(progressive=True, file_extension='mp4') \
                           .order_by('resolution') \
                           .desc() \
                           .first()


        stream.download(output_path=output_path, filename=filename)
        print(f"Downloaded: {filename if filename else stream.default_filename} at {output_path}")
    except Exception as e:
        print(f"An error occurred: {e}")


video_url = 'https://www.youtube.com/watch?v=L3374C3OyrY'
download_video(video_url, filename='vid.mp4')

Downloaded: vid.mp4 at videos/


In [4]:
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "Qwen/Qwen2-VL-2B-Instruct"

# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto")

processor = AutoProcessor.from_pretrained(model_name)

`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
questions = [
    "Explain the technique to kick the ball?",
    "What is the best defensive move shown in the video?",
    "Describe the player's strategy for winning the match?"
]

messages = {
    "role": "user",
    "content": [
        {
            "type": "video",
            "video": "/content/videos/vid.mp4",
            "max_pixels": 360 * 420,
            "fps": 1.0,
        }
    ]
}

q_and_a_list = []

for question in questions:
    message_with_question = messages.copy()
    message_with_question["content"].append({
        "type": "text",
        "text": question
    })

    # Prepare input for the model
    text = processor.apply_chat_template([message_with_question], tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info([message_with_question])
    inputs = processor(text=[text], videos=video_inputs, return_tensors="pt", padding=True).to("cuda")

    # Run inference and get output
    output_ids = model.generate(**inputs, max_new_tokens=512)
    output_text = processor.batch_decode(output_ids, skip_special_tokens=True)

    # Store the question and its corresponding answer
    q_and_a_list.append({
        "q": question,
        "a": output_text[0].splitlines()[-1]  # Format answer properly
    })

# Display the formatted output as Question-Answer pairs
for idx, q_and_a in enumerate(q_and_a_list):
    print(f"Q{idx + 1}: {q_and_a['q']}")
    print(f"A{idx + 1}: {q_and_a['a']}\n")


qwen-vl-utils using torchvision to read video.


Q1: Explain the technique to kick the ball?
A1: The person in the video demonstrates the technique of kicking the ball by placing their foot on the ball and then kicking it with their other foot. This is a common technique used in soccer to control the ball and make precise passes.

Q2: What is the best defensive move shown in the video?
A2: The video showcases a technique where the player kicks the ball with their foot, creating a splash of water. The best defensive move shown in the video is the "tackle" or "tack" move, where the player uses their body to push the ball away from their opponent. This move is crucial in soccer to prevent the opponent from advancing and scoring.

Q3: Describe the player's strategy for winning the match?
A3: The video showcases a player kicking a soccer ball with a technique that involves the use of the ball's surface to create a powerful kick. The player's feet are positioned in a way that the ball is kicked with a high arc, creating a powerful shot. Th

In [6]:
from datasets import load_dataset
msrvtt = load_dataset("AlexZigma/msr-vtt")

In [7]:
def get_caption_video_data(data):
    captions = [entry['caption'] for entry in data]
    urls = [entry['url'] for entry in data]
    return captions, urls

captions, urls = get_caption_video_data(msrvtt['val'])

In [8]:
from transformers import BlipForConditionalGeneration, BlipProcessor, CLIPProcessor, CLIPModel

blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

In [26]:
from datasets import load_dataset
import evaluate
import matplotlib.pyplot as plt
import os
from pytubefix import YouTube
from moviepy.video.io.VideoFileClip import VideoFileClip
from transformers import BlipForConditionalGeneration, BlipProcessor, CLIPProcessor, CLIPModel, Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

# Load the MSR-VTT dataset
msrvtt = load_dataset("AlexZigma/msr-vtt")

# Extract validation data
val_data = msrvtt['val']

# Prepare evaluation metrics
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

# Define the question list
questions = [
    "What is a suitable caption for the video?"
]

# Directory to store downloaded videos
output_directory = './validation_videos'
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

# Initialize BLIP and CLIP models
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to("cuda")
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to("cuda")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Initialize Qwen2-VL model and processor
qwen_model_name = "Qwen/Qwen2-VL-2B-Instruct"
qwen_model = Qwen2VLForConditionalGeneration.from_pretrained(qwen_model_name, torch_dtype="auto", device_map="auto")
qwen_processor = AutoProcessor.from_pretrained(qwen_model_name)

def download_and_clip_video(youtube_url, start_time, end_time, output_path='videos/', filename=None):
    try:
        yt = YouTube(youtube_url)
        stream = yt.streams.filter(progressive=True, file_extension='mp4').order_by('resolution').desc().first()

        video_file_path = os.path.join(output_path, filename if filename else stream.default_filename)
        stream.download(output_path=output_path, filename=filename)
        print(f"Downloaded: {filename if filename else stream.default_filename} at {output_path}")

        video = VideoFileClip(video_file_path).subclip(start_time, end_time)
        clipped_video_file_path = os.path.join(output_path, f"clipped_{filename}")
        video.write_videofile(clipped_video_file_path, codec='libx264')
        return clipped_video_file_path
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

def extract_frame_from_video(video_file_path, time=1):
    try:
        video = VideoFileClip(video_file_path)
        frame_path = f"{video_file_path}_frame.jpg"
        video.save_frame(frame_path, t=time)
        return frame_path
    except Exception as e:
        print(f"An error occurred while extracting frame: {e}")
        return None

def generate_caption_blip(frame_path):
    image = Image.open(frame_path)
    inputs = blip_processor(images=image, return_tensors="pt").to("cuda")
    output = blip_model.generate(**inputs, max_length=20, num_beams=5)
    return blip_processor.decode(output[0], skip_special_tokens=True)

def generate_caption_qwen(image_path, questions):
    message_with_question = {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": image_path,
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {
                "type": "text",
                "text": questions[0]
            }
        ]
    }

    text = qwen_processor.apply_chat_template([message_with_question], tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info([message_with_question])
    inputs = qwen_processor(text=[text], videos=video_inputs, return_tensors="pt", padding=True).to("cuda")
    output_ids = qwen_model.generate(**inputs, max_new_tokens=512)
    return qwen_processor.batch_decode(output_ids, skip_special_tokens=True)[0]

# Placeholder for results
results = {
    "BLIP": [],
    "Qwen2-VL": []
}

# Limit the iteration to the first 5 videos
for i in range(5):
    video_url = val_data['url'][i]
    video_id = val_data['video_id'][i]
    start_time = val_data['start time'][i]
    end_time = val_data['end time'][i]

    video_file_path = download_and_clip_video(video_url, start_time, end_time, output_path=output_directory, filename=f"{video_id}.mp4")

    if video_file_path:
        frame_path = extract_frame_from_video(video_file_path)

        if frame_path:
            blip_caption = generate_caption_blip(frame_path)
            results["BLIP"].append(blip_caption)

            qwen_caption = generate_caption_qwen(frame_path, questions)
            results["Qwen2-VL"].append(qwen_caption)

# Assuming the ground truth captions are in the dataset
ground_truth_captions = [val_data['caption'][i] for i in range(5)]

# Evaluate BLIP and Qwen2-VL
bleu_score_blip = bleu.compute(predictions=results["BLIP"], references=ground_truth_captions)
meteor_score_blip = meteor.compute(predictions=results["BLIP"], references=ground_truth_captions)

bleu_score_qwen = bleu.compute(predictions=results["Qwen2-VL"], references=ground_truth_captions)
meteor_score_qwen = meteor.compute(predictions=results["Qwen2-VL"], references=ground_truth_captions)

# Output the results for BLIP and Qwen2-VL
print(f"BLIP - BLEU Score: {bleu_score_blip}")
print(f"BLIP - METEOR Score: {meteor_score_blip}")
print(f"Qwen2-VL - BLEU Score: {bleu_score_qwen}")
print(f"Qwen2-VL - METEOR Score: {meteor_score_qwen}")

# Display BLIP results
print("\nBLIP Generated Captions:")
for i, caption in enumerate(results["BLIP"]):
    print(f"Video {i + 1}: {caption}")

# Display Qwen2-VL results
print("\nQwen2-VL Generated Captions:")
for i, caption in enumerate(results["Qwen2-VL"]):
    print(f"Video {i + 1}: {caption}")


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.