<a href="https://colab.research.google.com/github/Alimumtaz95/Q2_assignments/blob/main/Assignment_6_Exploring_LLM_Models_for_Creative_Video_Generation_and_Script_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##First we will install some important libraries##
1. **diffusers:** A library for building and using diffusion models, often used for image and video generation.
2. **transformers:** A library by Hugging Face for working with pre-trained language models (e.g., GPT, BERT).
3. **accelerate:** A library for optimizing and scaling AI training and inference across different hardware (e.g., CPU, GPU, multi-device setups).

In [None]:
!pip install -q diffusers transformers accelerate

##Setup for Video Generation with Diffusion Models##
This code initializes a diffusion-based video generation pipeline using the diffusers library. It imports PyTorch (torch) for tensor computations and GPU acceleration. The DiffusionPipeline class is used to load pre-trained diffusion models for generating media, while the DPMSolverMultistepScheduler optimizes the sampling process for faster and more efficient diffusion steps. The export_to_video utility helps convert the generated image frames into a video format, enabling smooth animations from the diffusion output.

In [None]:
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.



##Initialize a Text-to-Video Diffusion Pipeline
This code sets up a text-to-video generation pipeline using a pre-trained model from damo-vilab/text-to-video-ms-1.7b. The model operates with 16-bit floating point precision (fp16) for reduced memory usage and faster computation, enabled by setting torch_dtype=torch.float16. The DPMSolverMultistepScheduler is applied to optimize and accelerate the sampling process. The enable_model_cpu_offload() function ensures efficient memory management by offloading parts of the model to the CPU when not in active use, which is especially useful for devices with limited GPU memory.
###Requirements:

- PyTorch (with GPU and CUDA support)
- diffusers library
- torch_dtype=torch.float16 works best on a GPU with CUDA compute capability ≥ 7.0 (NVIDIA Volta, Turing, or later), where Tensor Cores accelerate half-precision math.
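
A quick pre-flight check can confirm these requirements before the model is downloaded. The sketch below is an illustration only, not part of the original notebook: `fp16_ready` and `describe_device` are hypothetical helper names, and the 7.0 threshold simply mirrors the figure in the list above.

```python
def fp16_ready(capability, minimum=(7, 0)):
    """Return True when a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= tuple(minimum)


def describe_device():
    """Report whether a CUDA GPU is visible; returns a short status string."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA GPU detected; switch the Colab runtime to a GPU"
    capability = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    suffix = "ok for fp16" if fp16_ready(capability) else "below 7.0"
    return f"{name} (compute capability {capability[0]}.{capability[1]}): {suffix}"


print(describe_device())
```

Running this first makes a CPU-only runtime obvious before any pipeline call is attempted.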

In [None]:
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



##Generate and Save a Text-to-Video Animation
This code generates a video based on the text prompt "A robot is cleaning a house" using the DiffusionPipeline. The pipeline produces a sequence of video frames (video_frames) with 25 inference steps. Each frame is processed as a NumPy array and converted to an 8-bit unsigned integer format (uint8) for compatibility with imageio, a library used to create and save videos. The video is saved to Google Drive at the specified path (/content/drive/MyDrive/output_video.mp4) with a frame rate of 8 FPS. After appending all frames, the video writer is closed, the filename is printed, and GPU memory is cleared using torch.cuda.empty_cache().
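
The float-to-`uint8` conversion described above can be sketched in isolation. This is a minimal example assuming the frames are float arrays in the range [0, 1], as the multiplication by 255 implies; `to_uint8` is a hypothetical helper, and the clipping step is an added safeguard rather than something the notebook itself performs.

```python
import numpy as np


def to_uint8(frame):
    """Scale a float frame in [0, 1] to uint8, clipping out-of-range values first."""
    frame = np.asarray(frame, dtype=np.float32)
    return (np.clip(frame, 0.0, 1.0) * 255).round().astype(np.uint8)


# A tiny 1x3 "frame" with one value outside the expected range
demo = np.array([[0.0, 0.5, 1.2]])
print(to_uint8(demo))
```

Clipping matters because a value slightly above 1.0 would exceed 255 after scaling and no longer fit in an 8-bit pixel.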

In [None]:
import imageio
import numpy as np
from google.colab import drive

prompt = "A robot is cleaning a house"
video_frames = pipe(prompt, num_inference_steps=25).frames

drive.mount('/content/drive')
video_path = '/content/drive/MyDrive/output_video.mp4'  # Output location on Google Drive

# Open the writer once; the context manager closes it even if a frame fails to write
with imageio.get_writer(video_path, fps=8) as writer:
    for frame in video_frames[0]:  # video_frames[0] holds the frames of the generated clip
        # Each frame is already a NumPy array, so only the dtype conversion is needed
        frame = (frame * 255).astype(np.uint8)  # Convert to uint8 for imageio
        writer.append_data(frame)

video_name = video_path.split('/')[-1]
print("Name:", video_name)
torch.cuda.empty_cache()  # Release cached GPU memory

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

##Install LangChain and Google Generative AI Integration
These commands install and upgrade key Python libraries for integrating LangChain with Google's Generative AI models:

1. langchain: A library for building applications powered by language models, enabling chaining multiple LLM calls, data processing, and workflow management.
2. google-genai: A library for interacting with Google's Generative AI models (e.g., Gemini).
3. langchain-google-genai: A bridge library that integrates Google GenAI models into LangChain workflows, simplifying the process of using Google’s LLMs in LangChain chains and agents.

In [None]:
!pip install -q -U langchain
!pip install -q -U google-genai
!pip install -q -U langchain-google-genai

##Set Up Google Gemini API Key in Colab
This code securely retrieves and sets the Google Gemini API Key from Google Colab's userdata storage. The userdata.get('Gemini_API_Key') function fetches the API key, which must have been previously stored in Colab's secure environment. The key is then assigned to the Gemini_API_Key environment variable via os.environ. This approach ensures that the API key remains secure and hidden from direct exposure in the notebook code, enabling safe authentication for interacting with Google's Generative AI services.
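
Since a missing secret would otherwise only surface later as an authentication failure, it can help to fail early with a clear message. The sketch below is an illustrative assumption, not the notebook's own code: `require_env` is a hypothetical helper, and `DEMO_API_KEY` is a placeholder variable used so that no real key is touched.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Placeholder value for demonstration only; never hard-code a real key
os.environ["DEMO_API_KEY"] = "dummy-key-for-demo"
print(len(require_env("DEMO_API_KEY")))
```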

In [None]:
from google.colab import userdata
import os
os.environ["Gemini_API_Key"] = userdata.get('Gemini_API_Key')

##Initialize Google Generative AI with LangChain Integration
This code imports essential modules to integrate Google's Generative AI models with LangChain. The ChatGoogleGenerativeAI class from langchain_google_genai enables communication with Google's chat-based Generative AI models, simplifying their use in LangChain workflows. The Content and Part types from google.genai.types define structured message formats for input and output when interacting with the models. Lastly, Markdown and display from IPython.display allow the rendering of model outputs in a formatted Markdown style within the notebook interface, enhancing readability for text-based responses.

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
from google.genai.types import Content, Part
from IPython.display import Markdown, display

##Initialize Gemini Model with LangChain Integration
This code sets up an instance of the Gemini-2.0-Flash-Exp model using the ChatGoogleGenerativeAI class from LangChain. The model parameter specifies the desired Gemini model version, optimized for fast and efficient responses. The api_key parameter securely retrieves the Google Gemini API key from Colab's userdata. The temperature parameter, set to 0.5, controls the creativity and randomness of the model's responses—lower values make outputs more deterministic, while higher values make them more diverse. This configuration allows for smooth integration of Google's Generative AI with LangChain workflows.
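
To make the effect of temperature concrete, here is a generic softmax-with-temperature sketch. It is not a description of Gemini's internal sampler: the function name and the three scores are made up for illustration, and it only shows the general principle that dividing scores by a smaller temperature concentrates probability on the top candidate.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.5 the highest-scoring candidate takes a larger share of the probability than at 2.0, which is why lower settings read as more deterministic.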

In [None]:
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-exp",
    api_key=userdata.get("Gemini_API_Key"),
    temperature=0.5,
)

### **Upload and Process Video with Google Gemini API**  

This code defines a function, `upload_video`, to **upload and process a video file** using the **Google Gemini API** via the `genai` client. The `client` is initialized with the **Gemini API key** securely retrieved from `userdata`.  

- **Upload Video:** The video file at `video_file_name` is uploaded using `client.files.upload`.  
- **Processing State:** A `while` loop checks the video processing state every **10 seconds** using `time.sleep(10)`. It waits until the state changes from `"PROCESSING"`.  
- **State Handling:** Once processing ends, a ready file reports the state `"ACTIVE"` and the function proceeds. If the state is `"FAILED"`, an exception is raised.  
- **Confirmation:** Upon successful processing, the video’s URI is printed, indicating that it’s ready for further use.  

**Key Parameter:**  
- `video_file_name`: Path to the video file (`/content/drive/MyDrive/output_video.mp4`).  

This ensures efficient handling of video uploads and state tracking during processing.
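
The polling pattern described above can also be written with an upper bound on the wait, so that a stuck upload does not loop forever. This is a minimal sketch under stated assumptions: `wait_until_processed` is a hypothetical helper, the timeout is an addition the notebook does not have, and the state sequence (including the made-up final state `"DONE"`) is simulated so the example runs without any API call.

```python
import time


def wait_until_processed(get_state, poll_seconds=10, timeout_seconds=600,
                         pending="PROCESSING", failed="FAILED", sleep=time.sleep):
    """Poll get_state() until it leaves `pending`; raise on failure or timeout."""
    waited = 0
    state = get_state()
    while state == pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {pending} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        state = get_state()
    if state == failed:
        raise ValueError(f"processing failed with state {state}")
    return state


# Simulated sequence of states standing in for repeated status lookups
states = iter(["PROCESSING", "PROCESSING", "DONE"])
final = wait_until_processed(lambda: next(states), sleep=lambda s: None)
print(final)
```

Passing the state lookup and the sleep function as arguments keeps the loop testable without waiting or contacting the service.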

In [None]:
import time
from google import genai

client = genai.Client(api_key=userdata.get("Gemini_API_Key"))
output_video = "/content/drive/MyDrive/output_video.mp4"

def upload_video(video_file_name):
  # Note: newer google-genai releases may expect `file=` instead of `path=` here
  video_file = client.files.upload(path = video_file_name)
  # Poll every 10 seconds until the service finishes processing the upload
  while video_file.state == "PROCESSING":
    print("Video Is Being Processed , Kindly Wait!")
    time.sleep(10)
    video_file = client.files.get(name = video_file.name or "")
  # A ready file reports "ACTIVE" rather than "SUCCESS", so only failure needs handling
  if video_file.state == "FAILED":
    raise ValueError(video_file.state)
  print(f"Video processing complete: {video_file.uri or ''}")
  return video_file

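This cell calls `upload_video` on the path stored in `output_video` and rebinds `output_video` to the returned file object, so later cells receive the uploaded file's handle (with its URI and MIME type) rather than the local path.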

In [None]:
output_video = upload_video(output_video)

Video Is Being Processed , Kindly Wait!
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/zqhey9sl6m9q


##Video Analysis Using LangChain to Generate Captions and Timecodes
The `analyze_video_with_langchain` function analyzes a video file with the `gemini-2.0-flash-exp` model. Despite its name, it calls the google-genai `client` directly (`client.models.generate_content`) rather than the LangChain `llm` object created earlier. It starts by defining a prompt template that instructs the model to generate captions for each scene in the video, including any spoken text, and to associate each caption with the corresponding timecode. The function then constructs a request containing two content objects: one with the video file’s URI and MIME type, and another with the prompt template. The request is sent to the model, which processes the video and generates the captions. Finally, the function prints a header and displays the model’s textual response as Markdown, which keeps the analysis results easy to read.

In [None]:
def analyze_video_with_langchain(video_file):
  prompt_template = """
  For each scene in this video,
  generate captions that describe the scene along with any spoken text placed in quotation marks.
  Place each caption into an object with the timecode of the caption in the video.
         """

  response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        Content(
            role="user",
            parts=[
                Part.from_uri(
                    file_uri=video_file.uri or "",
                    mime_type=video_file.mime_type or ""),
            ]),
        Content(
            role="user",
            parts=[Part(text=prompt_template)]
        )
    ]
  )
  print("Video Analysis:")
  display(Markdown(response.text))

This cell assigns `output_video`, the file object returned by `upload_video`, to `video` and passes it to `analyze_video_with_langchain(video)`. The function sends the uploaded video and the prompt to the model and displays the resulting scene captions, any spoken text, and their timecodes.

In [None]:
video = output_video
analyze_video_with_langchain(video)

Video Analysis:


```json
[
  {
    "timecode": "00:00",
    "caption": "A white robot with black accents and two small screens on its face stands on a tile floor. A robotic arm with a green brush is partially visible on the right side of the frame."
  },
  {
    "timecode": "00:01",
     "caption": "The robotic arm with a green brush is now fully visible. It is touching the tile floor."
   }
]
```
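
The reply shown above is JSON wrapped in a Markdown code fence, so it has to be unwrapped before it can be used programmatically. The sketch below is one possible way to do that, not part of the original notebook: `parse_captions` is a hypothetical helper, the sample reply is an abbreviated stand-in, and it assumes the model returns a JSON list of objects with `timecode` and `caption` keys.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_captions(reply_text):
    """Extract (timecode, caption) pairs from a model reply containing JSON."""
    # Use the fenced payload when a fence is present, otherwise the raw text
    match = FENCED_BLOCK.search(reply_text)
    payload = match.group(1) if match else reply_text
    captions = json.loads(payload)
    return [(item["timecode"], item["caption"]) for item in captions]


# Abbreviated, made-up sample in the same shape as the reply above
sample = (
    FENCE + "json\n"
    '[{"timecode": "00:00", "caption": "A robot stands on a tile floor."}]\n'
    + FENCE
)
for timecode, caption in parse_captions(sample):
    print(timecode, caption)
```

If the model ever returns malformed JSON, `json.loads` raises an error, so callers may want to catch `json.JSONDecodeError` and fall back to displaying the raw text.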
