# From Video to Code: Automating Code Generation with ERNIE 4.5-VL

### 1. Project Overview

This tutorial builds an end-to-end workflow that converts video-based programming tutorials into runnable front-end web code. A large multimodal model reads frames sampled from the video, and its textual descriptions of those frames are then used to generate structured code.

The workflow comprises the following key technical steps:
1.  **Video Ingestion**: Downloading the source video file from a specified URL.
2.  **Visual Information Extraction**: Pre-processing the video file by sampling frames at a fixed interval and encoding the image data.
3.  **Segmented Content Comprehension**: Grouping the extracted frame sequences into chunks and making parallel calls to the ERNIE 4.5-VL model API to obtain structured text descriptions for each video segment.
4.  **Aggregation and Code Generation**: Consolidating the descriptive text from all segments into a comprehensive summary of the entire video. This summary serves as the context for a final call to the ERNIE 4.5-VL model to generate the complete HTML code.
5.  **Result Presentation**: Saving the generated code to an HTML file and automatically opening it in a local browser for validation.
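
The sampling interval in step 2 and the chunk size in step 3 together determine how many model calls a video needs. The sketch below works through that arithmetic, using the one-second interval and 30 frames per chunk that the tutorial's code uses later. The helper name `estimate_chunk_count` and the 30 fps default are assumptions made for illustration.

```python
import math

def estimate_chunk_count(duration_sec, fps=30.0, interval_sec=1, frames_per_chunk=30):
    """Estimate how many frames are sampled and how many chunk requests follow."""
    total_frames = int(duration_sec * fps)
    step = max(1, int(fps * interval_sec))        # keep every `step`-th frame
    sampled = math.ceil(total_frames / step)      # frames 0, step, 2*step, ...
    chunks = math.ceil(sampled / frames_per_chunk)
    return sampled, chunks

sampled, chunks = estimate_chunk_count(duration_sec=300)
print(f"{sampled} sampled frames -> {chunks} chunk requests")
```

Under these assumptions, a five-minute video yields 300 sampled frames and 10 chunk requests, plus one final request for code generation.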

### 2. Core Technology: An Overview of ERNIE 4.5-VL

The core of this workflow is the ERNIE 4.5 Vision-Language Model (VLM), which possesses the following characteristics highly relevant to this task:

*   **Multimodal Heterogeneous MoE Architecture**: The model is pre-trained jointly on both text and visual modalities. This design enables it to effectively capture cross-modal information, which is critical for accurately understanding both the code text and the corresponding visual styles presented in the video frames.
*   **State-of-the-Art Visual Understanding**: ERNIE 4.5-VL demonstrates exceptional performance on benchmarks related to visual perception, document analysis, and chart comprehension. This capability is fundamental to the model's ability to accurately identify syntax, structure, and visual layouts from video frames.
*   **Open-Source License**: The ERNIE 4.5 model family is released under the Apache 2.0 license, which permits its use in technical research and application development.

This tutorial uses the visual analysis and code generation capabilities of ERNIE 4.5-VL to convert unstructured video into structured code.

![ERNIE 4.5-VL Performance Comparison](ernie_4.5_vl_performance.png)

### 3. Environment and Dependency Installation

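This workflow requires Python libraries for video downloading, image processing, and API interaction. In addition, `yt-dlp` relies on an `ffmpeg` binary on the system PATH to merge separately downloaded video and audio streams into a single MP4 file.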

Run the following cell to install all necessary dependencies.

In [1]:
# Run this cell to install all required libraries
!pip install yt-dlp opencv-python openai tqdm


### 4. Library Imports and Parameter Configuration

This section imports the required Python modules and defines global parameters, including API credentials and the target video URL.

**Instructions:**
-   Replace the value of the `API_KEY` variable with your Baidu AI Studio Access Token.
-   The `VIDEO_URL` is preset with an example link. You may replace it with a URL to another front-end development tutorial for testing.

In [None]:
import os
import cv2
import time
import base64
import yt_dlp
import shutil
import webbrowser
from openai import OpenAI
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed

# --- Configuration Center ---

# 1. API Credential Configuration
# !!! IMPORTANT: Replace with your own Baidu AI Studio access token !!!
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://aistudio.baidu.com/llm/lmapi/v3"
MODEL_NAME = "ernie-4.5-turbo-vl"

# 2. Task Parameter Configuration
VIDEO_URL = "https://www.bilibili.com/video/BV1xQ4y167xr/" # <-- Specify the target video URL here
MAX_CONCURRENT_REQUESTS = 4  # Number of concurrent API requests
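
Hardcoding a credential in a notebook makes it easy to leak when the notebook is shared. As an optional alternative, the sketch below reads the token from an environment variable. The variable name `AISTUDIO_API_KEY` and the helper `load_api_key` are hypothetical names chosen for this example, not part of any official SDK.

```python
import os

def load_api_key(env_var="AISTUDIO_API_KEY", placeholder="YOUR_API_KEY"):
    """Return the token from the environment; fail early if it is unset or a placeholder."""
    key = os.environ.get(env_var, "").strip()
    if not key or key == placeholder:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token.")
    return key

# Possible usage in the configuration cell (assumes the variable has been exported):
# API_KEY = load_api_key()
```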

### 5. Core Function Definitions

The various stages of the workflow are encapsulated into separate functions to promote modularity and code clarity.

- **`download_video`**: Downloads the video file from the specified URL.
- **`extract_frames`**: Reads the video file to extract and encode the image frame sequence.
- **`process_chunk_with_retry`**: Calls the model API to process the frame sequence of a single video chunk, including retry logic for network requests.
- **`aggregate_and_generate_webpage`**: Consolidates the analysis results from all chunks and calls the model API to generate the final web page code.
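
The retry logic in `process_chunk_with_retry` waits a fixed two seconds between attempts. If the API rate-limits concurrent requests, a growing delay can be gentler on the service. The following is a minimal sketch of exponential backoff with jitter, offered as an alternative pattern rather than what the tutorial's code does. The helper `call_with_backoff` and its parameters are hypothetical.

```python
import random
import time

def call_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() up to max_retries times, waiting longer after each failure."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # broad on purpose, mirroring the tutorial's handler
            last_error = exc
            if attempt < max_retries - 1:
                # Waits of about 1s, 2s, 4s, ... plus up to 0.5s of random jitter
                sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise last_error
```

The `sleep` parameter exists only so the delay can be replaced in tests. In the notebook, the API call could be wrapped as `call_with_backoff(lambda: client.chat.completions.create(...))`.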

In [3]:
def download_video(video_url, output_dir="temp_video"):
    """
    Downloads a video from the given URL using yt-dlp.
    Parameters:
        video_url (str): The URL of the video.
        output_dir (str): A temporary directory to store the downloaded video.
    Returns:
        str: The local file path of the video if successful, otherwise None.
    """
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(output_dir)
    print(f">>> Downloading video: {video_url}")
    
    ydl_opts = {
        'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best',
        'outtmpl': os.path.join(output_dir, 'video.%(ext)s'),
        'quiet': True,
        'no_warnings': True,
        'merge_output_format': 'mp4'
    }
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            ydl.download([video_url])
        for file in os.listdir(output_dir):
            if file.startswith("video"):
                print("✅ Video downloaded successfully.")
                return os.path.join(output_dir, file)
        print("❌ Video download failed: Downloaded file not found.")
        return None
    except Exception as e:
        print(f"❌ Video download failed: {e}")
        return None

def extract_frames(video_path, interval_sec=1):
    """
    Extracts image frames from a video file at a specified interval.
    Parameters:
        video_path (str): The local path to the video file.
        interval_sec (int or float): The time interval in seconds for frame extraction.
    Returns:
        list: A list of frame chunks. Each chunk holds up to 30 Base64-encoded JPEG strings; the last chunk may be shorter.
    """
    print(">>> Processing video and extracting frames...")
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print("❌ Error: Could not open video file.")
        return []
        
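    fps = cap.get(cv2.CAP_PROP_FPS)
    if not (fps > 0): fps = 30  # fall back when the container reports no usable frame rate
    # Keep every `frame_step`-th frame; max(1, ...) avoids a modulo-by-zero for very small intervals
    frame_step = max(1, int(round(fps * interval_sec)))

    chunk_frames, chunks, frame_count = [], [], 0
    
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret: break
        
        if frame_count % frame_step == 0: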
            height, width = frame.shape[:2]
            scale = 512 / height
            resized_frame = cv2.resize(frame, (int(width * scale), 512))
            
            _, buffer = cv2.imencode('.jpg', resized_frame, [int(cv2.IMWRITE_JPEG_QUALITY), 60])
            frame_b64 = base64.b64encode(buffer).decode('utf-8')
            chunk_frames.append(frame_b64)
            
            if len(chunk_frames) == 30:
                chunks.append(chunk_frames)
                chunk_frames = []
        frame_count += 1
    
    if chunk_frames:
        chunks.append(chunk_frames)
        
    cap.release()
    print(f"✅ Frame extraction complete. Video divided into {len(chunks)} chunks.")
    return chunks

def process_chunk_with_retry(client, chunk_index, frames_b64, max_retries=3):
    """
    Processes a single video chunk using the ERNIE API with a retry mechanism.
    Parameters:
        client (OpenAI): The API client instance.
        chunk_index (int): The index of the chunk.
        frames_b64 (list): A list of Base64-encoded image strings for the chunk.
        max_retries (int): The maximum number of retry attempts.
    Returns:
        tuple: A tuple containing the chunk index and the model's text description.
    """
    prompt = "This is a segment from a web development video tutorial (one screenshot per second). Focus on the code on screen and the final web page style. Describe in detail the HTML structure, CSS styling code, or JS interaction logic shown in this segment."
    
    content = [{"type": "text", "text": prompt}]
    for f in frames_b64:
        content.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}})
        
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=MODEL_NAME,
                messages=[{"role": "user", "content": content}],
                temperature=0.1, max_tokens=1024
            )
            return chunk_index, response.choices[0].message.content
        except Exception as e:
            if attempt == max_retries - 1:
                tqdm.write(f"❌ Chunk {chunk_index+1} failed after {max_retries} attempts: {e}")
            else:
                time.sleep(2)  # brief pause before the next attempt
    return chunk_index, ""

def aggregate_and_generate_webpage(client, summaries):
    """
    Aggregates summaries from all chunks and calls the API to generate the final webpage code.
    Parameters:
        client (OpenAI): The API client instance.
        summaries (dict): A dictionary of chunk indices and their text descriptions.
    Returns:
        str: The complete generated HTML code.
    """
    print("\n>>> Aggregating content and generating final code...")
    full_summary = "\n".join([f"Summary of Segment {i+1}: {s}" for i, s in sorted(summaries.items()) if s])
    
    final_prompt = f"""
    You are an expert front-end engineer. Based on the following segmented summaries extracted from a programming video tutorial, write a single, complete HTML file to reproduce the final result shown in the video.

    **Segmented Video Summaries:**
    ---
    {full_summary}
    ---

    **Instructions:**
    1.  Your code must be strictly based on the HTML structure, CSS styles, and functionality mentioned in the summaries.
    2.  Ignore any descriptions in the summaries that are not related to programming.
    3.  If the summary information is incomplete, use your professional knowledge to reasonably complete the code to form a fully functional webpage.
    4.  All HTML and CSS (within a `<style>` tag) must be in a single file.
    5.  Return only the raw HTML code, starting with `<!DOCTYPE html>`, without any explanations or markdown tags.
    """
    
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": final_prompt}],
        stream=True, temperature=0.2, top_p=0.8
    )
    
    html_code = ""
    print(">>> The model is now generating code:\n")
    for chunk in response:
        # Some streamed chunks may carry no choices or an empty delta; skip them
        if not chunk.choices:
            continue
        piece = getattr(chunk.choices[0].delta, "content", None)
        if piece:
            print(piece, end="", flush=True)
            html_code += piece
    return html_code
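
Although the final prompt asks for raw HTML, a model may still wrap its answer in a Markdown code fence, which would then end up in the saved file. The sketch below shows one way to remove such a wrapper before writing the file. The helper `strip_code_fences` is a hypothetical addition and is not called anywhere in the tutorial's code.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example is easy to embed
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)

def strip_code_fences(text):
    """Return the fenced body if the whole text is one fenced block, else the text itself."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()

wrapped = FENCE + "html\n<!DOCTYPE html>\n<html></html>\n" + FENCE
print(strip_code_fences(wrapped))
```

Text that is not wrapped in a fence is returned unchanged apart from trimmed whitespace, so applying the helper to well-behaved output should be harmless.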

### 6. Main Program Execution

Running the following cell will initiate the entire automated workflow. The script will use the `VIDEO_URL` specified in the configuration section and execute all defined functions in sequence.
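
One detail of the concurrent step is worth illustrating: `as_completed` yields futures in the order they finish, not the order they were submitted, which is why each worker returns its chunk index alongside the summary. The self-contained sketch below reproduces the pattern with a stand-in function (`fake_describe`, a hypothetical placeholder for the API call) and restores the original order by sorting on the index.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_describe(index, delay):
    """Stand-in for an API call; earlier chunks are made to finish later."""
    time.sleep(delay)
    return index, f"summary {index}"

delays = [0.3, 0.2, 0.1]
results = {}
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(fake_describe, i, d) for i, d in enumerate(delays)]
    for future in as_completed(futures):
        idx, text = future.result()
        results[idx] = text  # arrival order need not match submission order

# Sorting by chunk index restores the video's chronological order
ordered = [text for _, text in sorted(results.items())]
print(ordered)
```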

In [4]:
def main():
    """The main execution function that orchestrates the entire workflow."""
    # The placeholder checked here must match the default value set in the configuration cell
    if not API_KEY or "YOUR_API_KEY" in API_KEY:
        print("❌ Error: Please return to the configuration cell (step 4) and set your API_KEY correctly.")
        return

    # Initialize the API client
    client = OpenAI(api_key=API_KEY, base_url=BASE_URL)
    
    # Steps 1 & 2: Download and extract frames
    video_path = download_video(VIDEO_URL)
    if not video_path: return
    
    chunks = extract_frames(video_path)
    if not chunks: return
    
    # Step 3: Concurrently process all chunks
    print(f">>> Starting concurrent processing of {len(chunks)} video chunks (Concurrency: {MAX_CONCURRENT_REQUESTS})...")
    chunk_summaries = {}
    
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_REQUESTS) as executor:
        future_to_chunk_index = {executor.submit(process_chunk_with_retry, client, i, chunk): i for i, chunk in enumerate(chunks)}
        
        for future in tqdm(as_completed(future_to_chunk_index), total=len(chunks), desc="Processing Chunks"):
            idx, summary = future.result()
            if summary:
                chunk_summaries[idx] = summary
                
    if not chunk_summaries:
        print("\n❌ No video chunk was processed successfully. Cannot generate final code.")
        return
        
    # Step 4: Aggregate content and generate webpage
    html_code = aggregate_and_generate_webpage(client, chunk_summaries)
    
    # Step 5: Save and open the result
    output_file = "final_result.html"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(html_code)
    
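    print(f"\n\n>>> Workflow finished! Webpage saved as: {output_file}")
    from pathlib import Path  # used to build a portable file:// URL
    print(">>> Opening the result page in your default browser...")
    # as_uri() produces a valid file:// URL on every platform, including Windows paths
    webbrowser.open(Path(output_file).resolve().as_uri())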
    
    # Clean up temporary files
    shutil.rmtree("temp_video", ignore_errors=True)
    print(">>> Temporary files have been cleaned up.")

# --- Run the main program ---
main()

>>> Downloading video: https://www.bilibili.com/video/BV1xQ4y167xr/
✅ Video downloaded successfully.
>>> Processing video and extracting frames...
✅ Frame extraction complete. Video divided into 10 chunks.
>>> Starting concurrent processing of 10 video chunks (Concurrency: 4)...


Processing Chunks: 100%|██████████| 10/10 [01:09<00:00,  6.95s/it]



>>> Aggregating content and generating final code...
>>> The model is now generating code:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>3D Flipping Cards</title>
    <style>
        body {
            min-height: 100vh;
            margin: 0;
            padding: 20px;
            display: flex;
            justify-content: center;
            align-items: center;
            flex-wrap: wrap;
            gap: 20px;
            background-color: #1db7c2;
            box-sizing: border-box;
        }

        .card {
            position: relative;
            width: 100px;
            height: 100px;
            perspective: 1000px;
            margin: 8px 2px;
        }

        .cover, .back {
            position: absolute;
            width: 100%;
            height: 100%;
            display: flex;
            justify-content: center;
            align-items: center;
       

### 7. Final Result Demonstration

In [5]:
from IPython.display import Video

# Embed and play local video files
Video("video2code_demo.mp4", width=800)