In [4]:
%pip install pypdf

Collecting pypdf
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.1.0


I first tried the new model, but processing a single PDF took over 15 minutes, so I applied the following changes suggested by Claude:

Limited text processing to the first 5000 words
Limited summarization to the first 3 chunks only (I wonder if this is too few)
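
To put a number on the "too few" worry, here is a rough back-of-the-envelope sketch. It relies on the fact that `chunk_text` in the next cell measures `max_chunk_length=1024` with `len()`, so in characters, not tokens. The helper name `estimated_words_covered` and the figure of about 6 characters per word (5 letters plus a space) are my own assumptions for illustration; the word counts are taken from the "Text extracted" lines in the run log.

```python
def estimated_words_covered(num_chunks=3, max_chunk_chars=1024, avg_chars_per_word=6.0):
    """Rough upper bound on how many words reach the summarizer.

    avg_chars_per_word is an assumed average (5 letters + 1 space),
    not a measured value.
    """
    return int(num_chunks * max_chunk_chars / avg_chars_per_word)


covered = estimated_words_covered()
# Word counts reported for three of the papers in the run log
for total_words in (5705, 12256, 17152):
    share = covered / total_words
    print(f"{total_words} words extracted, about {covered} summarized ({share:.1%})")
```

Under these assumptions only a few hundred words per paper are summarized, so the 5000-word cap is never the binding limit; the 3-chunk cap is. That would also explain why most summaries read like a restatement of the abstract and the start of the introduction.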

In [7]:
# With only 3 chunks, it took 17 minutes for 19 documents on Colab Pro

import os
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen, urlretrieve
from IPython.display import Markdown, display
from pypdf import PdfReader
from datetime import date
from tqdm import tqdm
from transformers import pipeline
import torch

# Initialize the BART-CNN summarizer
summarizer = pipeline("summarization",
                     model="facebook/bart-large-cnn",
                     device=0 if torch.cuda.is_available() else -1)  # GPU 0 if available, otherwise CPU

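def extract_pdf(url):
    """Download a PDF and return the text of all its pages"""
    # The download is saved to a fixed local file that is overwritten for each paper
    urlretrieve(url, "pdf_file.pdf")
    reader = PdfReader("pdf_file.pdf")
    text = ""
    for page in reader.pages:
        text += (page.extract_text() or "") + "\n"
    return text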

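def chunk_text(text, max_chunk_length=1024):
    """Split text into chunks of at most max_chunk_length characters.

    The limit is counted in characters, while BART's input limit is
    1024 tokens, so each chunk is well below what the model can accept.
    """
    # Naive split on '.', which also breaks decimals such as "2.5" into "2. 5"
    sentences = text.split('.')
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty pieces, e.g. after the final period
        sentence += '.'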
        if current_length + len(sentence) <= max_chunk_length:
            current_chunk.append(sentence)
            current_length += len(sentence)
        else:
            if current_chunk:
                chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_length = len(sentence)

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

def summarize_text(text, max_summary_length=150):
    """Summarize text using BART-CNN with optimized chunking"""
    # Clean and prepare text
    text = ' '.join(text.split())  # Remove excessive whitespace

    # Only process first 5000 words to avoid excessive processing
    text = ' '.join(text.split()[:5000])

    # Split into smaller chunks
    chunks = chunk_text(text, max_chunk_length=1024)

    # Process only first 3 chunks to keep processing time reasonable
    chunks = chunks[:3]

    summaries = []
    for chunk in chunks:
        try:
            summary = summarizer(chunk,
                               max_length=max_summary_length,
                               min_length=30,
                               do_sample=False,
                               truncation=True)
            summaries.append(summary[0]['summary_text'])
        except Exception as e:
            print(f"Chunk summarization failed: {e}")
            continue

    # Combine and summarize again if needed
    final_summary = ' '.join(summaries)
    if len(final_summary.split()) > max_summary_length:
        try:
            final_summary = summarizer(final_summary,
                                     max_length=max_summary_length,
                                     min_length=30,
                                     do_sample=False)[0]['summary_text']
        except Exception as e:
            print(f"Final summarization failed: {e}")

    return final_summary

# Test with a single paper first
def process_single_paper(paper):
    try:
        print(f"Processing: {paper['title']}")
        text = extract_pdf(paper['url'])
        print(f"Text extracted, length: {len(text.split())} words")
        summary = summarize_text(text)
        print(f"Summary generated, length: {len(summary.split())} words")
        return summary
    except Exception as e:
        print(f"Error processing paper: {e}")
        return "Processing failed"

# Process papers with progress bar
for paper in tqdm(papers):
    paper["summary"] = process_single_paper(paper)
    # Print immediate results
    print(f"\nTitle: {paper['title']}\nSummary: {paper['summary']}\n{'='*50}\n")

Device set to use cpu
  0%|          | 0/19 [00:00<?, ?it/s]

Processing: 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Text extracted, length: 12256 words


  5%|▌         | 1/19 [00:52<15:52, 52.93s/it]

Summary generated, length: 132 words

Title: 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Summary: Interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally like humans. Such exist- ing datasets are crawled from webpage, facing challenges like low knowledge density and loose image-text relations. On the other hand, the internet hosts vast instructional videos that are widely used by humans to learn foundational subjects. Our textbook collects over 2. 5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to system- atically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and tex- tual knowledge (OCR) from the videos. Vision-Language Models (VLMs) deliver exceptional per- formance across a variety of visual tasks, including image captioning, dialogue, and visual question answering. These advancements can be primarily attri

 11%|█         | 2/19 [01:31<12:35, 44.41s/it]

Summary generated, length: 93 words

Title: VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
Summary: VideoAnydoor preserves the fine-grained object details and enables users to control the motion with boxes or point trajectories. Users could further add multiple objects iteratively or swap objects in the same video. VideoAnydoor is a zero-shot video object insertion frame- work. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net. VideoAnydoor demonstrates significant superiority over ex- isting methods. It naturally supports various downstream applications without task-specific fine-tuning. This ability has broad potential for real-world applications, like video composition, video virtual try-on, video face changing.

Processing: CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Text extracted, length: 8689 words


 16%|█▌        | 3/19 [02:18<12:13, 45.84s/it]

Summary generated, length: 110 words

Title: CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Summary: CodeElo is a benchmarking tool for large language models (LLMs) There is a growing need to develop more challenging and comprehensive benchmarks. Existing benchmarks fall short due to the unavailability of private test cases. CODE ELO benchmark is mainly based on the official CodeForces1 platform. It tries to align with the platform as much as possible. We provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time.  o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems. Detailed analysis experiments are also conducted to provide insights into performance across algorithms.

Processing: VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Text extracted, leng

 21%|██        | 4/19 [03:41<15:05, 60.39s/it]

Summary generated, length: 31 words

Title: VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Summary: Video Large Language Models (Video LLMs) struggle with capturing fine-grained spa- tial and temporal details. We introduce the VideoRefer Suite to em- power Video LLM for finer-level spatial-temporal video un- derstanding.

Processing: Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Text extracted, length: 7176 words


 26%|██▋       | 5/19 [04:33<13:21, 57.25s/it]

Summary generated, length: 128 words

Title: Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Summary: Latent diffusion models with Transformer architectures excel at generating high-fidelity images. Increasing the per-token feature dimension in visual tokenizers improves reconstruction quality. However, it re- quires substantially larger diffusion models and more train- ing iterations to achieve comparable generation. VA-VAE(Vision foundation model Aligned Variational AutoEncoder) significantly ex- pands the reconstruction-generation frontier of latent dif- fusion models. It enables faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. LightningDiT achieves state-of-the-art performance on Ima- geNet 256×256 generation. The latent diffusion model [33] utilizes a continuous-valued variational autoencoder (V AE) [17], or visual tokenizer. Increasing the dimension of the vi- sual tokenizer enhances detail reconstruction 

 32%|███▏      | 6/19 [05:21<11:43, 54.15s/it]

Summary generated, length: 127 words

Title: LTX-Video: Realtime Video Latent Diffusion
Summary: LTX-Video is a transformer-based latent diffusion model that adopts a holistic approach to video generation. It seamlessly integrates the responsibilities of the Video-V AE and the denoising transformer. It achieves a high compression ratio of 1:192. The V AE decoder is tasked with both latent-to-pixel conversion and the final denoising step. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768×512 resolution in just 2 seconds on an Nvidia H100 GPU. The source code and pre-trained models are publicly available2, setting a new benchmark for accessible and scalable video generation. The rise of text-to-video models such as Sora and MovieGen has demonstrated the effectiveness of spatiotemporal transformers with self- attention and a global receptive field. However, extending this approach to video presents significant challenges.

Processing: ProgCo: Program

Your max_length is set to 150, but your input_length is only 141. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=70)
 37%|███▋      | 7/19 [06:01<09:56, 49.69s/it]

Summary generated, length: 95 words

Title: ProgCo: Program Helps Self-Correction of Large Language Models
Summary: Program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-executing verification pseudo-programs. ProgCo: Program Helps Self-Correction of Large Language Models. ProgCo achieves effective self-correction, and can be further enhance performance when combined with real program tools. Then, program-driven refinement (ProgRe) re- ceives feedback from ProgVe, conducts dual re- flection and refinement on both responses and verification programs. Self-correction is an expected capability of LLMs, wherein the LLM first needs to reflect on its initial output, iden- tify potential issues and generate feedback. However, studies have shown that current LLMs severely lack this capability.

Processing: MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
Text extracted, length: 17152 words


 42%|████▏     | 8/19 [06:47<08:53, 48.49s/it]

Summary generated, length: 117 words

Title: MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
Summary: Recent advancements in foundation models have enhanced AI systems’ capabili- ties in autonomous tool usage and reasoning. The study was conducted by the Bangladesh University of Engineering and Technology, Monash University, and the Qatar Computing Research Institute. M APEVAL is a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries. Using M APEVAL, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3. 5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive per- formance overall. All models still fall short of human performance by more than 20% on average, struggling with complex map images.

Processing: A3: Android Agent Arena for

 47%|████▋     | 9/19 [07:55<09:05, 54.52s/it]

Summary generated, length: 140 words

Title: A3: Android Agent Arena for Mobile GUI Agents
Summary:  abstract AI agents have become increasingly prevalent in recent years. Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. Many existing datasets focus on static frame evaluations and fail to provide a comprehen- sive platform for assessing performance. A3 is an open-source system for developing AI agents. It includes 21 widely used general third-party apps and 201 tasks representative of common user scenar- ios. The project is available at https://yuxiangchai. io/Android-Agent-Arena/. Existing mobile AI assistants such as Siri, Xiao AI, and Bixby have demonstrated the potential of mobile agents to facilitate interactions between hu- man users and mobile devices. But those assistants are only effective in managing the rou- tine tasks such as reporting weather condition and performing web searches due to the nature that they use APIs to

Your max_length is set to 150, but your input_length is only 139. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=69)
 53%|█████▎    | 10/19 [09:00<08:39, 57.69s/it]

Summary generated, length: 108 words

Title: MLLM-as-a-Judge for Image Safety without Human Labeling
Summary: Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets. The research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research showed that simply querying pre-trained MLLM does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safetyrules. MLLM-based method includes objectifying safety rules, assessing the relevance between rules and images, and making quick judgments. Experiment results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.

Processing: Unifying Specialized Visual Encoders for Video Language Models
Text extracted, length: 

 58%|█████▊    | 11/19 [10:15<08:23, 62.90s/it]

Summary generated, length: 137 words

Title: Unifying Specialized Visual Encoders for Video Language Models
Summary: Unifying Specialized Visual Encoders for Video Language Models. MERV, Multi-Encoder Representation of Videos, leverages multiple frozen visual encoders to create a unified representation of video. MERV is up to 3. 7% better in accuracy than Video-LLaV A across the standard suite video understanding benchmarks. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder methods. Video Large Language Models (VideoLLMs) connect pretrained vision encoders to LLMs by training a modality bridge from the vision space to the language space. Their vision-language pretraining naturally lends itself as a bridge between the vision input and the LLM. Most multimodal LLMs, such as LLaV A for images and Video-LLaV A (Lin et al. , 2023a) for videos, opt for contrast

 63%|██████▎   | 12/19 [10:58<06:38, 56.94s/it]

Summary generated, length: 122 words

Title: Dynamic Scaling of Unit Tests for Code Reward Modeling
Summary: Large language models (LLMs) often struggle to produce accurate solutions on the first attempt for code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. CodeRM- 8B is a lightweight yet effective unit test genera- tor that enables efficient and high-quality unit test scaling. We implement a dy- namic scaling mechanism that adapts the num- ber of unit tests based on problem difficulty, further improving efficiency. Code generation aims to automatically produce code solutions that satisfy programming require- ments. Recent advancements in large lan- guage models (LLMs) have shown significant progress in this domain. However, generating correct code. code on the first attempt remains challenging due to the inherent complexity of reasoning required.

Processing: SeedVR: Seeding Infinity in 

 68%|██████▊   | 13/19 [11:35<05:05, 50.85s/it]

Summary generated, length: 87 words

Title: SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
Summary: SeedVR is over2× faster than existing diffusion-based video restoration approaches. With delicate designs, SeedVR is as efficient as the Stable Diffusion Upscaler [2], even with five times the parameter count. SeedVR is a diffusion transformer designed to handle real-world video restoration. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR’s superiority over existing methods for generic video restoration.

Processing: MapQaTor: A System for Efficient Annotation of Map Query Datasets
Text extracted, length: 5705 words


 74%|███████▎  | 14/19 [12:20<04:05, 49.03s/it]

Summary generated, length: 122 words

Title: MapQaTor: A System for Efficient Annotation of Map Query Datasets
Summary:  MAPQATOR is a system for Efficient Annotation of Map Query Datasets. It stream- lines the creation of reproducible, traceable map-based QA datasets. MAPQA- TOR centralizes data retrieval, annotation, and visualization within a single platform. With its plug-and- play architecture, MAPQATOR enables seam- less integration with any maps API. By caching API re- sponses, the platform ensures consistent ground truth. In recent years, mapping and navigation services have transformed the way individuals access and interact with location-based information. Platforms such as Google Maps 1 and Apple Maps 2 have become essential tools. However, while these ser- vices offer extensive geospatial data, they often struggle with understanding and processing natural language queries. This limitation hampers their effectiveness for users seeking to obtain specific information.

Process

 79%|███████▉  | 15/19 [13:05<03:11, 47.79s/it]

Summary generated, length: 128 words

Title: Nested Attention: Semantic-aware Attention Values for Concept Personalization
Summary: Nested attention mechanism attaches a localized, expressive representation of a subject to a single text token. This approach improves identity preservation while maintaining the model’s prior, and can combine multiple personalized concepts in a single image. Nested Attention is a novel mechanism that injects a rich and expressive image representation into the model’s existing cross-attention layers. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while ad- hering to text prompts. Personalization of text-to-image models en- ables users to generate captivating images featuring their own personal data. A key challenge in personalizing text to image models is balancing identity preservation and prompt alignment. Most encoder-based works tackle personalization by encoding the s

 84%|████████▍ | 16/19 [14:28<02:55, 58.46s/it]

Summary generated, length: 34 words

Title: Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing
Summary: State Space Models (SSMs) have emerged as a compelling alternative to transformers. SSMs operate in two modes: convolution and recurrence. During convolutional mode, SSMs assume visibility of the entire sequence and utilize hardware-optimized convolutions.

Processing: SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
Text extracted, length: 10836 words


 89%|████████▉ | 17/19 [15:18<01:52, 56.08s/it]

Summary generated, length: 112 words

Title: SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
Summary: SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization. Human action understanding is crucial for the advancement of multimodal systems. In this work, we address the more challenging task of Fine-grained Action Recognition. Given the high costs of annotating fine- grained labels and the substantial data needed for fine-tuning LLMs, we propose to adopt semi-supervised learning (SSL) SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and FineDiving, across various data scopes. It also outperforms other semi- supervised methods on two classical coarse-grained datasets, UCF101 and HMDB51. The features extracted by our SeFAR could largely promote the ability of multimodal foundation models to un- derstand fine- grained and domain-specific semantics.

Proce

 95%|█████████▍| 18/19 [16:02<00:52, 52.38s/it]

Summary generated, length: 110 words

Title: Population Aware Diffusion for Time Series Generation
Summary: Population Aware Diffusion for Time Series Generation Yang Li, Han Meng, Zhenyu Bi, Ingolv Urnes, Haipeng Chen. Diffusion models have shown promising ability in generat- ing high-quality time series (TS) data. Population-aware Diffusion for Time Series (PaD-TS) is a new TS generation model that better pre- serves the population-level properties. It can improve the average CC distribution shift score between real and synthetic data by 5.9x while maintaining performance comparable to state- of-the-art models. Time series data exists in a broad spectrum of real-world domains. TS models have been used in these domains for effective data analy- sis and prediction tasks. Developing such models requires rich and high-quality TS datasets.

Processing: Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding
Text extracted, length: 13773 words


100%|██████████| 19/19 [16:54<00:00, 53.39s/it]

Summary generated, length: 140 words

Title: Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding
Summary: Transformers rely on both content-based and position-based addressing mecha- nisms to make predictions. Many current methods en- force rigid patterns in attention maps, limiting the ability to model long-range dependencies. Most positional encod- ings are learned as general biases, lacking the specialization required for different instances within a dataset. Textualized equiv- ariAnt Position Embedding (TAPE) is a novel framework that enhances positional embeddings. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of tra- ditional fixed patterns. Ex- tensive experiments shows that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks. Large transformer models have become dominant in natural language understanding, language generation, and complex r
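
Two artifacts recur in the summaries above: words broken by a hyphen ("exist- ing", "per- formance") and decimals broken by a space ("2. 5 years", "Claude-3. 5-Sonnet"). The second comes from `chunk_text` splitting on every period. The first most likely comes from line-end hyphens in the PDF text, but that is an assumption on my part. Below is a minimal sketch of two cleanup helpers using only the standard `re` module. The names `dehyphenate` and `chunk_by_sentence` are hypothetical and not part of the notebook, and I have not re-run the 19 papers with them.

```python
import re


def dehyphenate(raw_text):
    """Rejoin words that were split by a hyphen at the end of a line."""
    # Assumption: the extractor emits the hyphen followed by a newline.
    # Caveat: genuine compounds broken at the hyphen (such as "fine-" + "grained")
    # are merged as well, so treat this as a heuristic.
    return re.sub(r'(?<=[a-z])-\s*\n\s*(?=[a-z])', '', raw_text)


def chunk_by_sentence(text, max_chunk_length=1024):
    """Greedy chunking on sentence boundaries, measured in characters."""
    # Split only where sentence punctuation is followed by whitespace,
    # so decimals such as "2.5" stay intact.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, current_length = [], [], 0
    for sentence in sentences:
        # +1 accounts for the space that joins two sentences
        extra = len(sentence) + (1 if current else 0)
        if current and current_length + extra > max_chunk_length:
            chunks.append(' '.join(current))
            current, current_length = [sentence], len(sentence)
        else:
            current.append(sentence)
            current_length += extra
    if current:
        chunks.append(' '.join(current))
    return chunks


# Usage: dehyphenate the raw text first, then normalize whitespace, then chunk
raw = ("Our textbook collects over 2.5 years of instruc-\ntional videos. "
       "Such exist-\ning datasets are crawled from webpages.")
clean = ' '.join(dehyphenate(raw).split())
print(chunk_by_sentence(clean, max_chunk_length=80))
```

The order matters: `dehyphenate` has to run before the `' '.join(text.split())` normalization, because that step turns the newline into a plain space and the line break can no longer be told apart from an ordinary space. Abbreviations followed by a space, such as "et al. ", will still be treated as sentence ends.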




In [8]:
# Generate markdown output
output = "# Paper Summaries\n\n"
for paper in papers:
    output += f"## {paper['title']}\n\n{paper['summary']}\n\n---\n\n"
printmd(output)

# Paper Summaries

## 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally like humans. Such exist- ing datasets are crawled from webpage, facing challenges like low knowledge density and loose image-text relations. On the other hand, the internet hosts vast instructional videos that are widely used by humans to learn foundational subjects. Our textbook collects over 2. 5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to system- atically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and tex- tual knowledge (OCR) from the videos. Vision-Language Models (VLMs) deliver exceptional per- formance across a variety of visual tasks, including image captioning, dialogue, and visual question answering. These advancements can be primarily attributed to the swift improvements of large language models (LLMs)

---

## VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

VideoAnydoor preserves the fine-grained object details and enables users to control the motion with boxes or point trajectories. Users could further add multiple objects iteratively or swap objects in the same video. VideoAnydoor is a zero-shot video object insertion frame- work. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net. VideoAnydoor demonstrates significant superiority over ex- isting methods. It naturally supports various downstream applications without task-specific fine-tuning. This ability has broad potential for real-world applications, like video composition, video virtual try-on, video face changing.

---

## CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

CodeElo is a benchmarking tool for large language models (LLMs) There is a growing need to develop more challenging and comprehensive benchmarks. Existing benchmarks fall short due to the unavailability of private test cases. CODE ELO benchmark is mainly based on the official CodeForces1 platform. It tries to align with the platform as much as possible. We provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time.  o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems. Detailed analysis experiments are also conducted to provide insights into performance across algorithms.

---

## VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Video Large Language Models (Video LLMs) struggle with capturing fine-grained spa- tial and temporal details. We introduce the VideoRefer Suite to em- power Video LLM for finer-level spatial-temporal video un- derstanding.

---

## Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Latent diffusion models with Transformer architectures excel at generating high-fidelity images. Increasing the per-token feature dimension in visual tokenizers improves reconstruction quality. However, it re- quires substantially larger diffusion models and more train- ing iterations to achieve comparable generation. VA-VAE(Vision foundation model Aligned Variational AutoEncoder) significantly ex- pands the reconstruction-generation frontier of latent dif- fusion models. It enables faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. LightningDiT achieves state-of-the-art performance on Ima- geNet 256×256 generation. The latent diffusion model [33] utilizes a continuous-valued variational autoencoder (V AE) [17], or visual tokenizer. Increasing the dimension of the vi- sual tokenizer enhances detail reconstruction but significantly re- duces generation quality. All results are evaluated on ImageNet 256 ×256 dataset with a fixed compute budget during diffusion model training.

---

## LTX-Video: Realtime Video Latent Diffusion

LTX-Video is a transformer-based latent diffusion model that adopts a holistic approach to video generation. It seamlessly integrates the responsibilities of the Video-V AE and the denoising transformer. It achieves a high compression ratio of 1:192. The V AE decoder is tasked with both latent-to-pixel conversion and the final denoising step. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768×512 resolution in just 2 seconds on an Nvidia H100 GPU. The source code and pre-trained models are publicly available2, setting a new benchmark for accessible and scalable video generation. The rise of text-to-video models such as Sora and MovieGen has demonstrated the effectiveness of spatiotemporal transformers with self- attention and a global receptive field. However, extending this approach to video presents significant challenges.

---

## ProgCo: Program Helps Self-Correction of Large Language Models

Program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-executing verification pseudo-programs. ProgCo: Program Helps Self-Correction of Large Language Models. ProgCo achieves effective self-correction, and can be further enhance performance when combined with real program tools. Then, program-driven refinement (ProgRe) re- ceives feedback from ProgVe, conducts dual re- flection and refinement on both responses and verification programs. Self-correction is an expected capability of LLMs, wherein the LLM first needs to reflect on its initial output, iden- tify potential issues and generate feedback. However, studies have shown that current LLMs severely lack this capability.

---

## MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

Recent advancements in foundation models have enhanced AI systems’ capabili- ties in autonomous tool usage and reasoning. The study was conducted by the Bangladesh University of Engineering and Technology, Monash University, and the Qatar Computing Research Institute. M APEVAL is a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries. Using M APEVAL, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3. 5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive per- formance overall. All models still fall short of human performance by more than 20% on average, struggling with complex map images.

---

## A3: Android Agent Arena for Mobile GUI Agents

 abstract AI agents have become increasingly prevalent in recent years. Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. Many existing datasets focus on static frame evaluations and fail to provide a comprehen- sive platform for assessing performance. A3 is an open-source system for developing AI agents. It includes 21 widely used general third-party apps and 201 tasks representative of common user scenar- ios. The project is available at https://yuxiangchai. io/Android-Agent-Arena/. Existing mobile AI assistants such as Siri, Xiao AI, and Bixby have demonstrated the potential of mobile agents to facilitate interactions between hu- man users and mobile devices. But those assistants are only effective in managing the rou- tine tasks such as reporting weather condition and performing web searches due to the nature that they use APIs to perform task automation.

---

## MLLM-as-a-Judge for Image Safety without Human Labeling

Pre-trained Multimodal Large Language Models (MLLMs) offer potential for image safety judgment, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets. The research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research showed that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules. The proposed MLLM-based method includes objectifying safety rules, assessing the relevance between rules and images, and making quick judgments. Experimental results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.

---

## Unifying Specialized Visual Encoders for Video Language Models

MERV, Multi-Encoder Representation of Videos, leverages multiple frozen visual encoders to create a unified representation of video. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite of video understanding benchmarks. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder methods. Video Large Language Models (VideoLLMs) connect pretrained vision encoders to LLMs by training a modality bridge from the vision space to the language space. Their vision-language pretraining naturally lends itself as a bridge between the vision input and the LLM. Most multimodal LLMs, such as LLaVA for images and Video-LLaVA (Lin et al., 2023a) for videos, opt for a contrastively pretrained encoder like CLIP.

---

## Dynamic Scaling of Unit Tests for Code Reward Modeling

Large language models (LLMs) often struggle to produce accurate solutions on the first attempt for code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. CodeRM-8B is a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. We implement a dynamic scaling mechanism that adapts the number of unit tests based on problem difficulty, further improving efficiency. Code generation aims to automatically produce code solutions that satisfy programming requirements. Recent advancements in large language models (LLMs) have shown significant progress in this domain. However, generating correct code on the first attempt remains challenging due to the inherent complexity of reasoning required.
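
The idea of validating multiple candidate solutions with unit tests can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: candidates are plain Python callables, unit tests are `(args, expected)` pairs, and `num_tests_for` is a hypothetical linear rule for spending more tests on harder problems. It is not the mechanism used by CodeRM-8B, which the summary does not describe in detail.

```python
def count_passed(solution, unit_tests):
    """Count how many (args, expected) unit tests a candidate passes."""
    passed = 0
    for args, expected in unit_tests:
        try:
            if solution(*args) == expected:
                passed += 1
        except Exception:
            # A crashing candidate simply fails that test.
            pass
    return passed


def select_best(candidates, unit_tests):
    """Return the candidate that passes the most unit tests (first one wins ties)."""
    return max(candidates, key=lambda s: count_passed(s, unit_tests))


def num_tests_for(difficulty, min_tests=2, max_tests=10):
    """Hypothetical rule: scale the test budget linearly with difficulty in [0, 1]."""
    difficulty = min(max(difficulty, 0.0), 1.0)
    return min_tests + round(difficulty * (max_tests - min_tests))


# Two candidate implementations of "add two numbers"; only one is correct.
def add_wrong(a, b):
    return a - b


def add_right(a, b):
    return a + b


tests = [((1, 2), 3), ((0, 0), 0), ((2, 2), 4)]
best = select_best([add_wrong, add_right], tests)
```

In this toy setup `add_wrong` passes only the `(0, 0)` test, so `select_best` picks `add_right`. A harder problem would get a larger budget from `num_tests_for`, which makes it less likely that a wrong candidate passes every test by chance.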

---

## SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration

SeedVR is over 2× faster than existing diffusion-based video restoration approaches. With delicate designs, SeedVR is as efficient as the Stable Diffusion Upscaler [2], even with five times the parameter count. SeedVR is a diffusion transformer designed to handle real-world video restoration. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR achieves highly competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR’s superiority over existing methods for generic video restoration.

---

## MapQaTor: A System for Efficient Annotation of Map Query Datasets

MapQaTor is a system for efficient annotation of map query datasets. It streamlines the creation of reproducible, traceable map-based QA datasets. MapQaTor centralizes data retrieval, annotation, and visualization within a single platform. With its plug-and-play architecture, MapQaTor enables seamless integration with any maps API. By caching API responses, the platform ensures consistent ground truth. In recent years, mapping and navigation services have transformed the way individuals access and interact with location-based information. Platforms such as Google Maps and Apple Maps have become essential tools. However, while these services offer extensive geospatial data, they often struggle with understanding and processing natural language queries. This limitation hampers their effectiveness for users seeking to obtain specific information.

---

## Nested Attention: Semantic-aware Attention Values for Concept Personalization

The nested attention mechanism attaches a localized, expressive representation of a subject to a single text token. This approach improves identity preservation while maintaining the model’s prior, and can combine multiple personalized concepts in a single image. Nested Attention is a novel mechanism that injects a rich and expressive image representation into the model’s existing cross-attention layers. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to text prompts. Personalization of text-to-image models enables users to generate captivating images featuring their own personal data. A key challenge in personalizing text-to-image models is balancing identity preservation and prompt alignment. Most encoder-based works tackle personalization by encoding the subject into a large number of visual tokens.

---

## Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing

State Space Models (SSMs) have emerged as a compelling alternative to transformers. SSMs operate in two modes: convolution and recurrence. During convolutional mode, SSMs assume visibility of the entire sequence and utilize hardware-optimized convolutions.
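
As a minimal sketch of the two modes, assume a scalar, time-invariant SSM with state update `h_t = a * h_{t-1} + b * x_t` and output `y_t = c * h_t`. The parameter names `a`, `b`, `c` and both function names are illustrative and not taken from the paper. The convolution below is a naive direct sum rather than the hardware-optimized kernels the summary mentions.

```python
def ssm_recurrent(x, a, b, c):
    """Recurrent mode: carry a hidden state forward one step at a time."""
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def ssm_convolutional(x, a, b, c):
    """Convolutional mode: the whole sequence is visible up front.

    Unrolling the recurrence gives y_t = sum_k (c * a**k * b) * x_{t-k},
    i.e. a causal convolution of x with the kernel K_k = c * a**k * b.
    """
    kernel = [c * (a ** k) * b for k in range(len(x))]
    return [
        sum(kernel[k] * x[t - k] for k in range(t + 1))
        for t in range(len(x))
    ]


x = [1.0, 2.0, 3.0]
y_rec = ssm_recurrent(x, a=0.5, b=1.0, c=2.0)
y_conv = ssm_convolutional(x, a=0.5, b=1.0, c=2.0)
```

Both functions return the same outputs for the same input, which is why such a model can be trained in the parallel convolutional form and run step by step in the recurrent form. The kernel decays as `a**k` when `|a| < 1`, so recent inputs carry more weight than distant ones.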

---

## SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization

Human action understanding is crucial for the advancement of multimodal systems. In this work, we address the more challenging task of Fine-grained Action Recognition (FAR). Given the high costs of annotating fine-grained labels and the substantial data needed for fine-tuning LLMs, we propose to adopt semi-supervised learning (SSL). SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and FineDiving, across various data scopes. It also outperforms other semi-supervised methods on two classical coarse-grained datasets, UCF101 and HMDB51. The features extracted by SeFAR could largely promote the ability of multimodal foundation models to understand fine-grained and domain-specific semantics.

---

## Population Aware Diffusion for Time Series Generation

By Yang Li, Han Meng, Zhenyu Bi, Ingolv Urnes, and Haipeng Chen. Diffusion models have shown promising ability in generating high-quality time series (TS) data. Population-aware Diffusion for Time Series (PaD-TS) is a new TS generation model that better preserves the population-level properties. It can improve the average CC distribution shift score between real and synthetic data by 5.9x while maintaining performance comparable to state-of-the-art models. Time series data exists in a broad spectrum of real-world domains. TS models have been used in these domains for effective data analysis and prediction tasks. Developing such models requires rich and high-quality TS datasets.

---

## Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding

Transformers rely on both content-based and position-based addressing mechanisms to make predictions. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies. Most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. conTextualized equivariAnt Position Embedding (TAPE) is a novel framework that enhances positional embeddings. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks. Large transformer models have become dominant in natural language understanding, language generation, and complex reasoning. Due to the softmax function, attention often generates a sparse mask, extracting a limited subset of tokens for interaction. Through this interpretation, attention can be understood as an addressing mechanism.
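
The view of attention as an addressing mechanism can be sketched with a toy example. This is standard softmax attention with an additive per-position bias, not TAPE itself. The function names and the `position_bias` argument are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, position_bias):
    """Combine content-based scores (dot products) with position-based biases."""
    scores = [
        sum(q * k for q, k in zip(query, key)) + bias
        for key, bias in zip(keys, position_bias)
    ]
    return softmax(scores)


query = [1.0, 0.0]
keys = [[5.0, 0.0], [0.0, 5.0], [1.0, 0.0]]

# Content-only addressing: the best-matching key takes almost all the weight.
content_only = attention_weights(query, keys, position_bias=[0.0, 0.0, 0.0])

# A fixed positional bias can redirect the address to a different token.
with_position = attention_weights(query, keys, position_bias=[0.0, 0.0, 6.0])
```

With no bias, nearly all of the weight lands on the first key, which shows the sparse, address-like behavior the summary describes. Adding a large bias at the third position moves most of the weight there regardless of content, which is the kind of fixed positional pattern that context-aware encodings aim to relax.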

---

