# In-Context Learning


In-context learning generalises few-shot learning: the LLM is given a context as part of the prompt and asked to respond using the information in that context.

* Example: *"Summarize this research article into one paragraph highlighting its strengths and weaknesses: [insert article text]"*
* Example: *"Extract all the quotes from this text and organize them in alphabetical order: [insert text]"*
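
A minimal sketch of how such a prompt can be assembled: the instruction and the context are joined into a single string before it is sent to the model. The helper name `build_prompt` and the delimiter lines are illustrative assumptions, not part of any library.

```python
def build_prompt(instruction: str, context: str) -> str:
    """Combine an instruction and a context document into a single prompt."""
    # The delimiters (an arbitrary choice) mark where the context starts and ends.
    return (
        f"{instruction.strip()}\n\n"
        "--- CONTEXT START ---\n"
        f"{context.strip()}\n"
        "--- CONTEXT END ---"
    )


prompt = build_prompt(
    "Extract all the quotes from this text and organize them in alphabetical order:",
    'She said "hello" and he replied "goodbye".',
)
print(prompt)
```

The resulting string is what would be passed to the model in place of the `[insert text]` placeholder above.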

Retrieval-Augmented Generation (RAG), a very popular technique that you will learn in week 5, is a form of in-context learning, where:
* a search engine is used to retrieve some relevant information
* that information is then provided to the LLM as context
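
The two steps above can be sketched with a toy retriever. This is only an illustration under simplifying assumptions: the "search engine" here is a word-overlap ranking over an in-memory list, and the names `retrieve` and `build_rag_prompt` are hypothetical. Real RAG systems typically use a proper search index or embedding-based retrieval.

```python
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query (toy retriever)."""
    query_words = set(query.lower().split())

    def overlap(doc: str) -> int:
        return len(query_words & set(doc.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:k]


def build_rag_prompt(query: str, documents: list[str], k: int = 1) -> str:
    """Retrieve context for the query and place it in the prompt ahead of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "pypdf is a Python library for reading PDF files",
    "tqdm draws progress bars for Python loops",
]
rag_prompt = build_rag_prompt("which library draws progress bars", docs)
print(rag_prompt)
```

Only the best-matching document ends up in the prompt, so the model sees the relevant context rather than the whole collection.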


In this example we download some recent research papers from arXiv, extract the text from the PDF files, and ask Gemini to summarize each paper as a table of its main strengths and weaknesses. Finally, we write the results to a local HTML file and display them in the notebook as Markdown.

In [1]:
import os
import requests
from bs4 import BeautifulSoup
import google.generativeai as genai
from urllib.request import urlopen, urlretrieve
from IPython.display import Markdown, display
from pypdf import PdfReader
from datetime import date
from tqdm import tqdm
import json

In [2]:
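# Load the Gemini API key from apikeys.json, two directories above the notebook
path = os.path.abspath("")
with open(os.path.join(path, "..", "..", "apikeys.json")) as f:
    API_KEY = json.load(f)["Gemini"]
genai.configure(api_key=API_KEY)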

We select the papers currently featured on the Hugging Face papers page.

In [3]:
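BASE_URL = "https://huggingface.co/papers"
page = requests.get(BASE_URL)
page.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(page.content, "html.parser")
h3s = soup.find_all("h3")

papers = []

for h3 in h3s:
    a = h3.find("a")
    if a is None or not a.get("href"):
        continue  # skip headings that do not link to a paper
    title = " ".join(a.text.split())  # collapse line breaks and repeated spaces
    link = a["href"].replace("/papers", "")  # "/papers/<id>" -> "/<id>"

    papers.append({"title": title, "url": f"https://arxiv.org/pdf{link}"})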
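
The URL rewriting in the loop above assumes that each link has the form `/papers/<arXiv id>`, which is consistent with the URLs shown in the output further down. A small sketch of that mapping as a standalone function (the name `arxiv_pdf_url` is hypothetical):

```python
def arxiv_pdf_url(href: str) -> str:
    """Map a Hugging Face papers link such as '/papers/2511.08521' to its arXiv PDF URL."""
    # Assumes the last path segment of the link is the arXiv identifier.
    arxiv_id = href.rstrip("/").split("/")[-1]
    return f"https://arxiv.org/pdf/{arxiv_id}"


print(arxiv_pdf_url("/papers/2511.08521"))
```

Taking the last path segment, rather than replacing a substring, keeps the mapping from breaking if the link gains a trailing slash.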

Helper functions to extract text from web pages and PDFs, and to display Markdown in the notebook.

In [5]:
def extract_paper(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # split each line on double spaces so that every phrase gets its own line
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text


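def extract_pdf(url):
    # download the PDF to a local file, then read it page by page
    urlretrieve(url, "pdf_file.pdf")
    reader = PdfReader("pdf_file.pdf")
    text = ""
    for page in reader.pages:
        # guard against pages for which no text is returned
        text += (page.extract_text() or "") + "\n"
    return text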


def printmd(string):
    display(Markdown(string))
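
To make the whitespace clean-up in `extract_paper` concrete, the sketch below applies the same three steps (strip each line, split on double spaces, drop empty chunks) to a small string. The function name `normalize_text` and the sample text are illustrative only.

```python
def normalize_text(text: str) -> str:
    """Strip each line, split on double spaces, and drop empty chunks."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


sample = "  Title  Subtitle \n\n   Body text here.  \n"
print(normalize_text(sample))
```

Each phrase ends up on its own line with no blank lines in between, which gives the model a more compact context.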

In [6]:
LLM = "gemini-2.5-flash"
model = genai.GenerativeModel(LLM)

We use Gemini to summarize the papers.

In [15]:
PROMPT = (
    "Make a table with the strengths and weaknesses of this paper. "
    "The table must be in html format. Return only the html code.\n\n"
)

for paper in tqdm(papers):
    try:
        paper["summary"] = model.generate_content(PROMPT + extract_pdf(paper["url"])).text
    except Exception:
        # download, parsing, or generation can fail; continue with a placeholder
        print("Generation failed")
        paper["summary"] = "Paper not available"

 54%|█████▍    | 7/13 [05:42<03:07, 31.18s/it]

Generation failed


100%|██████████| 13/13 [16:02<00:00, 74.03s/it] 
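
One of the thirteen calls above failed. The notebook does not record why, but if such failures are transient (an assumption), retrying with a growing delay is one possible mitigation. The sketch below is generic: `generate_with_retry` and `flaky_generate` are hypothetical names, and the stand-in function simulates a model call that fails once.

```python
import time


def generate_with_retry(generate, prompt, max_attempts=3, base_delay=1.0):
    """Call generate(prompt), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return generate(prompt)
        except Exception as error:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay:.2f}s")
            time.sleep(delay)


# Stand-in for a real model call: fails on the first call, then succeeds.
calls = {"count": 0}


def flaky_generate(prompt):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("temporary failure")
    return f"summary of: {prompt}"


result = generate_with_retry(flaky_generate, "paper text", base_delay=0.01)
print(result)
```

In the loop above, the callable could be something like `lambda p: model.generate_content(p).text`, with the placeholder summary assigned only after the final attempt fails.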


We write the results to an HTML file.

In [16]:
def strip_fence(summary):
    # remove the ```html ... ``` fence that Gemini usually wraps around its answer,
    # leaving other text (such as "Paper not available") untouched
    text = summary.strip()
    if text.startswith("```"):
        text = text.strip("`").strip().removeprefix("html")
    return text.strip()


header = f"<html> <body> <h1>Daily Dose of AI Research</h1> <h4>{date.today()}</h4> <p><i>tables generated with: {LLM}</i></p>"
with open("papers.html", "w") as f:
    f.write(header)
    for paper in papers:
        f.write(f'<h2><a href="{paper["url"]}">{paper["title"]}</a></h2> {strip_fence(paper["summary"])}')
    f.write("</body> </html>")
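
Paper titles are inserted into the HTML as-is, so a title containing characters such as `&` or `<` could produce malformed markup. A hedged sketch of escaping them with the standard library's `html.escape` follows; the helper name `paper_heading` and the example title and URL are made up for illustration.

```python
from html import escape


def paper_heading(title: str, url: str) -> str:
    """Build the <h2> link for one paper, escaping characters that are special in HTML."""
    return f'<h2><a href="{escape(url, quote=True)}">{escape(title)}</a></h2>'


# Hypothetical title and URL, chosen only to show the escaping.
heading = paper_heading("Fast & Cheap <Models>", "https://arxiv.org/pdf/0000.00000")
print(heading)
```

The model-generated tables are already HTML and should not be escaped this way, only the plain-text fields.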

We can also display the results in this notebook as Markdown.

In [17]:
for paper in papers:
    printmd("**[{}]({})**<br>{}<br><br>".format(paper["title"],
                                                paper["url"],
                                                paper["summary"]))

**[UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist](https://arxiv.org/pdf/2511.08521)**<br>```html
<table border="1">
    <thead>
        <tr>
            <th>Strengths</th>
            <th>Weaknesses</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <strong>Unified & Omni-capable Framework:</strong> Integrates video understanding, segmentation, editing, and generation into cohesive, complex workflows, bridging the gap between specialized models.
            </td>
            <td>
                <strong>Reliance on External SOTA Models:</strong> While modularity is a strength, the system's overall performance is inherently dependent on the quality, availability, and stability of the underlying third-party SOTA models (e.g., Seedance, Runway Aleph, InternVL3, Claude-sonnet-4) that it orchestrates.
            </td>
        </tr>
        <tr>
            <td>
                <strong>Open-Source Contribution:</strong> Both the UniVA platform and the UniVA-Bench benchmark are fully open-sourced, aiming to catalyze research in interactive, agentic, and general-purpose video intelligence.
            </td>
            <td>
                <strong>Potential for High Computational Cost:</strong> Orchestrating multiple LLM calls and diverse tools in complex, iterative workflows can be computationally intensive and potentially slower compared to monolithic models, though the paper doesn't explicitly detail this as a weakness.
            </td>
        </tr>
        <tr>
            <td>
                <strong>Advanced Agentic Architecture:</strong> Employs a "Plan-and-Act" dual-agent design (Planner interprets user intent, Executor invokes tools) for highly automated, proactive, and iterative video creation with self-reflection capabilities.
            </td>
            <td>
                <strong>Nuanced Performance Trade-offs:</strong> In some specific scenarios, like "Entities2Video," UniVA might prioritize overall narrative coherence and instruction complexity over strict pixel-level subject consistency (DINO Score), indicating a trade-off where specialized models might still lead in isolated, fine-grained metrics.
            </td>
        </tr>
        <tr>
            <td>
                <strong>Robust Long-Horizon Reasoning:</strong> Utilizes a hierarchical multi-level memory (global knowledge, task context, user-specific preferences) to sustain long-horizon reasoning, contextual continuity, and enable iterative, multi-round interactions.
            </td>
            <td>
                <strong>Engineering Complexity:</strong> Managing and maintaining a large-scale, modular system with numerous diverse tool servers and an MCP protocol might introduce significant engineering overhead for deployment, updates, and debugging compared to simpler, monolithic solutions.
            </td>
        </tr>
        <tr>
            <td>
                <strong>Modular & Extensible Tool Hub:</strong> Built on an MCP (Model Context Protocol) framework, allowing seamless, plug-and-play integration of diverse state-of-the-art AI (generation, understanding, editing, tracking) and non-AI tools, forming an open ecosystem.
            </td>
            <td>
                <strong>Scalability of Memory Management:</strong> While hierarchical memory is beneficial, managing and efficiently retrieving information from potentially very large global or user-specific memory stores could become a challenge as the system scales to numerous users and extensive interaction histories.
            </td>
        </tr>
        <tr>
            <td>
                <strong>Comprehensive Evaluation Benchmark (UniVA-Bench):</strong> Introduces a novel benchmark suite for multi-step video tasks, evaluating not just task performance but also agentic aspects like plan quality, memory utilization, and tool-routing efficiency.
            </td>
            <td></td>
        </tr>
        <tr>
            <td>
                <strong>Strong Performance Across Tasks:</strong> Demonstrates superior or competitive performance in generation, understanding, editing, and segmentation, particularly excelling in complex, multi-step scenarios and fulfilling holistic user intent (confirmed by MLLM-as-a-Judge and human evaluation).
            </td>
            <td></td>
        </tr>
        <tr>
            <td>
                <strong>Industrial-Level Capabilities:</strong> Supports iterative and "any-conditioned" video workflows, ensuring cinematic quality, consistency, scalability, and extensibility across all stages of video production.
            </td>
            <td></td>
        </tr>
    </tbody>
</table>
```<br><br>

**[Black-Box On-Policy Distillation of Large Language Models](https://arxiv.org/pdf/2511.10643)**<br>```html
<table border="1">
  <thead>
    <tr>
      <th>Strengths</th>
      <th>Weaknesses</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Enables Black-Box Distillation for Proprietary LLMs:</strong> Effectively distills knowledge from teacher models (e.g., GPT-5) without access to internal logits or parameters.</td>
      <td><strong>Increased Computational Cost:</strong> Training one 14B model takes approximately 30 hours on 16 H100 GPUs, which is substantial compared to simpler distillation methods.</td>
    </tr>
    <tr>
      <td><strong>Facilitates On-Policy Learning in Black-Box Setting:</strong> Overcomes the challenge of lacking probability-level supervision for a student's self-generated responses.</td>
      <td><strong>Complexity of Adversarial Training:</strong> Balancing a generator and discriminator in a minimax game can be inherently complex and sensitive, despite the proposed solutions for stability.</td>
    </tr>
    <tr>
      <td><strong>Novel Generative Adversarial Distillation (GAD) Framework:</strong> Utilizes a minimax game between a student (generator) and a discriminator to provide implicit feedback.</td>
      <td><strong>Potential Hyperparameter Sensitivity:</strong> RL algorithms (GRPO) and adversarial networks often require careful tuning of numerous hyperparameters.</td>
    </tr>
    <tr>
      <td><strong>Adaptive and Stable On-Policy Reward Model:</strong> Discriminator co-evolves with the student, providing stable, adaptive feedback and effectively preventing reward hacking.</td>
      <td><strong>Dependency on Discriminator Quality:</strong> The effectiveness of the implicit reward signal relies heavily on the discriminator's ability to accurately distinguish between teacher and student outputs.</td>
    </tr>
    <tr>
      <td><strong>Superior Performance over Baselines:</strong> Consistently and significantly outperforms traditional sequence-level knowledge distillation (SeqKD) across various model sizes and datasets.</td>
      <td><strong>Reliance on LLM-as-a-Judge for Evaluation:</strong> While human evaluation is also conducted, the primary automatic metric (GPT-4o scores) is based on an LLM, which may introduce its own biases.</td>
    </tr>
    <tr>
      <td><strong>Achieves Teacher-Level Performance:</strong> Student models distilled with GAD (e.g., Qwen2.5-14B) become comparable to the proprietary teacher (GPT-5-Chat).</td>
      <td><strong>Limited Comparison to Other Advanced Black-Box Methods:</strong> Primarily compared against SeqKD (SFT), without exploring other potentially more advanced or recent black-box distillation techniques (if available beyond SFT).</td>
    </tr>
    <tr>
      <td><strong>Strong Out-of-Distribution (OOD) Generalization:</strong> Delivers robust improvements on OOD benchmarks where SeqKD shows marginal or negative gains.</td>
      <td></td>
    </tr>
    <tr>
      <td><strong>Avoids Overfitting Local Patterns:</strong> Captures the teacher's global stylistic characteristics rather than memorizing local lexical patterns.</td>
      <td></td>
    </tr>
    <tr>
      <td><strong>Mode-Seeking Behavior:</strong> Learns reachable modes of the teacher's distribution, proving more effective than mode-covering approaches.</td>
      <td></td>
    </tr>
    <tr>
      <td><strong>Robust Training Procedure:</strong> Includes a crucial warmup stage for both generator and discriminator, ensuring effective and stable adversarial optimization.</td>
      <td></td>
    </tr>
    <tr>
      <td><strong>Comprehensive Evaluation:</strong> Validated through extensive automatic (GPT-4o on multiple datasets) and human evaluations.</td>
      <td></td>
    </tr>
    <tr>
      <td><strong>Handles Incompatible Tokenizers:</strong> Effective even when student and teacher models have incompatible tokenizers, broadening applicability.</td>
      <td></td>
    </tr>
  </tbody>
</table>
```<br><br>

**[Depth Anything 3: Recovering the Visual Space from Any Views](https://arxiv.org/pdf/2511.10647)**<br>```html
<table border="1" style="width:100%; border-collapse: collapse;">
    <thead>
        <tr>
            <th style="padding: 8px; text-align: left; background-color: #f2f2f2;">Strengths</th>
            <th style="padding: 8px; text-align: left; background-color: #f2f2f2;">Weaknesses</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td style="padding: 8px; text-align: left;">
                <ul>
                    <li><strong>Unified Any-View Geometry:</strong> Recovers spatially consistent 3D structure from an arbitrary number of visual inputs (monocular, multi-view, video), with or without known camera poses.</li>
                    <li><strong>Minimalist Architecture:</strong> Employs a single plain transformer (e.g., vanilla DINO encoder) as a backbone, avoiding complex, specialized architectures for different 3D tasks.</li>
                    <li><strong>Effective Depth-Ray Prediction Target:</strong> A singular depth-ray prediction target is shown to be sufficient for geometric reconstruction, obviating the need for complex multi-task learning.</li>
                    <li><strong>State-of-the-Art Performance:</strong> Sets new state-of-the-art across a new visual geometry benchmark (camera pose, any-view geometry, visual rendering) and outperforms Depth Anything 2 in monocular depth estimation.</li>
                    <li><strong>Teacher-Student Training Paradigm:</strong> Effectively leverages a powerful teacher model (trained on synthetic data) to generate high-quality pseudo-labels for real-world data, enhancing detail and completeness.</li>
                    <li><strong>Scalability and Generalization:</strong> Benefits from large-scale pretraining and an input-adaptive cross-view self-attention mechanism, leading to strong generalization across diverse scenes and input types.</li>
                    <li><strong>Efficient Inference:</strong> The DA3-Giant model offers competitive running speed compared to previous SOTA, and smaller variants achieve significantly higher frame rates (Table 8).</li>
                    <li><strong>Dual-DPT Head:</strong> A novel architecture designed to jointly output both depth and ray values, promoting strong interaction between the two prediction tasks.</li>
                    <li><strong>Foundation for Downstream Tasks:</strong> Demonstrated as a strong backbone for Feed-Forward Novel View Synthesis (FF-NVS), outperforming specialized NVS models with a simple DPT head.</li>
                    <li><strong>Robust to Pose Information:</strong> Flexible design allows seamless integration of known camera poses when available, improving accuracy, or operating effectively without them.</li>
                </ul>
            </td>
            <td style="padding: 8px; text-align: left;">
                <ul>
                    <li><strong>High Training Resources:</strong> Training the largest DA3-Giant model requires significant computational resources (e.g., 128 H100 GPUs for 10 days), which might be prohibitive for smaller research groups.</li>
                    <li><strong>Teacher Model Dependency:</strong> The approach heavily relies on a powerful teacher model trained exclusively on extensive synthetic datasets, which might carry inherent biases or limitations from the synthetic domain despite alignment efforts.</li>
                    <li><strong>Performance Scaling with Model Size:</strong> While the 'Giant' variant achieves SOTA, the performance of smaller models (Base, Small) is notably lower, suggesting that substantial model capacity is still critical for top-tier results.</li>
                    <li><strong>Current Limitations in Dynamic Scenes:</strong> The paper states "Future work can extend its reasoning to dynamic scenes," implying the current model is primarily designed for static environments.</li>
                    <li><strong>No Integration of Language/Interaction Cues:</strong> The conclusion also mentions "integrate language and interaction cues" as future work, indicating these multimodal capabilities are not present in the current model.</li>
                    <li><strong>Residual Pose Drift:</strong> Although significantly improved, visual examples of camera trajectories (Fig. 5) can still show minor drifts or inconsistencies, particularly over longer sequences, a common challenge in visual SLAM.</li>
                </ul>
            </td>
        </tr>
    </tbody>
</table>
```<br><br>

**[Superpositional Gradient Descent: Harnessing Quantum Principles for Model Training](https://arxiv.org/pdf/2511.01918)**<br>```html
<table border="1">
    <thead>
        <tr>
            <th>Strengths</th>
            <th>Weaknesses</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Introduces a novel quantum-inspired optimization method (Superpositional Gradient Descent - SGD).</td>
            <td>Scalability and hardware constraints are acknowledged limitations for widespread adoption, especially with real quantum processors.</td>
        </tr>
        <tr>
            <td>Demonstrates faster convergence and lower final loss compared to AdamW on synthetic text classification and LLM fine-tuning tasks (Llama-3.2-1B-Instruct on GSM8K).</td>
            <td>The "quantum" aspect relies on simulations (Qiskit's statevector/Aer simulator) rather than actual quantum hardware, which is a common limitation in current quantum machine learning research.</td>
        </tr>
        <tr>
            <td>Provides a mathematical framework for integrating quantum superposition principles into classical gradient descent.</td>
            <td>The quantum circuit contribution is limited (e.g., 4 qubits, small circuit depth) to manage simulation overhead, which might limit the extent of "quantumness" in the optimization.</td>
        </tr>
        <tr>
            <td>Offers a practical pathway for leveraging quantum principles to enhance model behavior even before large-scale quantum computers are widely available, utilizing hybrid quantum-classical circuits.</td>
            <td>Per-epoch computational cost is higher (approximately 35% more time per epoch) compared to classical optimizers due to quantum circuit simulation, although total training time might be reduced.</td>
        </tr>
        <tr>
            <td>The hybrid approach is implemented using established tools (PyTorch, Qiskit's TorchConnector), making it accessible for further research.</td>
            <td>Hyperparameter selection, particularly for the quantum weight (λ) and number of qubits, is largely empirical, suggesting a need for more theoretical guidance.</td>
        </tr>
        <tr>
            <td>The paper provides new insights into the intersection of quantum computing and deep learning optimization.</td>
            <td>The "quantum-inspired perturbations" use sinusoidal modulation, which might be a simplified representation of quantum interference, rather than a direct quantum mechanical process.</td>
        </tr>
        <tr>
            <td>Achieves a notable reduction in training time to reach target accuracy (e.g., 37.8% reduction for text classification, 16% lower total time).</td>
            <td>Future work explicitly highlights scaling to larger models, exploring more sophisticated quantum circuit designs, and developing implementations for real quantum processors as open challenges, indicating current limitations.</td>
        </tr>
    </tbody>
</table>
```<br><br>

**[One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models](https://arxiv.org/pdf/2511.10629)**<br>```html
<table>
  <thead>
    <tr>
      <th>Strengths</th>
      <th>Weaknesses</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><b>Exceptional Efficiency:</b> Achieves high-resolution synthesis with significantly lower latency and computational cost. It's a lightweight, single feed-forward pass module, offering nearly 3x lower decoding and upscaling time compared to pixel-space SR, and orders of magnitude faster than multi-stage diffusion methods (e.g., 3.52s for 2048x2048 vs. 20.77s for LSRNA–DemoFusion).</td>
      <td><b>Inherited Artifacts:</b> As an adapter, LUA faithfully upscales the generator's latent code. This means any errors or biases present in the initial low-resolution latent will persist and can be magnified in the high-resolution output, without a mechanism to correct them.</td>
    </tr>
    <tr>
      <td><b>High Image Quality at Scale:</b> Delivers comparable or superior perceptual quality and fidelity at higher resolutions (2048x2048 and 4096x4096) compared to many baselines. It effectively preserves fine detail, edge continuity, and microstructure while introducing fewer artifacts (e.g., less noise, ringing, or plastic textures) than pixel-space SR.</td>
      <td><b>Performance Gap at Lowest Upscale Resolution:</b> At 1024x1024 resolution (from a 512px base), LUA's overall FID performance trails some of the strongest single-stage baselines. This is attributed to the inherent limitations of the 64x64 input latent concerning recoverable detail, although its patch-level fidelity (pFID) remains competitive.</td>
    </tr>
    <tr>
      <td><b>Seamless Integration & Ease of Deployment:</b> LUA is designed as a drop-in component, requiring no modifications to the base diffusion model, VAE, or additional diffusion stages. This simplifies integration into existing pipelines and avoids complex multi-stage inference.</td>
      <td><b>No Built-in Refinement for Base Errors:</b> The current design lacks a dedicated refinement mechanism to actively suppress or correct artifacts generated in the initial low-resolution image, relying solely on upscaling the given latent.</td>
    </tr>
    <tr>
      <td><b>Robust Generality and Scalability:</b> A single LUA backbone supports multiple upscaling factors (×2 and ×4) through scale-specific heads, eliminating the need for separate models. It also demonstrates strong cross-VAE generalization, transferring across models like FLUX, SD3, and SDXL with only a minimal modification to the first convolution layer and brief fine-tuning.</td>
      <td><b>Undeveloped for Video and Temporal Consistency:</b> The paper identifies extending the adapter to video with temporal consistency as future work, implying that the current architecture is not directly designed or optimized for dynamic settings requiring frame-to-frame coherence.</td>
    </tr>
    <tr>
      <td><b>Effective Training Strategy:</b> Utilizes a progressive three-stage curriculum that combines latent-domain structural and spectral alignment with pixel-domain consistency and edge-aware refinement. This robust training process ensures stable learning and high-fidelity decoding without relying on additional diffusion steps.</td>
      <td></td>
    </tr>
  </tbody>
</table>
```<br><br>

**[AlphaResearch: Accelerating New Algorithm Discovery with Language Models](https://arxiv.org/pdf/2511.08522)**<br>```html
<table border="1">
  <thead>
    <tr>
      <th>Strengths</th>
      <th>Weaknesses</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Novel Dual Research Environment:</strong> Combines execution-based verification and a simulated real-world peer review environment to synergize feasibility and innovation in algorithm discovery.</td>
      <td><strong>Limited Win Rate Against Humans:</strong> Achieved a 2/8 win rate in head-to-head comparisons with human researchers on the AlphaResearchComp benchmark, underperforming in 6 out of 8 problems.</td>
    </tr>
    <tr>
      <td><strong>Achieved State-of-the-Art Performance:</strong> Discovered a new algorithm for the "packing circles" problem that achieved best-of-known performance, surpassing human researchers and strong baselines like AlphaEvolve.</td>
      <td><strong>Struggles with Established Human-Best Solutions:</strong> Failed to show improvement when initialized with human-best solutions on problems like "Littlewood Polynomials" and "MSTD (n=30)", indicating difficulty in surpassing well-optimized human-designed algorithms.</td>
    </tr>
    <tr>
      <td><strong>Autonomous Iterative Refinement:</strong> Demonstrates the ability to iteratively propose new ideas, verify them in the dual environment, and optimize research proposals for better performance, showing continuous reward growth.</td>
      <td><strong>High Idea Rejection Rate by RM:</strong> Approximately 30-40% of newly proposed ideas are discarded by the reward model (RM) for being below the quality threshold, suggesting inefficiency in idea generation.</td>
    </tr>
    <tr>
      <td><strong>Effective Real-World Peer Review Simulation:</strong> The AlphaResearch-RM-7B reward model, fine-tuned on ICLR peer review records, demonstrates 72% accuracy in identifying good ideas, significantly outperforming baseline LLMs and human experts.</td>
      <td><strong>Execution Failure Rate:</strong> A significant portion of ideas that pass the RM still fail during the program-based execution phase (e.g., 28.9% for "Packing Circles"), indicating challenges in generating consistently executable and valid code.</td>
    </tr>
    <tr>
      <td><strong>Transparent & Reproducible Benchmark (AlphaResearchComp):</strong> Introduced a new evaluation benchmark of 8 open-ended algorithmic problems with carefully curated executable pipelines, objective metrics, and human-best baselines for transparent and reproducible evaluation.</td>
      <td><strong>Current Scope Limitations:</strong> The paper acknowledges limitations in expanding coverage to more realistic and complex applications like accelerating tensor computations, suggesting a narrow initial focus.</td>
    </tr>
    <tr>
      <td><strong>Provides Valuable Insights:</strong> Conducts a comprehensive analysis of the 6/8 failure cases, offering valuable insights and directions for future research in autonomous algorithm discovery.</td>
      <td><strong>Simplicity of Approach:</strong> The current approach is described as "simplest and most straight-forward," implying that more advanced strategies, such as augmenting agents with external tools, are left for future work.</td>
    </tr>
    <tr>
      <td><strong>Outperforms Other LLM Agents:</strong> Showed better performance than OpenEvolve and slightly surpassed ShinkaEvolve on the "packing circles" problem, demonstrating an advantage in accelerating algorithm discovery.</td>
      <td><strong>Reward Model Training Limitations:</strong> The reward model is trained on a relatively small model (Qwen-2.5-7B-Instruct) and a specific dataset size (24,445 ICLR records), indicating potential for improvement with larger models and datasets.</td>
    </tr>
  </tbody>
</table>
```<br><br>

**[Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation](https://arxiv.org/pdf/2511.10547)**<br>Paper not available<br><br>

**[Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following](https://arxiv.org/pdf/2511.10507)**<br>```html
<table>
  <thead>
    <tr>
      <th>Strengths</th>
      <th>Weaknesses</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Novel, High-Quality Benchmark (AdvancedIF):</strong>
        <ul>
          <li>Comprehensive evaluation of advanced instruction following (complex, multi-turn, system-prompted instructions).</li>
          <li>Features over 1,600 prompts and expert-curated rubrics, carefully written and reviewed by human experts.</li>
          <li>Challenging for State-of-the-Art (SoTA) LLMs (best achieved ~70%), highlighting a significant gap for improvement.</li>
          <li>Simulates real user-bot interactions by having human experts interact with LLMs to generate multi-turn prompts.</li>
          <li>First benchmark to feature pure expert-written prompts and rubrics across complex IF, multi-turn carried context, and system prompt steerability.</li>
        </ul>
      </td>
      <td><strong>Benchmark Availability:</strong>
        <ul>
          <li>AdvancedIF is "to be released shortly," limiting immediate public access for broader research and reproducibility until its release.</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td><strong>Effective Training Pipeline (RIFL):</strong>
        <ul>
          <li>Substantially improves LLM instruction-following capabilities (6.7% absolute gain on AdvancedIF and strong results on public benchmarks).</li>
          <li><strong>Rubric Generator:</strong> Fine-tuned model effectively generates high-quality rubrics at scale with an F1 score of 0.790.</li>
          <li><strong>Finetuned Rubric Verifier:</strong> Two-stage (SFT + RL) training yields a reliable verifier with significantly higher human agreement (F1 0.728) than vanilla LLM judges, mitigating reward hacking.</li>
          <li><strong>Reward Shaping:</strong> Introduces additional criteria to effectively prevent reward hacking issues during RL training.</li>
          <li>Ablation studies confirm the effectiveness of each individual component within the RIFL pipeline.</li>
          <li>Establishes rubrics as a powerful and interpretable tool for both training and evaluating LLMs' advanced instruction-following abilities.</li>
        </ul>
      </td>
      <td><strong>Reliance on Initial Human Annotation:</strong>
        <ul>
          <li>Although RIFL uses synthetic generation for scale, the initial training data for both the rubric generator and verifier still relies on expensive and time-consuming expert human annotations.</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td><strong>Addresses Key Challenges:</strong>
        <ul>
          <li>Tackles the lack of high-quality human-annotated benchmarks and reliable reward signals for complex instruction following.</li>
          <li>Provides a scalable learning pipeline for advanced IF, overcoming issues with unreliable rubric generators/verifiers and reward hacking.</li>
        </ul>
      </td>
      <td><strong>Imperfect Rubric Generator Accuracy:</strong>
        <ul>
          <li>While effective, the rubric generator's F1 score of 0.790 suggests a non-trivial error rate (21%) in generating perfectly aligned rubrics, which could introduce noise into training.</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td><strong>Strong Empirical Validation:</strong>
        <ul>
          <li>Extensive experiments demonstrate significant improvements across various challenging benchmarks.</li>
        </ul>
      </td>
      <td><strong>Limited Reward Design Exploration:</strong>
        <ul>
          <li>The paper acknowledges that a "more comprehensive study of reward design" (e.g., weighted sum of criteria) is left for future work, implying the current reward function might not be fully optimized.</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td></td>
      <td><strong>Ad-hoc Reward Hacking Prevention:</strong>
        <ul>
          <li>While effective in the presented experiments, the specific "additional criteria" for reward shaping are pragmatic and might need continuous adaptation as models find new ways to hack rewards.</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td></td>
      <td><strong>Computational Cost (Implicit):</strong>
        <ul>
          <li>Reinforcement learning, especially for large language models, is computationally intensive, which can be a practical barrier to widespread adoption or iteration.</li>
        </ul>
      </td>
    </tr>
  </tbody>
</table>
```<br><br>

**[Music Flamingo: Scaling Music Understanding in Audio Language Models](https://arxiv.org/pdf/2511.10289)**<br>```html
<table border="1">
  <thead>
    <tr>
      <th>Strengths</th>
      <th>Weaknesses</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Novel Large Audio-Language Model (LALM):</strong> Music Flamingo is specifically designed for comprehensive music understanding, moving beyond previous models that treated music as a secondary modality.</td>
      <td><strong>Limited Understanding of Underrepresented Cultures:</strong> Still has a limited understanding of music from underrepresented cultural traditions, indicating a need for more diverse global music data.</td>
    </tr>
    <tr>
      <td><strong>Addresses Core Music Challenges:</strong> Effectively tackles the dynamic, layered, and information-dense nature of music, as well as the scarcity of high-quality music data.</td>
      <td><strong>Gaps in Specialized/Fine-Grained Tasks:</strong> Shows limitations in highly specialized tasks, such as fine-grained piano technique recognition and other specific instrument skills.</td>
    </tr>
    <tr>
      <td><strong>New Large-Scale, High-Quality Datasets:</strong>
        <ul>
          <li><strong>MF-Skills:</strong> A curated dataset with 4M+ samples featuring rich, multi-aspect, layered captions (harmony, structure, timbre, lyrics, cultural context) and detailed Q&A for full-length, multicultural songs.</li>
          <li><strong>MF-Think:</strong> A novel 300K chain-of-thought (CoT) dataset grounded in music theory, enhancing reasoning capabilities.</li>
        </ul>
      </td>
      <td><strong>Need for Broader Skill Coverage:</strong> Acknowledges the need to expand its coverage across additional musical skills to achieve a more comprehensive understanding.</td>
    </tr>
    <tr>
      <td><strong>Advanced Training Methodologies:</strong>
        <ul>
          <li>Enhanced Audio Flamingo 3 backbone with multilingual/multi-speaker ASR.</li>
          <li>Post-training with reasoning cold-start using MF-Think.</li>
          <li>GRPO-based reinforcement learning with custom rewards for explicit, step-by-step musical reasoning.</li>
          <li>Extended context length (up to ~24k tokens) and Rotary Time Embeddings (RoTE) for fine-grained temporal perception.</li>
        </ul>
      </td>
      <td><strong>Occasional Over-specification/Hallucinations:</strong> In qualitative evaluations, the model sometimes over-specifies musical details (e.g., asserting non-existent colorful chords or percussion layers), and genre misclassifications can lead to cascaded production hallucinations.</td>
    </tr>
    <tr>
      <td><strong>State-of-the-Art (SOTA) Performance:</strong> Achieves SOTA results across 12 music understanding and reasoning benchmarks, including music QA, MIR, and lyrics transcription, outperforming many open-source and closed-source LALMs.</td>
      <td></td>
    </tr>
    <tr>
      <td><strong>Robust Evaluation:</strong> Validated through both human expert judgments and LLM-as-a-judge assessments (e.g., on SongCaps), demonstrating high accuracy, correctness, and coverage in natural language descriptions.</td>
      <td></td>
    </tr>
    <tr>
      <td><strong>Open-Source Commitment:</strong> Plans to release code, training recipes, and datasets, fostering community research and development.</td>
      <td></td>
    </tr>
  </tbody>
</table>
```<br><br>

**[MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples](https://arxiv.org/pdf/2511.10047)**<br>```html
<table>
    <thead>
        <tr>
            <th>Strengths</th>
            <th>Weaknesses</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <ul>
                    <li><b>Zero-shot and Training-free:</b> The method requires no labeled samples for training and operates without any training process or prompts, making it highly adaptable for industrial scenarios.</li>
                    <li><b>Multimodal Capability:</b> Flexibly supports single 2D, single 3D, or combined 2D+3D data for anomaly classification and segmentation.</li>
                    <li><b>Leverages Intrinsic Data Properties:</b> Explicitly exploits the key property that normal patches are highly similar across industrial products, while anomalies are diverse and isolated.</li>
                    <li><b>Novel Mutual Scoring Mechanism (MSM):</b> Introduces a unique paradigm where unlabeled samples mutually assign anomaly scores to each other.</li>
                    <li><b>Enhanced 3D Representation (IPG):</b> Iterative Point Grouping (IPG) reduces false positives from discontinuous surfaces in 3D data, leading to geometrically consistent patches.</li>
                    <li><b>Improved Anomaly Modeling (SNAMD):</b> Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) fuses multi-scale neighborhood cues and uses Similarity-Weighted Pooling (SWPooling) to better capture variable-sized anomalies and prevent missed detections.</li>
                    <li><b>Cross-modal Anomaly Enhancement (CAE):</b> Effectively fuses 2D and 3D scores to recover modality-specific missing anomalies, addressing limitations of single-modal detection.</li>
                    <li><b>Robust False Classification Suppression (RsCon):</b> Re-scoring with Constrained Neighborhood (RsCon) mitigates false positives and false negatives caused by local noise and weak anomalies.</li>
                    <li><b>Significant Performance Gains:</b> Achieves substantial improvements over previous zero-shot benchmarks and often outperforms most few-shot methods (e.g., +23.7% AP on MVTec 3D-AD, +19.3% on Eyecandies for AS).</li>
                    <li><b>High Robustness:</b> Maintains consistent and robust performance across full datasets, smaller subsets (minimal degradation), and varying ratios of normal samples (degradation below 3% even without normal samples).</li>
                    <li><b>Improved Efficiency:</b> MuSc-V2 is 5.6 times faster than its previous version, MuSc, owing to optimized feature aggregation.</li>
                </ul>
            </td>
            <td>
                <ul>
                    <li><b>Sensitivity to Subtle Impurities:</b> The SNAMD module, while good at detecting anomalies, can be overly sensitive to subtle foreign impurities in background regions (not true anomalies), leading to over-detection and slight performance drops on some datasets (e.g., VisA).</li>
                    <li><b>Dependency on Pre-trained Backbones:</b> Relies on large pre-trained vision (DINO ViT, CLIP ViT) and point (Point-MAE Point Transformer) models for feature extraction, meaning it's not an entirely "from-scratch" solution.</li>
                    <li><b>Computational Bottleneck in 3D:</b> Despite overall speed improvements, the initial feature extraction step for 3D point clouds (e.g., 722.6ms by Point-MAE) can remain a significant bottleneck for overall inference time.</li>
                    <li><b>Hyperparameter Sensitivity (Minor):</b> While the method is generally robust, the authors note that keeping hyperparameters (e.g., IPG curvature threshold, IA range, RsCon window size) "within appropriate bounds is both inevitable and manageable," implying that some careful selection or minor tuning might still be necessary for optimal performance across vastly different scenarios.</li>
                    <li><b>Initial Susceptibility of MSM:</b> The core Mutual Scoring Mechanism, before refinement by RsCon, is acknowledged to be "susceptible to both false positives and false negatives in certain challenging cases" (e.g., local noise, weak anomalies).</li>
                </ul>
            </td>
        </tr>
    </tbody>
</table>
```<br><br>

**[AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models](https://arxiv.org/pdf/2511.10017)**<br>```html
<table border="1">
  <thead>
    <tr>
      <th>Strengths</th>
      <th>Weaknesses</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <ul>
          <li><strong>Novel Task Formulation:</strong> Introduces Fine-grained 3D Embodied Reasoning as a unified task, predicting structured triplets (3D mask, motion type, motion axis direction) for instruction-conditioned scenarios, coherently coupling spatial grounding and interaction reasoning.</li>
          <li><strong>Innovative Framework (AffordBot):</strong> Integrates 3D geometric information with Multimodal Large Language Models (MLLMs) via a holistic multimodal representation and a tailored Chain-of-Thought (CoT) reasoning paradigm.</li>
          <li><strong>Robust Multimodal Representation:</strong> Bridges 3D point cloud input to 2D-native MLLMs by rendering dynamic surround-view images and projecting 3D elements with geometric-semantic descriptors and adaptive label refinement, providing dense visual context without redundant video processing overhead.</li>
          <li><strong>Effective Chain-of-Thought Reasoning:</strong> Employs a structured "observe-then-infer" CoT pipeline, including active view selection by the MLLM to focus on the most informative viewpoint, followed by sequential affordance grounding and motion estimation.</li>
          <li><strong>State-of-the-Art Performance:</strong> Achieves superior results on the SceneFun3D dataset, outperforming existing methods in both affordance grounding and motion estimation, demonstrating strong generalization and physically grounded reasoning.</li>
          <li><strong>Improved Generalization:</strong> The approach shows enhanced spatial awareness and ability to handle complex scenarios compared to prior work.</li>
          <li><strong>Modularity and Scalability:</strong> The component-wise design allows for clear understanding of contributions, and the framework can leverage more advanced MLLMs (e.g., GPT-o1) for further performance gains.</li>
          <li><strong>Physically Plausible Reasoning:</strong> The CoT paradigm guides the MLLM towards semantically aligned and physically plausible inferences, crucial for real-world embodied agents.</li>
        </ul>
      </td>
      <td>
        <ul>
          <li><strong>High Dependency on Segmentation Quality:</strong> Performance is significantly limited by the accuracy of the initial 3D instance segmentation (e.g., Mask3D), with inaccurate or missing elements severely impacting downstream reasoning (identified as the "primary limiting factor").</li>
          <li><strong>Viewpoint Limitations:</strong> Despite dynamic surround-view generation, current scene-centric rendering can still lead to occlusion or poor visibility in complex scenes, and active perception could be further optimized.</li>
          <li><strong>Varying Performance Across Affordance Types:</strong> Performance disparities exist across different affordance categories, partly due to dataset class imbalance and varying initial segmentation quality for different object types.</li>
          <li><strong>Challenges with Small/Weakly-Textured Objects:</strong> The segmentation model struggles with small, weakly-textured objects, which often correspond to unique target elements, leading to noisier descriptors and hindering subsequent grounding and motion estimation.</li>
          <li><strong>Discretization of Motion Axes:</strong> While necessary for MLLM compatibility, discretizing continuous motion vectors into broad categories might lose some fine-grained directional information important for highly precise robotic manipulation.</li>
          <li><strong>Computational Resources:</strong> Deploying and running large MLLMs like Qwen2.5-VL-72B locally requires significant computational power (e.g., four NVIDIA A800 GPUs), which might be a barrier for some applications or research groups.</li>
          <li><strong>Potential for MLLM Biases:</strong> As acknowledged by the authors, reliance on MLLMs introduces a risk of inherent biases in the models, potentially affecting fairness and predictability in agent behavior.</li>
        </ul>
      </td>
    </tr>
  </tbody>
</table>
```<br><br>

**[SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control](https://arxiv.org/pdf/2511.09715)**<br>```html
<table border="1">
  <thead>
    <tr>
      <th>Strengths</th>
      <th>Weaknesses</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><b>Novelty:</b> First framework to enable continuous, fine-grained instruction control in instruction-based image editing models.</td>
      <td><b>Inherent Attribute Entanglement:</b> Despite disentangled control, the system can still exhibit attribute entanglement (e.g., skin tone affecting hair color) due to inherent couplings in the underlying base generative models.</td>
    </tr>
    <tr>
      <td><b>Fine-Grained & Disentangled Control:</b> Provides smooth, continuous, and interpretable adjustment of individual edit instruction strengths (suppression, full application, amplification) in multi-instruction prompts.</td>
      <td><b>Trade-off in Control Metrics:</b> Acknowledges a consistent trade-off between continuity, extrapolation, and disentanglement across different model configurations.</td>
    </tr>
    <tr>
      <td><b>Lightweight & Efficient:</b> Learns a single set of low-rank adaptation (LoRA) matrices, making the training computationally very lightweight, data-efficient, and fast.</td>
      <td><b>PPS vs. SPPS Complexity:</b> While Simplified Partial Prompt Suppression (SPPS) is robust and efficient, the full Partial Prompt Suppression (PPS), which offers finer multi-instruction control, requires parsing multi-instruction prompts and identifying token-level boundaries, adding a layer of complexity.</td>
    </tr>
    <tr>
      <td><b>No Per-Attribute Retraining:</b> Eliminates the need to train a new LoRA or embedding direction for each attribute or concept, generalizing across diverse edits.</td>
      <td><b>Scope for Further Efficiency Optimization:</b> The paper leaves further investigation of efficiency-oriented design choices (e.g., training adapters on a subset of transformer blocks or specific timesteps) to future work, implying the current implementation may not be fully optimized in these areas.</td>
    </tr>
    <tr>
      <td><b>Seamless Integration:</b> Integrates effortlessly with state-of-the-art models like FLUX-Kontext and Qwen-Image-Edit with minimal additional training.</td>
      <td></td>
    </tr>
    <tr>
      <td><b>Superior Performance:</b> Achieves substantial improvements in edit controllability, visual consistency, user steerability, smoother edit trajectories, and better identity preservation compared to baselines.</td>
      <td></td>
    </tr>
    <tr>
      <td><b>Novel Loss Function:</b> Introduces the Partial Prompt Suppression (PPS) loss, an intuitive, interpretable, and easy-to-optimize objective for training instruction-aware adapters.</td>
      <td></td>
    </tr>
    <tr>
      <td><b>Versatile Applications:</b> Supports complex multi-object scene manipulations, zero-shot personalization, and enables narrative-like visual editing sequences.</td>
      <td></td>
    </tr>
  </tbody>
</table>
```<br><br>

**[ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents](https://arxiv.org/pdf/2511.07685)**<br>```html
<table border="1">
  <thead>
    <tr>
      <th>Strengths of the Paper / RESEARCHRUBRICS Benchmark</th>
      <th>Weaknesses of Current Deep Research Agents / Existing Benchmarks</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <b>Comprehensive & High-Quality Benchmark:</b>
        <ul>
          <li>Human-crafted and reviewed prompts and rubrics (2,800+ hours of labor, 2,593 expert-written criteria) mitigate LLM-generated content biases.</li>
          <li>Features realistic, open-ended research tasks across nine diverse domains (e.g., AI/ML, historical analysis, consumer research).</li>
          <li>Includes fine-grained positive and negative rubrics to assess factual grounding, reasoning soundness, completeness, relevance, and clarity.</li>
        </ul>
      </td>
      <td>
        <b>Low Performance of State-of-the-Art Agents:</b>
        <ul>
          <li>Leading Deep Research (DR) agents (Gemini DR, OpenAI DR, Perplexity DR) achieve under 68% average rubric compliance, showing significant room for improvement.</li>
          <li>Performance degrades monotonically with increased task complexity, particularly logical nesting depth.</li>
          <li>Agents struggle with tasks requiring technical precision or systematic execution, sometimes favoring creative synthesis.</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td>
        <b>Advanced Evaluation Framework:</b>
        <ul>
          <li>Introduces a novel task complexity framework (conceptual breadth, logical nesting, exploration) for categorizing and analyzing DR tasks.</li>
          <li>Separates mandatory from optional rubric criteria, distinguishing between minimum viability and excellent performance.</li>
          <li>Proposes a ternary grading scheme (Satisfied, Partially Satisfied, Not Satisfied) for nuanced evaluation and partial credit.</li>
          <li>Validated LLM-as-judge setup achieves substantial human-LLM alignment (up to 0.76 Macro F1 for binary grading).</li>
        </ul>
      </td>
      <td>
        <b>Specific Agent Failure Modes:</b>
        <ul>
          <li>Failures in implicit reasoning and information synthesis account for 45-50% of all agent failures.</li>
          <li>Agents demonstrate inadequate reasoning about retrieved information and frequently miss implicit context or unstated requirements.</li>
          <li>Struggle to integrate multi-document evidence into coherent, well-justified arguments with proper citations.</li>
          <li>Exhibit a breadth-accuracy trade-off in citation practices; no system successfully balances comprehensive coverage with precision.</li>
          <li>Systematic failures in multi-hop reasoning and sustained sequential reasoning point to fundamental architectural limitations rather than mere implementation issues.</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td>
        <b>Robust Methodology & Practical Insights:</b>
        <ul>
          <li>A three-expert pipeline ensures meticulous prompt and rubric refinement, guaranteeing high-quality data.</li>
          <li>Ablation studies provide practical recommendations for rubric design, demonstrating that concise, human-authored examples are crucial, while LLM-based augmentation catastrophically degrades alignment.</li>
          <li>Analyzes the "length-quality conflation" problem, showing that while longer responses correlate with higher scores, criterion-based scoring helps mitigate pure verbosity bias.</li>
        </ul>
      </td>
      <td>
        <b>Limitations of Existing Benchmarks:</b>
        <ul>
          <li>Many rely on static datasets, which are susceptible to data leakage and cannot adapt to dynamic information.</li>
          <li>Existing benchmarks often use non-expert or automated evaluation (e.g., coarse metrics, LLM-generated rubrics), raising concerns about circularity or anchoring bias.</li>
          <li>Some are too narrow in scope (e.g., specific academic domains) or focus on short, factual answers, failing to capture the long-form, multi-source synthesis required by DR.</li>
          <li>Generally lack the fine-grained evaluation provided by RESEARCHRUBRICS.</li>
        </ul>
      </td>
    </tr>
  </tbody>
</table>
```<br><br>