In [1]:
pip install metagpt==0.5.2

Collecting metagpt==0.5.2
  Downloading metagpt-0.5.2-py3-none-any.whl (216 kB)
Collecting aiohttp==3.8.4 (from metagpt==0.5.2)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting channels==4.0.0 (from metagpt==0.5.2)
  Downloading channels-4.0.0-py3-none-any.whl (28 kB)
Collecting faiss-cpu==1.7.4 (from metagpt==0.5.2)
  Downloading faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
Collecting lancedb==0.1.16 (from metagpt==0.5.2)
  Downloading lancedb-0.1.16-py3-none-any.whl (34 kB)
Collecting langchain==0.0.231 (from metagpt==0.

In [2]:
pip install aiocron discord

Collecting aiocron
  Downloading aiocron-1.8-py3-none-any.whl (4.8 kB)
Collecting discord
  Downloading discord-2.3.2-py3-none-any.whl (1.1 kB)
Collecting croniter (from aiocron)
  Downloading croniter-2.0.1-py2.py3-none-any.whl (19 kB)
Collecting discord.py>=2.3.2 (from discord)
  Downloading discord.py-2.3.2-py3-none-any.whl (1.1 MB)
Installing collected packages: croniter, discord.py, aiocron, discord
Successfully installed aiocron-1.8 croniter-2.0.1 discord-2.3.2 discord.py-2.3.2


In [3]:
import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def fetch_html(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def parse_main_page(html):
    title_list = []
    href_list = []
    soup = BeautifulSoup(html, 'html.parser')
    # Tag lookup updated to match the current page structure
    title_tags = soup.find_all('h3', class_='mb-1 text-lg font-semibold leading-[1.2] hover:underline peer-hover:underline md:text-2xl')
    for title_tag in title_tags:
        a_tag = title_tag.find('a')  # the <a> tag inside the title
        if a_tag:
            title = a_tag.text.strip()  # strip whitespace to get the title text
            href = a_tag['href']  # extract the href attribute
            title_list.append(title)  # add the title to the list
            href_list.append(href)  # add the link to the list
    return title_list, href_list

async def parse_sub_page(html):
    soup = BeautifulSoup(html, 'html.parser')
    abstract = soup.find('div', class_="pb-8 pr-4 md:pr-16").p.text
    arxiv_url = soup.find('a', class_="btn inline-flex h-9 items-center", href=True)['href']
    return abstract, arxiv_url

async def main():
    url = 'https://huggingface.co/papers'
    base_url = 'https://huggingface.co'
    repositories = []
    try:
        html = await fetch_html(url)
        title_list, href_list = await parse_main_page(html)

        for title, href in zip(title_list, href_list):
            repo_info = {'title': title}
            # Fetch the paper's detail page and extract the abstract and arXiv link
            sub_html = await fetch_html(base_url + href)
            abstract, arxiv_url = await parse_sub_page(sub_html)
            repo_info['abstract'] = abstract
            repo_info['arxiv_url'] = arxiv_url
            # Append each paper once, after its record is complete
            repositories.append(repo_info)
        return repositories
    except Exception as e:
        print(f"An error occurred: {e}")
        # Return whatever was collected before the failure instead of None
        return repositories




In [4]:
repositories = await main()
repositories

[{'title': 'DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference',
  'abstract': "The deployment and scaling of large language models (LLMs) have become\ncritical as they permeate various applications, demanding high-throughput and\nlow-latency serving systems. Existing frameworks struggle to balance these\nrequirements, especially for workloads with long prompts. This paper introduces\nDeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and\ngeneration composition strategy, to deliver up to 2.3x higher effective\nthroughput, 2x lower latency on average, and up to 3.7x lower (token-level)\ntail latency, compared to state-of-the-art systems like vLLM. We leverage a\nsynergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an\nefficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced\nimplementation supports a range of models and offers both non-persistent and\npersistent deployment opt
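
The loop in `main()` above awaits each paper's sub-page in turn, so the total time grows with the number of papers. One possible alternative is to fetch the sub-pages concurrently with `asyncio.gather`, capped by a semaphore. The sketch below is illustrative only: `fake_fetch` is a hypothetical stand-in for `fetch_html` so that the example runs without network access, and the limit of 3 concurrent requests is an arbitrary assumption.

```python
import asyncio

async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for fetch_html: pretends to download a page."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def fetch_all(urls, fetch=fake_fetch, limit: int = 3):
    """Fetch every URL concurrently with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # The semaphore blocks here once `limit` fetches are already running
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))

pages = asyncio.run(fetch_all([f"/papers/{i}" for i in range(5)]))
print(len(pages))
```

Inside a notebook cell, where an event loop is already running, the last call would be written as `pages = await fetch_all(...)` rather than `asyncio.run(...)`.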

## Action: CrawlHuggingfaceDailyPaper

In [6]:
from metagpt.actions.action import Action
from metagpt.config import CONFIG

class CrawlHuggingfaceDailyPaper(Action):
    """
    This class specifically targets the daily papers section of the Huggingface website.
    Its main functionality includes asynchronously fetching and parsing the latest research papers
    published on Huggingface, extracting relevant details such as titles, abstracts, and arXiv URLs.
    It can be utilized in applications where up-to-date research information from Huggingface
    is required, making it a valuable tool for researchers and developers in AI and machine learning.
    """

    async def run(self, url: str = "https://huggingface.co/papers"):
        async with aiohttp.ClientSession() as client:
            async with client.get(url, proxy=CONFIG.global_proxy) as response:
                response.raise_for_status()
                html = await response.text()

        title_list, href_list = await parse_main_page(html)

        repositories = []
        base_url = 'https://huggingface.co'

        for title, href in zip(title_list, href_list):
            repo_info = {'title': title}
            sub_html = await fetch_html(base_url + href)
            abstract, arxiv_url = await parse_sub_page(sub_html)
            repo_info['abstract'] = abstract
            repo_info['arxiv_url'] = arxiv_url

            repositories.append(repo_info)

        return repositories


In [7]:
craw_paper_action = CrawlHuggingfaceDailyPaper()
resp = await craw_paper_action.run()
resp

[{'title': 'DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference',
  'abstract': "The deployment and scaling of large language models (LLMs) have become\ncritical as they permeate various applications, demanding high-throughput and\nlow-latency serving systems. Existing frameworks struggle to balance these\nrequirements, especially for workloads with long prompts. This paper introduces\nDeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and\ngeneration composition strategy, to deliver up to 2.3x higher effective\nthroughput, 2x lower latency on average, and up to 3.7x lower (token-level)\ntail latency, compared to state-of-the-art systems like vLLM. We leverage a\nsynergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an\nefficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced\nimplementation supports a range of models and offers both non-persistent and\npersistent deployment opt

## Action: SummaryDailyPaper

Summarize each daily paper and have the LLM add five keywords.

In [57]:
from typing import Any
PAPER_SUMMARY_PROMPT = """
    Transform the given data about a research paper into a neat Markdown format. Also, identify and include five relevant keywords that best represent the core themes of the paper.
    Don't forget to include the title, abstract, and arXiv URL.
    The provided data is:
    ```
    {data}
    ```
    Please create a markdown summary and suggest five keywords related to this paper, as well as the title, abstract, and arXiv URL.
    """
class SummaryDailyPaper(Action):
    def __init__(self, data: Any):
        super().__init__(data)
        self.data = data

    async def run(
        self
    ):
        return await self._aask(PAPER_SUMMARY_PROMPT.format(data=self.data))
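
To see what the LLM actually receives, it can help to render a template by hand. The sketch below uses a shortened, hypothetical template and a made-up paper record (neither is the notebook's real data) to show that `str.format` substitutes the dictionary's `repr` for the `{data}` placeholder.

```python
# Shortened, hypothetical template; the notebook's PAPER_SUMMARY_PROMPT is longer
TEMPLATE = "Summarize this paper and suggest five keywords.\nData: {data}"

# Made-up record shaped like one item returned by the crawl action
paper = {
    "title": "Example Paper",
    "abstract": "A short abstract.",
    "arxiv_url": "https://example.org/abs/placeholder",
}

# str.format inserts str(paper), i.e. the dict's repr, in place of {data}
prompt = TEMPLATE.format(data=paper)
print(prompt)
```

Because the record is inserted as a Python `repr`, the model sees quoted keys and escaped newlines. If that proves noisy, serializing the record first (for example with `json.dumps`) is one option to consider.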

In [58]:
await SummaryDailyPaper(resp[0]).run()

# DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

**Abstract:** The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diver

"# DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference\n\n**Abstract:** The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to di

## Role: DailyPaperWatcher

This role analyzes the Hugging Face daily papers and summarizes them.

In [22]:

from typing import Dict, List
from metagpt.utils.common import OutputParser
from metagpt.roles import Role
from metagpt.schema import Message
from metagpt.logs import logger

class DailyPaperWatcher(Role):
    def __init__(
        self,
        name="Huggy",
        profile="DailyPaperWatcher",
        goal="Generate a summary of Huggingface daily papers.",
        constraints="Only analyze based on the provided Huggingface daily papers.",
    ):
        super().__init__(name, profile, goal, constraints)
        self._init_actions([CrawlHuggingfaceDailyPaper])
        self._set_react_mode(react_mode="by_order")


    async def _act(self) -> Message:
        logger.info(f"{self._setting}: ready to {self._rc.todo}")

        todo = self._rc.todo

        try:
            msg = self.get_memories(k=1)[0]
        except IndexError:
            logger.error("No messages in memory")
            return Message(content="Error: No messages in memory", role=self.profile)

        msg_content = ''
        try:
            result = await todo.run(msg.content)
            if isinstance(todo, CrawlHuggingfaceDailyPaper):
                # Create and run a SummaryDailyPaper action for each paper
                logger.info(f"Preparing to summarize {len(result)} papers")
                for paper in result:
                    summary_action = SummaryDailyPaper(paper)
                    # run() takes no arguments; the paper was passed to the constructor
                    summary_result = await summary_action.run()
                    summary_msg = Message(content=str(summary_result), role=self.profile, cause_by=type(summary_action))
                    self._rc.memory.add(summary_msg)
                    msg_content += str(summary_result)
                    msg_content += '\n'

            else:
                msg_content = str(result)
                msg = Message(content=msg_content, role=self.profile, cause_by=type(todo))
                self._rc.memory.add(msg)

        except Exception as e:
            logger.error(f"Error during action execution: {e}")
            return Message(content=f"Error: {e}", role=self.profile)

        return Message(content=str(msg_content), role=self.profile, cause_by=type(todo))



    # async def _handle_paper(self, paper_info) -> None:
    #     actions = []
    #     # Enhanced logging for debugging
    #     logger.debug(f"Handling paper with info: {paper_info}")

    #     for paper in paper_info:
    #         actions.append(SummaryDailyPaper(paper))
    #         logger.info(f"Preparing to summarize paper: {paper['title']}")

    #     self._init_actions(actions)
    #     self._rc.todo = None


In [92]:
from typing import Dict, List
from metagpt.utils.common import OutputParser
from metagpt.roles import Role
from metagpt.schema import Message
from metagpt.logs import logger
from datetime import datetime

class DailyPaperWatcher(Role):
    def __init__(
        self,
        name="Huggy",
        profile="DailyPaperWatcher",
        goal="Generate a summary of Huggingface daily papers.",
        constraints="Only analyze based on the provided Huggingface daily papers.",
    ):
        super().__init__(name, profile, goal, constraints)
        self._init_actions([CrawlHuggingfaceDailyPaper])
        self.tot_content = ""

    async def _act(self) -> Message:
        logger.info(f"{self._setting}: ready to {self._rc.todo}")

        todo = self._rc.todo
        if type(todo) is CrawlHuggingfaceDailyPaper:
            msg = self._rc.memory.get(k=1)[0]

            resp = await todo.run()
            logger.info(resp)
            return await self._handle_paper(resp)

        resp = await todo.run()
        logger.info(resp)

        if self.tot_content != "":
            self.tot_content += "\n\n\n"
        self.tot_content += resp
        return Message(content=resp, role=self.profile)


    async def _think(self) -> None:
        """Determine the next action to be taken by the role."""
        if self._rc.todo is None:
            self._set_state(0)
            return

        if self._rc.state + 1 < len(self._states):
            self._set_state(self._rc.state + 1)
        else:
            self._rc.todo = None

    async def _react(self) -> Message:
        """Execute the assistant's think and actions."""
        while True:
            await self._think()
            if self._rc.todo is None:
                break
            msg = await self._act()

        # return msg
        return Message(content=self.tot_content, role=self.profile)

    async def _handle_paper(self, paper_info) -> Message:
        actions = []
        # Enhanced logging for debugging
        logger.debug(f"Handling paper with info: {paper_info}")
        self.tot_content += f"# Huggingface Daily Paper: {datetime.now().strftime('%Y-%m-%d')}"

        for paper in paper_info:
            # print(paper)
            actions.append(SummaryDailyPaper(paper))
            # logger.info(f"Preparing to summarize paper: {paper['title']}")

        self._init_actions(actions)
        self._rc.todo = None
        return Message(content="init", role=self.profile)
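
The control flow of this role is easier to follow in isolation. The toy model below is plain Python, not MetaGPT code, and every name in it is hypothetical. It mirrors the `_think` / `_act` / `_react` logic above: the crawl step replaces the action list with one summary action per paper and resets `todo`, so the next `think` call restarts from state 0 on the new list.

```python
class MiniRole:
    """Toy model of the think/act loop; a simplification, not the MetaGPT Role."""

    def __init__(self, papers):
        self.papers = papers
        self.actions = ["crawl"]  # initial action list
        self.state = -1
        self.todo = None
        self.trace = []

    def think(self):
        if self.todo is None:
            # Fresh start, or the action list was just replaced
            self.state, self.todo = 0, self.actions[0]
        elif self.state + 1 < len(self.actions):
            self.state += 1
            self.todo = self.actions[self.state]
        else:
            self.todo = None  # nothing left to do

    def act(self):
        self.trace.append(self.todo)
        if self.todo == "crawl":
            # Swap in one summary action per paper, then reset todo
            self.actions = [f"summarize:{p}" for p in self.papers]
            self.todo = None

    def react(self):
        while True:
            self.think()
            if self.todo is None:
                break
            self.act()
        return self.trace


trace = MiniRole(["paper-a", "paper-b"]).react()
print(trace)
```

The trace shows the crawl step first, followed by one summary step per paper in list order. The loop ends when `think` runs out of actions and leaves `todo` as `None`.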



In [93]:
async def main():

    role = DailyPaperWatcher()
    result = await role.run("https://huggingface.co/papers")
    logger.info(result)
    return result

result = await main()

2024-01-18 06:27:55.727 | INFO     | __main__:_act:21 - Huggy(DailyPaperWatcher): ready to CrawlHuggingfaceDailyPaper
2024-01-18 06:27:59.120 | INFO     | __main__:_act:28 - [{'title': 'Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model', 'abstract': 'Recently the state space models (SSMs) with efficient hardware-aware designs,\ni.e., Mamba, have shown great potential for long sequence modeling. Building\nefficient and generic vision backbones purely upon SSMs is an appealing\ndirection. However, representing visual data is challenging for SSMs due to the\nposition-sensitivity of visual data and the requirement of global context for\nvisual understanding. In this paper, we show that the reliance of visual\nrepresentation learning on self-attention is not necessary and propose a new\ngeneric vision backbone with bidirectional Mamba blocks (Vim), which marks the\nimage sequences with position embeddings and compresses the visual\nrepresentation wi

# Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

**Abstract:** Recently, state space models (SSMs) with efficient hardware-aware designs, specifically Mamba, have shown great potential for long sequence modeling. However, representing visual data is challenging for SSMs due to the position-sensitivity and the requirement of global context. In this paper, the authors propose a new generic vision backbone called Vim, which utilizes bidirectional Mamba blocks to compress visual representation. Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating improved computation and memory efficiency. For example, Vim is 2.8 times faster than DeiT and saves 86.8% GPU memory when performing batch inference on high-resolution images. The results show that Vim overcomes computation and memory constraints, making it a promising backbone for vision foundation models.

**Keywords:** Vision Mamba, St

2024-01-18 06:28:02.900 | INFO     | __main__:_act:32 - # Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

**Abstract:** Recently, state space models (SSMs) with efficient hardware-aware designs, specifically Mamba, have shown great potential for long sequence modeling. However, representing visual data is challenging for SSMs due to the position-sensitivity and the requirement of global context. In this paper, the authors propose a new generic vision backbone called Vim, which utilizes bidirectional Mamba blocks to compress visual representation. Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating improved computation and memory efficiency. For example, Vim is 2.8 times faster than DeiT and saves 86.8% GPU memory when performing batch inference on high-resolution images. The results show that Vim overcomes computation and memory constraints, making it a promising backbone for v

 Efficiency

[arXiv Link](https://arxiv.org/abs/2401.09417)
# DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

**Abstract:** The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-per

2024-01-18 06:28:06.882 | INFO     | __main__:_act:32 - # DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

**Abstract:** The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persist


# UniVG: Towards UNIfied-modal Video Generation

**Abstract:** Diffusion based video generation has received extensive attention and achieved considerable success within both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input images and text conditions in a flexible manner, either individually or in combination. To address this, we propose a Unified-modal Video Generation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation categories. For high-freedom video generation, we employ M

2024-01-18 06:28:12.248 | INFO     | __main__:_act:32 - # UniVG: Towards UNIfied-modal Video Generation

**Abstract:** Diffusion based video generation has received extensive attention and achieved considerable success within both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input images and text conditions in a flexible manner, either individually or in combination. To address this, we propose a Unified-modal Video Generation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation cat

 cross attention, biased Gaussian noise

**arXiv URL:** [https://arxiv.org/abs/2401.09084](https://arxiv.org/abs/2401.09084)
# Asynchronous Local-SGD Training for Language Modeling

**Abstract:** Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of *asynchronous* Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation by examining how worker hardware heterogeneity, model size, number of workers, and optimizer could impact the learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the glob

2024-01-18 06:28:15.643 | INFO     | __main__:_act:32 - # Asynchronous Local-SGD Training for Language Modeling

**Abstract:** Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of *asynchronous* Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation by examining how worker hardware heterogeneity, model size, number of workers, and optimizer could impact the learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We 

 Local-SGD, language modeling, worker hardware heterogeneity, delayed Nesterov momentum update

**arXiv URL:** [https://arxiv.org/abs/2401.09135](https://arxiv.org/abs/2401.09135)
# ReFT: Reasoning with Reinforced Fine-Tuning

**Abstract:** One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. This approach does not show sufficiently strong generalization ability, however, because the training only relies on the given CoT data. In math problem-solving, for example, there is usually only one annotated reasoning path for each question in the training data. Intuitively, it would be better for the algorithm to learn from multiple annotated reasoning paths given a question. To address this issue, we propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of learning LLMs for reasoning, with math problem-solving as an example. ReFT fi

2024-01-18 06:28:20.540 | INFO     | __main__:_act:32 - # ReFT: Reasoning with Reinforced Fine-Tuning

**Abstract:** One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. This approach does not show sufficiently strong generalization ability, however, because the training only relies on the given CoT data. In math problem-solving, for example, there is usually only one annotated reasoning path for each question in the training data. Intuitively, it would be better for the algorithm to learn from multiple annotated reasoning paths given a question. To address this issue, we propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of learning LLMs for reasoning, with math problem-solving as an example. ReFT first warmups the model with SFT, and then employs on-line reinforcement learning, specifically the PPO algorithm in this pape

, Math problem-solving

[arXiv Link](https://arxiv.org/abs/2401.08967)
# VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

**Abstract:**
Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal mod

2024-01-18 06:28:24.428 | INFO     | __main__:_act:32 - # VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

**Abstract:**
Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video m

 generation, video models, high-quality videos, low-quality videos, synthesized high-quality images

[arXiv link](https://arxiv.org/abs/2401.09047)
# SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

**Abstract:** We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: using discrete vs. continuous time learning, deciding the objective for the model to learn, choosing the interpolant connecting the distributions, and deploying a deterministic or stochastic sampler. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 benchmark using the exact s

2024-01-18 06:28:28.116 | INFO     | __main__:_act:32 - # SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

**Abstract:** We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: using discrete vs. continuous time learning, deciding the objective for the model to learn, choosing the interpolant connecting the distributions, and deploying a deterministic or stochastic sampler. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 benchmark using the exact same backbone, number of parameters, and GFLOPs. By exploring various diffusion coefficients,

, Scalable Interpolant Transformers, modular study, dynamical transport

**arXiv URL:** [https://arxiv.org/abs/2401.08740](https://arxiv.org/abs/2401.08740)
# GARField: Group Anything with Radiance Fields

**Abstract:** Grouping is inherently ambiguous due to the multiple levels of granularity in which one can decompose a scene -- should the wheels of an excavator be considered separate or part of the whole? We present Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field, a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy, using scale to consistently fuse conflicting masks from different viewpoints. From this fi

2024-01-18 06:28:33.693 | INFO     | __main__:_act:32 - # GARField: Group Anything with Radiance Fields

**Abstract:** Grouping is inherently ambiguous due to the multiple levels of granularity in which one can decompose a scene -- should the wheels of an excavator be considered separate or part of the whole? We present Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field, a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy, using scale to consistently fuse conflicting masks from different viewpoints. From this field we can derive a hierarchy of possible groupings via automatic tree construction or user interacti

. See the project website at [https://www.garfield.studio/](https://www.garfield.studio/)

**Keywords:** Grouping, Radiance Fields, 3D scenes, Hierarchy, Semantically meaningful groups
# SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

**Abstract:** 
3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of s

2024-01-18 06:28:40.199 | INFO     | __main__:_act:32 - # SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

**Abstract:** 
3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-langua

, embodied agents, 3D scenes, grounded learning, vision-language dataset. 

**arXiv URL:** [https://arxiv.org/abs/2401.09340](https://arxiv.org/abs/2401.09340)
# Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

**Abstract:** Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolut

2024-01-18 06:28:45.178 | INFO     | __main__:_act:32 - # Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

**Abstract:** Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidan

 semantics

**ArXiv URL:** [https://arxiv.org/abs/2401.09048](https://arxiv.org/abs/2401.09048)
# ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization

**Abstract:** Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View Synthesis (NVS) given a set of 2D images. However, NeRF training requires accurate camera pose for each input view, typically obtained by Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax this constraint, but they still often rely on decent initial poses which they can refine. Here we aim at removing the requirement for pose initialization. We present Incremental CONfidence (ICON), an optimization procedure for training NeRFs from 2D video frames. ICON only assumes smooth camera motion to estimate initial guess for poses. Further, ICON introduces "confidence": an adaptive measure of model quality used to dynamically reweight gradients. ICON relies on high-confidence poses to learn NeRF, and high-conf

2024-01-18 06:28:49.409 | INFO     | __main__:_act:32 - # ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization

**Abstract:** Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View Synthesis (NVS) given a set of 2D images. However, NeRF training requires accurate camera pose for each input view, typically obtained by Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax this constraint, but they still often rely on decent initial poses which they can refine. Here we aim at removing the requirement for pose initialization. We present Incremental CONfidence (ICON), an optimization procedure for training NeRFs from 2D video frames. ICON only assumes smooth camera motion to estimate initial guess for poses. Further, ICON introduces "confidence": an adaptive measure of model quality used to dynamically reweight gradients. ICON relies on high-confidence poses to learn NeRF, and high-confidence 3D structure (as encoded by NeRF)

 Frames

[arXiv Link](https://arxiv.org/abs/2401.08937)
# TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion

**Abstract:** We present TextureDreamer, a novel image-guided texture synthesis method to transfer relightable textures from a small number of input images (3 to 5) to target 3D shapes across arbitrary categories. Texture creation is a pivotal challenge in vision and graphics. Industrial companies hire experienced artists to manually craft textures for 3D assets. Classical methods require densely sampled views and accurately aligned geometry, while learning-based methods are confined to category-specific shapes within the dataset. In contrast, TextureDreamer can transfer highly detailed, intricate textures from real-world environments to arbitrary objects with only a few casually captured images, potentially significantly democratizing texture creation. Our core idea, personalized geometry-aware score distillation (PGSD), draws inspiration from rece

2024-01-18 06:28:53.407 | INFO     | __main__:_act:32 - # TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion

**Abstract:** We present TextureDreamer, a novel image-guided texture synthesis method to transfer relightable textures from a small number of input images (3 to 5) to target 3D shapes across arbitrary categories. Texture creation is a pivotal challenge in vision and graphics. Industrial companies hire experienced artists to manually craft textures for 3D assets. Classical methods require densely sampled views and accurately aligned geometry, while learning-based methods are confined to category-specific shapes within the dataset. In contrast, TextureDreamer can transfer highly detailed, intricate textures from real-world environments to arbitrary objects with only a few casually captured images, potentially significantly democratizing texture creation. Our core idea, personalized geometry-aware score distillation (PGSD), draws inspiration from rece

-guided, Geometry-aware diffusion, Texture creation, Relightable textures.

[arXiv link](https://arxiv.org/abs/2401.09416)


In [70]:
current_time = datetime.now()
current_time

datetime.datetime(2024, 1, 18, 6, 10, 37, 736240)

## Trigger

In [83]:
import time
from aiocron import crontab
from typing import Optional
from pytz import BaseTzInfo
from pydantic import BaseModel, Field
from metagpt.schema import Message

class DailyPaperInfo(BaseModel):
    url: str
    timestamp: float = Field(default_factory=time.time)



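class HuggingfaceDailyPaperCronTrigger:
    """Async iterator that yields a Message each time the cron schedule fires."""

    def __init__(self, spec: str, tz: Optional[BaseTzInfo] = None, url: str = "https://huggingface.co/papers") -> None:
        # spec is a cron expression; tz optionally sets the schedule's time zone.
        self.crontab = crontab(spec, tz=tz)
        self.url = url

    def __aiter__(self):
        return self

    async def __anext__(self):
        # Wait for the next scheduled time, then emit the URL to crawl.
        await self.crontab.next()
        return Message(self.url, DailyPaperInfo(url=self.url))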
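
The trigger above relies on Python's async iterator protocol (`__aiter__` / `__anext__`), which is what lets a consumer pull events with `async for`. The sketch below illustrates that protocol on its own, with no MetaGPT or aiocron dependency. `TickInfo`, `IntervalTrigger`, and `collect` are hypothetical names introduced only for illustration, and a short `asyncio.sleep` stands in for waiting on the cron schedule.

```python
import asyncio
import time
from dataclasses import dataclass, field


@dataclass
class TickInfo:
    # Hypothetical stand-in for DailyPaperInfo; not part of MetaGPT.
    url: str
    timestamp: float = field(default_factory=time.time)


class IntervalTrigger:
    """Hypothetical trigger that fires every `interval` seconds instead of on a cron schedule."""

    def __init__(self, url: str, interval: float = 0.01) -> None:
        self.url = url
        self.interval = interval

    def __aiter__(self):
        return self

    async def __anext__(self) -> TickInfo:
        # Stands in for `await self.crontab.next()` in the cron-based trigger.
        await asyncio.sleep(self.interval)
        return TickInfo(url=self.url)


async def collect(trigger: IntervalTrigger, limit: int) -> list:
    """Consume the trigger with `async for`, stopping after `limit` events."""
    events = []
    async for event in trigger:
        events.append(event)
        if len(events) >= limit:
            break
    return events
```

In a notebook, `await collect(IntervalTrigger("https://huggingface.co/papers"), 3)` would return three events. Because `__anext__` never raises `StopAsyncIteration`, the iterator is unbounded and the consumer decides when to stop.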


## Callback

In [84]:
### Discord
from google.colab import userdata

TOKEN = userdata.get('DISCORD_TOKEN')
CHANNEL_ID = userdata.get('DISCORD_CHANNEL_ID')

In [85]:
# callback
import discord

async def discord_callback(msg: Message):
    """Post msg.content to the Discord channel, one message per markdown section."""
    intents = discord.Intents.default()
    intents.message_content = True
    intents.members = True

    client = discord.Client(intents=intents)
    token = TOKEN
    channel_id = int(CHANNEL_ID)

    async with client:
        await client.login(token)
        channel = await client.fetch_channel(channel_id)
        lines = []
        # Start a new Discord message whenever a markdown heading begins.
        for i in msg.content.splitlines():
            if i.startswith(("# ", "## ", "### ")):
                if lines:
                    await channel.send("\n".join(lines))
                    lines = []
            lines.append(i)

        # Send whatever remains after the last heading.
        if lines:
            await channel.send("\n".join(lines))
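
The heading-based splitting in the callback can be separated from the network calls so it is easy to check on its own. The sketch below is one possible variation, not the notebook's code: `split_by_heading` is a hypothetical helper, and the extra hard split at `limit` characters is an added assumption meant to keep each chunk within Discord's 2000-character message limit.

```python
from typing import List

HEADING_PREFIXES = ("# ", "## ", "### ")


def split_by_heading(content: str, limit: int = 2000) -> List[str]:
    """Split markdown text into chunks that start at a heading and fit within `limit` characters."""
    chunks: List[str] = []
    lines: List[str] = []

    def flush() -> None:
        if not lines:
            return
        text = "\n".join(lines)
        # Hard-split an oversized section so every piece fits within the limit.
        for start in range(0, len(text), limit):
            chunks.append(text[start:start + limit])
        lines.clear()

    for line in content.splitlines():
        # A heading closes the previous section before starting a new one.
        if line.startswith(HEADING_PREFIXES):
            flush()
        lines.append(line)
    flush()
    return chunks
```

With a helper like this, the callback body would reduce to looping over `split_by_heading(msg.content)` and awaiting `channel.send(chunk)` for each chunk.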

## Main

In [86]:
from metagpt.subscription import SubscriptionRunner
# Entry point
async def main(spec: str = "54 16 * * *", discord: bool = True, wxpusher: bool = False):
    callbacks = []
    if discord:
        callbacks.append(discord_callback)

    if wxpusher:
        callbacks.append(wxpusher_callback)

    if not callbacks:
        async def _print(msg: Message):
            print(msg.content)
        callbacks.append(_print)

    async def callback(msg):
        await asyncio.gather(*(call(msg) for call in callbacks))

    runner = SubscriptionRunner()
    await runner.subscribe(DailyPaperWatcher(), HuggingfaceDailyPaperCronTrigger(spec), callback)
    await runner.run()
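
The inner `callback` in `main` fans one message out to several async callbacks with `asyncio.gather`. The sketch below shows that pattern in isolation, with plain strings instead of `Message` objects. `fan_out` is a hypothetical helper, and passing `return_exceptions=True` is a deliberate variation (the cell above does not use it) so that one failing callback does not prevent the others from finishing.

```python
import asyncio
from typing import Awaitable, Callable, List

Callback = Callable[[str], Awaitable[None]]


def fan_out(callbacks: List[Callback]) -> Callback:
    """Combine several async callbacks into one that runs them concurrently."""

    async def combined(msg: str) -> None:
        # return_exceptions=True collects failures instead of raising the first one.
        results = await asyncio.gather(
            *(cb(msg) for cb in callbacks), return_exceptions=True
        )
        for cb, result in zip(callbacks, results):
            if isinstance(result, Exception):
                print(f"{cb.__name__} failed: {result!r}")

    return combined
```

Without `return_exceptions=True`, the first exception propagates to the caller of `gather`, which may be preferable when a delivery failure should stop the run.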

In [87]:
from pytz import timezone
from datetime import datetime, timedelta

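# Build a cron expression for one minute from now so the trigger fires almost immediately.
current_time = datetime.now()
target_time = current_time + timedelta(minutes=1)
# Fields: minute hour day-of-month month day-of-week (%w: 0 = Sunday, as in cron)
cron_expression = target_time.strftime('%M %H %d %m %w')
print(cron_expression)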

18 06 18 01 4
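
The same one-minute-ahead computation can be wrapped in a small function that takes the reference time as an argument, which makes the result reproducible. This is a minimal sketch; `cron_in_one_minute` is a hypothetical name, not something defined in the notebook.

```python
from datetime import datetime, timedelta


def cron_in_one_minute(now: datetime) -> str:
    """Return a five-field cron expression matching the minute after `now`."""
    target = now + timedelta(minutes=1)
    # minute hour day-of-month month day-of-week; %w is 0 for Sunday, matching cron.
    return target.strftime('%M %H %d %m %w')


# A reference time inside the minute before the printed output above.
print(cron_in_one_minute(datetime(2024, 1, 18, 6, 17, 30)))
```

Because every field is pinned, an expression built this way matches only on that date and weekday rather than daily, unlike the default `"54 16 * * *"` in `main`.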


In [88]:
await main(cron_expression)

2024-01-18 06:18:00.002 | INFO     | __main__:_act:21 - Huggy(DailyPaperWatcher): ready to CrawlHuggingfaceDailyPaper
2024-01-18 06:18:02.998 | INFO     | __main__:_act:28 - [{'title': 'Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model', 'abstract': 'Recently the state space models (SSMs) with efficient hardware-aware designs,\ni.e., Mamba, have shown great potential for long sequence modeling. Building\nefficient and generic vision backbones purely upon SSMs is an appealing\ndirection. However, representing visual data is challenging for SSMs due to the\nposition-sensitivity of visual data and the requirement of global context for\nvisual understanding. In this paper, we show that the reliance of visual\nrepresentation learning on self-attention is not necessary and propose a new\ngeneric vision backbone with bidirectional Mamba blocks (Vim), which marks the\nimage sequences with position embeddings and compresses the visual\nrepresentation wi

# Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

**Abstract:** Recently, state space models (SSMs) with efficient hardware-aware designs, known as Mamba, have shown great potential for long sequence modeling. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, the authors propose a new generic vision backbone called Vim, which combines bidirectional Mamba blocks with position embeddings to compress visual representations. Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation and memory efficiency. For example, Vim is 2.8 times faster than DeiT and saves 86.8% GPU memory when performing batch inference on high-resolution images. The results show that Vim overcomes computation and memory constraints and has the poten

2024-01-18 06:18:06.115 | INFO     | __main__:_act:32 - # Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

**Abstract:** Recently, state space models (SSMs) with efficient hardware-aware designs, known as Mamba, have shown great potential for long sequence modeling. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, the authors propose a new generic vision backbone called Vim, which combines bidirectional Mamba blocks with position embeddings to compress visual representations. Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation and memory efficiency. For example, Vim is 2.8 times faster than DeiT and saves 86.8% GPU memory when performing batch inference on high-resolution images. The results show that Vim overco


# Huggingface Daily Paper: 2024-01-18 <class 'str'>
# DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

**Abstract:** The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent

2024-01-18 06:18:10.764 | INFO     | __main__:_act:32 - # DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

**Abstract:** The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persist

 and contribution.

**Keywords:** Language models, Text generation, High-throughput, Low-latency, DeepSpeed-FastGen

[arXiv Link](https://arxiv.org/abs/2401.08671)
# Huggingface Daily Paper: 2024-01-18


# Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

**Abstract:** Recently, state space models (SSMs) with efficient hardware-aware designs, known as Mamba, have shown great potential for long sequence modeling. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, the authors propose a new generic vision backbone called Vim, which combines bidirectional Mamba blocks with position embeddings to compress visual representations. Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation and memory efficiency. For example, Vim 

2024-01-18 06:18:16.115 | INFO     | __main__:_act:32 - # UniVG: Towards UNIfied-modal Video Generation

**Abstract:** Diffusion based video generation has received extensive attention and achieved considerable success within both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input images and text conditions in a flexible manner, either individually or in combination. To address this, we propose a Unified-modal Video Generation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation cat

 Gaussian noise

**arXiv URL:** [https://arxiv.org/abs/2401.09084](https://arxiv.org/abs/2401.09084)

2024-01-18 06:18:19.757 | INFO     | __main__:_act:32 - # Asynchronous Local-SGD Training for Language Modeling

**Abstract:** Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of *asynchronous* Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation by examining how worker hardware heterogeneity, model size, number of workers, and optimizer could impact the learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We 

, language modeling, distributed optimization, Nesterov momentum

[arXiv Link](https://arxiv.org/abs/2401.09135)

2024-01-18 06:18:23.749 | INFO     | __main__:_act:32 - # ReFT: Reasoning with Reinforced Fine-Tuning

**Abstract:** One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. This approach does not show sufficiently strong generalization ability, however, because the training only relies on the given CoT data. In math problem-solving, for example, there is usually only one annotated reasoning path for each question in the training data. Intuitively, it would be better for the algorithm to learn from multiple annotated reasoning paths given a question. To address this issue, we propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of learning LLMs for reasoning, with math problem-solving as an example. ReFT first warmups the model with SFT, and then employs on-line reinforcement learning, specifically the PPO algorithm in this pape

, Math problem-solving

[arXiv Link](https://arxiv.org/abs/2401.08967)

2024-01-18 06:18:27.166 | INFO     | __main__:_act:32 - # VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

**Abstract:** Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video m

 generation, video diffusion models, high-quality videos, low-quality videos, synthesized high-quality images

[arXiv link](https://arxiv.org/abs/2401.09047)

2024-01-18 06:18:31.083 | INFO     | __main__:_act:32 - # SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

**Abstract:** 
In this research paper, the authors introduce Scalable Interpolant Transformers (SiT), a family of generative models that are built on the foundation of Diffusion Transformers (DiT). SiT utilizes the interpolant framework, which provides a more flexible way of connecting two distributions compared to standard diffusion models. This framework allows for a modular study of different design choices that impact generative models built on dynamical transport. These design choices include discrete vs. continuous time learning, the objective for the model to learn, the choice of interpolant connecting the distributions, and the deployment of a deterministic or stochastic sampler. By carefully incorporating these elements, SiT outperforms DiT consistently across various model sizes on the conditional ImageNet 256x256 benchmar

usion Transformers, Interpolant framework, Dynamical transport, Conditional ImageNet

[arXiv Link](https://arxiv.org/abs/2401.08740)

2024-01-18 06:18:36.790 | INFO     | __main__:_act:32 - # GARField: Group Anything with Radiance Fields

**Abstract:** Grouping is inherently ambiguous due to the multiple levels of granularity in which one can decompose a scene -- should the wheels of an excavator be considered separate or part of the whole? We present Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field, a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy, using scale to consistently fuse conflicting masks from different viewpoints. From this field we can derive a hierarchy of possible groupings via automatic tree construction or user interacti

 meaningful groups, Scale-conditioned 3D affinity feature field.

2024-01-18 06:18:39.443 | INFO     | __main__:_act:32 - # SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

**Abstract:** 
The paper focuses on the challenges of aligning language with the 3D physical environment, known as 3D vision-language grounding, in the development of embodied agents. It addresses the complexity of 3D scenes, the scarcity of paired 3D vision-language data, and the absence of a unified learning framework. The authors introduce SceneVerse, a million-scale 3D vision-language dataset, and propose a unified pre-training framework called Grounded Pre-training for Scenes (GPS). Through extensive experiments, they demonstrate the effectiveness of GPS in achieving state-of-the-art performance on existing 3D visual grounding benchmarks. The paper also presents zero-shot transfer experiments to showcase the potential of SceneVerse and GPS in challenging 3D vision-language tasks.

**Keywords:** 3D vision-language grounding, embodied agents, 3D

, embodied agents, 3D scenes, pre-training framework, visual grounding

**arXiv URL:** [https://arxiv.org/abs/2401.09340](https://arxiv.org/abs/2401.09340)

2024-01-18 06:18:47.712 | INFO     | __main__:_act:32 - # Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

**Abstract:** Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three-dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidan

ations, and Global Stylistic Semantics.

For more details, you can refer to the [arXiv link](https://arxiv.org/abs/2401.09048).

2024-01-18 06:18:51.319 | INFO     | __main__:_act:32 - # ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization

**Abstract:** Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View Synthesis (NVS) given a set of 2D images. However, NeRF training requires accurate camera pose for each input view, typically obtained by Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax this constraint, but they still often rely on decent initial poses which they can refine. Here we aim at removing the requirement for pose initialization. We present Incremental CONfidence (ICON), an optimization procedure for training NeRFs from 2D video frames. ICON only assumes smooth camera motion to estimate initial guess for poses. Further, ICON introduces "confidence": an adaptive measure of model quality used to dynamically reweight gradients. ICON relies on high-confidence poses to learn NeRF, and high-confidence 3D structure (as encoded by NeRF)

-from-Motion

[arXiv Link](https://arxiv.org/abs/2401.08937)

2024-01-18 06:18:57.269 | INFO     | __main__:_act:32 - # TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion

**Abstract:** 
The paper presents TextureDreamer, a novel image-guided texture synthesis method that can transfer relightable textures from a small number of input images (3 to 5) to target 3D shapes across arbitrary categories. Texture creation is a challenging task in vision and graphics, often requiring experienced artists to manually craft textures for 3D assets. Classical methods rely on densely sampled views and accurately aligned geometry, while learning-based methods are limited to category-specific shapes within the dataset. In contrast, TextureDreamer can transfer highly detailed, intricate textures from real-world environments to arbitrary objects with just a few casually captured images, potentially democratizing texture creation. The core idea behind TextureDreamer is personalized geometry-aware score distillation (PGSD), which combines

 synthesis, Image-guided, Geometry-aware, Relightable textures, Personalized modeling.

[arXiv Link](https://arxiv.org/abs/2401.09416)

CancelledError: 