ML news of the week

A collection of the best ML news every week (research, news, resources). Star this repository if you find it useful.

Here, you can find articles and tutorials about artificial intelligence.

For each week you will find different sections:

  • Research: the most important published research of the week.
  • News: the most important news related to companies, institutions, and much more.
  • Resources: released resources for artificial intelligence and machine learning.
  • Perspectives: a collection of deep and informative articles about open questions in artificial intelligence.

and a meme to start the week off well.

Suggestions and corrections

Feel free to open an issue if you find errors or if you have suggestions, topics, or any other comments.

Index

2024

2023

Back to index

2024

ML news: Week 13 - 19 May

Research

Link description
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models. Language models rely on separately trained tokenizers, which can produce tokens that never appear during language model training; even the most capable contemporary models contain many such under-trained tokens. This study investigates the phenomenon and offers methods for detecting and handling these tokens.
Unlearning in Recommender Systems. A novel technique called E2URec lets large language model-based recommendation systems forget user data effectively and efficiently while maintaining privacy and speed.
Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers. A project called Lumina seeks to provide a single text-to-X generation mechanism. Its training process involves interleaving text, video, audio, and pictures, which enhances downstream performance.
MatterSim: A Deep Learning Atomistic Model Across Elements, Temperatures, and Pressures. In AI, simulators can be very effective tools for gathering training data or facilitating interactions between models. A wide range of elemental atomic interactions can be modeled with this simulator.
SGTR+: End-to-end Scene Graph Generation with Transformer. A new, more effective technique for producing scene graphs has been discovered by researchers. Their transformer-based approach aims to enhance the model's comprehension and interconnection of many parts in a picture, resulting in enhanced performance on complex tasks.
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model. A vision-language model called InternLM-XComposer2 is very good at producing and comprehending intricate text-image content. It surpasses current approaches in multimodal content production and interpretation by introducing a Partial LoRA technique for balanced vision and text comprehension.
MambaOut: Do We Really Need Mamba for Vision? The Mamba architecture is typically employed for tasks with long-sequence and autoregressive characteristics. Researchers examined this design for vision and found that, while Mamba is not effective for image classification, it shows promise in detection and segmentation tasks, which do have those long-sequence characteristics.
State-Free Inference of State-Space Models: The Transfer Function Approach. For deep learning, a new state-space model with a dual transfer function representation has been created. A state-free sequence parallel inference approach is one of its features.
Learning A Spiking Neural Network for Efficient Image Deraining. A Spiking Neural Network (SNN) called ESDNet is designed for image deraining. It increases spike signal strength by exploiting the particular characteristics of rain pixel values.
Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning. Making 3D models is difficult. This approach lets users supply a coarse mesh to guide the generation process, giving more precise control and higher-quality model output.
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding. Particularly for Chinese and English, the recently created Hunyuan-DiT establishes a standard for text-to-image diffusion transformers. It has a sophisticated data pipeline and transformer structures for ongoing model enhancement.
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance. A method to improve the quality of images produced by diffusion models without extra training or external modules is called Perturbed-Attention Guidance (PAG). PAG leads to a significant improvement in the structure and fidelity of both unconditional and conditional samples by innovative manipulation of the self-attention mechanisms within the model.
SqueezeTime. SqueezeTime is a new lightweight network for mobile video understanding that enhances temporal analysis by condensing the time axis of videos into the channel dimension.

News

Link description
OpenAI confirms May 13 event for ‘some ChatGPT and GPT-4 updates’. Following a report that the company plans to launch a Google Search competitor next week, OpenAI has just confirmed a May 13 event for new “ChatGPT and GPT-4” updates.
Bye-bye bots: Altera’s game-playing AI agents get backing from Eric Schmidt. Autonomous, AI-based players are coming to a gaming experience near you, and a new startup, Altera, is joining the fray to build this new guard of AI agents.
BLIP3. Salesforce has trained and released the third non-commercial version of the popular BLIP family of vision-and-language models, mainly used for image understanding and captioning.
Asterisk/Zvi on California's AI Bill. California's SB 1047 bill proposes regulations on AI models trained with more than 10^26 FLOPs of compute. By demanding secure environments, quick deactivation capabilities, and thorough misuse testing, it focuses on ensuring these models are used safely. The bill aims to address worries about AI's possible impact on society by balancing innovation with safeguards against exploitation, and it targets only high-risk scenarios.
Bedrock Studio is Amazon’s attempt to simplify generative AI app development. Amazon is launching a new tool, Bedrock Studio, designed to let organizations experiment with generative AI models, collaborate on those models, and ultimately build generative AI-powered apps.
New GPT-4o AI model is faster and free for all users, OpenAI announces. Tech company reveals new flagship model that ‘is the future of interaction between ourselves and the machines’
Introducing GPT-4o and more tools to ChatGPT free users. Today we are introducing our newest model, GPT-4o, and will be rolling out more intelligence and advanced tools to ChatGPT for free.
Open sourcing IBM’s Granite code models. IBM is releasing its Granite code models, which cover a range of programming tasks and span 3 to 34 billion parameters, to the open-source community in order to make coding across several platforms easier and more efficient.
Bloomberg: Apple finalizing a deal with OpenAI to bring ChatGPT features to iOS 18. Apple is finalizing an agreement with OpenAI to bring some of its technology to the iPhone this year, according to a new report from Bloomberg. With this deal, the report explains that Apple will be able to offer “a popular chatbot” powered by ChatGPT as part of its AI-focused features in iOS 18.
OpenAI says it can now identify images generated by OpenAI — mostly. The company said its new tool correctly identified 98% of images generated by DALL-E 3
Microsoft is ‘turning everyone into a prompt engineer’ with new Copilot AI features. Copilot for Microsoft 365 is getting auto-complete, rewrite, and more to improve AI prompts.
Gemini breaks new ground with a faster model, longer context, AI agents, and more. At I/O 2024, Google unveiled a slew of new features, including Imagen 3, Veo video generation, Gemini Flash, and Project Astra, its newest assistant. Among the many noteworthy enhancements are the 2-million-token context length, significantly reduced model costs, and enhanced multimodality.
Anthropic is expanding to Europe and raising more money. Anthropic said Monday that Claude, its AI assistant, is now live in Europe with support for “multiple languages,” including French, German, Italian, and Spanish across Claude.ai, its iOS app, and its business plan for teams.
Elon Musk's xAI nears $10 bln deal to rent Oracle's AI servers, The Information reports. Elon Musk's artificial intelligence startup xAI has been talking to Oracle (ORCL.N) executives about spending $10 billion to rent cloud servers from the company over a period of years, The Information reported on Tuesday, citing a person involved in the talks.
OpenAI co-founder who had a key role in the attempted firing of Sam Altman departs. Ilya Sutskever helped orchestrate dramatic firing and rehiring of ChatGPT maker’s CEO last year
Google rolls out AI-generated, summarized search results in US. Tech giant also reveals AI assistant in progress, currently called Project Astra, and AI video generator Veo at annual I/O conference
OpenAI chief scientist Ilya Sutskever is officially leaving. Ilya Sutskever, OpenAI’s co-founder and chief scientist who helped lead the infamous failed coup against Sam Altman and then later changed his mind, is officially leaving the company.
Project IDX, Google’s next-gen IDE, is now in open beta. At its annual Google I/O 2024 developer conference on Tuesday, Google announced that Project IDX, the company’s next-gen, AI-centric browser-based development environment, is now in open beta. The company first launched it as an invite-only service gated by a waitlist in August.
Researchers build AI-driven sarcasm detector. Being able to detect the lowest form of wit could help AI interact with people more naturally, say scientists
Hugging Face is sharing $10 million worth of computing to help beat the big AI companies. ZeroGPU gives everyone the chance to create AI apps without the burden of GPU costs.
OpenAI partners with Reddit to integrate unique user-generated content into ChatGPT. Reddit, the widely popular social news aggregation and discussion platform, and OpenAI, the renowned AI research laboratory, have announced a strategic partnership that promises to revolutionize the way users interact with online communities and experience AI-powered features.
Meta is reportedly working on camera-equipped AI earphones. The company believes earphones are the future of AI-wearable technology.
Cursor's instant full file edits with speculative editing. Using a bespoke Llama 3 70B model with a speculative prior, the researchers were able to rewrite files almost instantly at a rate of 1,000 tokens per second. They achieved this with some creative output formatting and no diffs (a conceptual sketch appears at the end of this News section).
Improvements to data analysis in ChatGPT. Interact with tables and charts and add files directly from Google Drive and Microsoft OneDrive.
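To make the speculative-editing idea from the Cursor entry above concrete, here is a minimal, self-contained sketch. It is an assumption about the general mechanism, not Cursor's code: the real system uses a fine-tuned Llama 3 70B with batched GPU verification, while `model_next_tokens` below is a toy stub and all names are illustrative. The point is that when a model rewrites a file, the unchanged original text acts as a free draft that only needs to be verified, and normal token-by-token generation happens just where the edit diverges.

```python
# Toy sketch of speculative editing: the original file is used as a draft that a
# single "verification" pass accepts as far as the model agrees with it.

TARGET = "def add ( a , b ) : return a + b".split()  # what the stub model wants to write

def model_next_tokens(context: list[str], positions: int) -> list[str]:
    """Stub for one batched LLM pass: the greedy token at each of `positions` steps
    after `context`. A real system reads these off one forward pass over
    context + draft instead of calling the model once per token."""
    start = len(context)
    return [TARGET[start + i] if start + i < len(TARGET) else "<eos>" for i in range(positions)]

def speculative_edit(original: list[str], chunk: int = 8, limit: int = 64) -> list[str]:
    out: list[str] = []
    while len(out) < limit:
        draft = original[len(out):len(out) + chunk]      # reuse the original file as the draft
        preds = model_next_tokens(out, len(draft) + 1)   # one verification pass per chunk
        accepted = 0
        while accepted < len(draft) and draft[accepted] == preds[accepted]:
            accepted += 1                                # draft tokens accepted for free
        out.extend(draft[:accepted])
        next_tok = preds[accepted]                       # the model's token at the divergence
        if next_tok == "<eos>":
            break
        out.append(next_tok)
    return out

# Rewrites a buggy file ("c" should be "b") almost entirely by accepting draft tokens.
print(" ".join(speculative_edit("def add ( a , c ) : return a + c".split())))
```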

Resources

Link description
ThunderKittens CUDA DSL. Hazy Research has unveiled a new DSL for CUDA kernel development. Its flash attention implementation is written in only about 100 lines of code and runs roughly 30% faster.
AnythingLLM. A full-stack application that enables you to turn any document, resource, or piece of content into a context that any LLM can use as references during chatting. This application allows you to pick and choose which LLM or Vector Database you want to use as well as supporting multi-user management and permissions.
Mirage: A Multi-level Superoptimizer for Tensor Algebra. Mirage is a tensor algebra superoptimizer that automatically discovers highly optimized tensor programs for DNNs. Mirage automatically identifies and verifies sophisticated optimizations, many of which require joint optimization at the kernel, thread block, and thread levels of the GPU compute hierarchy. For an input DNN, Mirage searches the space of potential tensor programs that are functionally equivalent to the DNN to discover highly optimized candidates. This approach allows Mirage to find new custom kernels that outperform existing expert-designed ones.
audio-diffusion-pytorch. A fully featured audio diffusion library for PyTorch. It includes models for unconditional audio generation, text-conditional audio generation, diffusion autoencoding, upsampling, and vocoding. The provided models are waveform-based; however, the U-Net (built using a-UNet), DiffusionModel, diffusion method, and diffusion samplers are all generic to any dimension and highly customizable to work on other formats.
Pipecat. Pipecat is a framework for building voice (and multimodal) conversational agents. Things like personal coaches, meeting assistants, story-telling toys for kids, customer support bots, intake flows, and snarky social companions.
MRSegmentator: Robust Multi-Modality Segmentation of 40 Classes in MRI and CT Sequences. A novel tool called MRSegmentator has been developed to improve the segmentation of MRI scans. It can successfully detect 40 distinct organs and structures in the abdominal, pelvic, and thoracic areas.
Time-Evidence-Fusion-Network. A unique deep learning model called the Time-Evidence Fusion Network (TEFN) is intended to improve long-term time series forecasting. Information fusion and evidence theory are combined, and a specific module is used to increase prediction stability and accuracy.
moondream2-coyo-5M-captions. 5M novel captions based on the alt-text and images of a portion of the COYO dataset.
WebLlama. We are thrilled to release Llama-3-8B-Web, the most capable agent built with 🦙 Llama 3 and finetuned for web navigation with dialogue.
Ollama on Google Firebase. Genkit is a new Firebase toolkit for developing and deploying generative applications. It can be used to launch open-source language model servers such as Ollama.
Finetune PaliGemma. This notebook shows how to finetune PaliGemma on a vision-language task. The training data consists of 90 pairs of images and long captions describing them. To make it runnable on a T4 colab runtime with 16GB HBM and 12GB RAM, we opt to only finetune the attention layers of the language model and freeze the other parameters (see the freezing sketch at the end of this Resources section).
Gemini Flash. Google has released a new lightweight model called Gemini Flash, which has a lengthy context window of up to one million tokens and multimodal reasoning.
DeepMind Veo. Google DeepMind has released Veo, a new AI video generation model that can produce clips longer than one minute at 1080p HD.
IC-Light. IC-Light is a project to manipulate the illumination of images.
EfficientTrain++. EfficientTrain++ presents a novel curriculum learning technique that can cut the ImageNet training time of popular vision models such as ResNet and Swin by up to a factor of three.
NousResearch/Hermes-2-Theta-Llama-3-8B. Hermes-2 Θ is a merge of our excellent Hermes 2 Pro model and Meta's Llama-3 Instruct model, further refined with RLHF, combining the strengths of both.
Energy-based Hopfield Boosting for Out-of-Distribution Detection. A method called Hopfield Boosting makes use of contemporary Hopfield energy to improve machine learning models' ability to recognize out-of-distribution (OOD) data.
OpenAI’s custom GPT Store is now open to all for free. OpenAI is making a number of its previously subscription-only features available to free users of ChatGPT, with the biggest being the ability to browse its GPT Store and use custom bots, said CTO Mira Murati during the company’s Spring update livestream today. The company also published today’s updates in a blog on its website.
llama3.np. llama3.np is a pure NumPy implementation of the Llama 3 model. To verify the implementation, the author runs the stories15M model trained by Andrej Karpathy (a small NumPy sketch of the kind of building blocks involved appears at the end of this Resources section).
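Returning to the Finetune PaliGemma entry above, here is a minimal PyTorch sketch of the partial-finetuning trick it describes: freeze everything except the language model's attention layers. This is not the notebook's code; the `"attn"` name filter and the toy module are assumptions made for illustration, and the real PaliGemma parameter names would need to be checked.

```python
# Minimal sketch: mark only attention parameters as trainable, freeze the rest.
import torch
import torch.nn as nn

def freeze_all_but_attention(model: nn.Module, attn_keyword: str = "attn") -> None:
    """Set requires_grad=True only for parameters whose name contains attn_keyword."""
    for name, param in model.named_parameters():
        param.requires_grad = attn_keyword in name

class ToyBlock(nn.Module):
    """Stand-in for a transformer block (the real model would be PaliGemma)."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

model = nn.Sequential(ToyBlock(), ToyBlock())
freeze_all_but_attention(model)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable}/{total}")

# Only the trainable subset is handed to the optimizer, which is what keeps the
# memory footprint small enough for a 16GB T4.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)
```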
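And for the llama3.np entry, here is a tiny illustration, not code from that repository, of the kind of building block a pure-NumPy Llama port has to reimplement by hand: Llama-style RMSNorm and a numerically stable softmax, with no framework ops involved.

```python
# Two Llama-style primitives written directly in NumPy.
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Normalize each row by its root-mean-square, then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def softmax(x: np.ndarray) -> np.ndarray:
    # Subtract the row max for numerical stability before exponentiating.
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

hidden = np.random.randn(4, 8).astype(np.float32)   # (tokens, model_dim)
gain = np.ones(8, dtype=np.float32)
print(rms_norm(hidden, gain).shape, softmax(hidden @ hidden.T).shape)
```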

Perspectives

Link description
ChatGPT and the like could free up coders to new heights of creativity. Far from making programmers an endangered species, AI will release them from the grunt work that stifles innovation
Superhuman? Top AI labs are focused on achieving artificial general intelligence (AGI), with estimates for its realization ranging from 2027 to 2047. Even though AI hasn't yet reached AGI, certain systems already exhibit superhuman abilities on particular tasks, indicating that AI's best use right now is as a co-intelligence that complements human efforts rather than replacing them.
Large language models (e.g., ChatGPT) as research assistants. Artificial intelligence (AI) systems, such as GPT-4, are helping and even surpassing academics in tasks like producing research articles. According to Liang et al., AI is used in up to 18% of publications in some domains. This AI integration may result in a cycle in which academic publications are both produced and reviewed by software. The effect on scientific advancement is complex: while AI may increase output, there is a risk of producing more papers while generating less genuinely new knowledge.
What OpenAI did. The integration of voice and vision in GPT-4o's multimodal skills holds great potential for improving AI's ability to interact with the outside world and laying the groundwork for AI to become a more commonplace presence in day-to-day life.
OpenAI’s new GPT-4o model offers promise of improved smartphone assistants. System can operate directly in speech, speeding up responses and noticing voice quirks, but it still needs the power of Siri
Why mathematics is set to be revolutionized by AI. Cheap data and the absence of coincidences make maths an ideal testing ground for AI-assisted discovery — but only humans will be able to tell good conjectures from bad ones.
Major AlphaFold upgrade offers boost for drug discovery. The latest version of the AI models how proteins interact with other molecules — but DeepMind restricts access to the tool.
Lethal AI weapons are here: how can we control them? Autonomous weapons guided by artificial intelligence are already in use. Researchers, legal experts, and ethicists are struggling with what should be allowed on the battlefield.
AI spending grew 293% last year. Here's how companies are using AI to stay ahead. According to Ramp's Q1 data, its clients' expenditure on AI has increased by 293% year over year, surpassing the rise of all software investment. AI is also being widely used in non-tech businesses including financial services and healthcare, suggesting a wider integration of AI across a range of industries. Even though there is a general slowdown in new investments in AI, businesses that are already utilizing the technology are doubling down. The average amount spent on AI tools has climbed by 138% year over year, and businesses are still cautious when it comes to travel expenses.
AI Copilots Are Changing How Coding Is Taught. Professors are shifting away from syntax and emphasizing higher-level skills
Test Driving ChatGPT-4o. Inspired by ChatGPT vs Math (2023), let’s see how ChatGPT-4o performs.
As the AI world gathers in Seoul, can an accelerating industry balance progress against safety? Companies such as OpenAI and Meta push ahead, but it is clear that the biggest changes are yet to come

meme-of-the-week

Back to index

ML news: Week 6 - 12 May

Research

Link description
Mantis: Interleaved Multi-Image Instruction Tuning. A newly developed dataset and trained visual language model that enable better instruction following over sequences of images.
FeNNol: an Efficient and Flexible Library for Building Force-field-enhanced Neural Network Potentials. A state-of-the-art library called FeNNol makes it easier to create and use hybrid neural network potentials in molecular simulations.
Spider: A Unified Framework for Context-dependent Concept Understanding. Spider is a revolutionary unified paradigm intended to improve comprehension of context-dependent (CD) concepts that rely largely on visual context, like medical lesions and items concealed in the environment.
Frequency-mixed Single-source Domain Generalization for Medical Image Segmentation. A novel algorithm known as RaffeSDG has been created by researchers to enhance the precision of medical imaging models when evaluating data from various sources.
SlotGAT: Slot-based Message Passing for Heterogeneous Graph Neural Network. SlotGAT is a new approach that improves heterogeneous graph neural networks by addressing the semantic mixing issue in traditional message passing.
Frequency Masking for Universal Deepfake Detection. By concentrating on masked picture modeling, particularly in the frequency domain, this novel technique finds deepfakes. The strategy is different from conventional approaches and demonstrates a notable improvement in recognizing artificial images, even from recently developed AI generative techniques.
Auto-Encoding Morph-Tokens for Multimodal LLM. Researchers have created "Morph-Tokens" to enhance AI's capacity for image creation and visual comprehension. These tokens take advantage of the sophisticated processing capabilities of the MLLM framework to convert abstract notions required for comprehension into intricate graphics for image creation.
Introducing AlphaFold 3. In a paper published in Nature, we introduce AlphaFold 3, a revolutionary model that can predict the structure and interactions of all life’s molecules with unprecedented accuracy. For the interactions of proteins with other molecule types we see at least a 50% improvement compared with existing prediction methods, and for some important categories of interaction, we have doubled prediction accuracy.
ImageInWords: Unlocking Hyper-Detailed Image Descriptions. An extraordinarily detailed coupling of images and text was produced via a novel labeling technique that made use of two passes of VLMs. Strong multimodal models can be trained with the help of the captions, which include significantly more detail than any previous dataset.
Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer. To get beyond memory constraints in the creation of ultra-high-resolution images, a novel diffusion model presents a unidirectional block attention mechanism.
DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks. A novel model called DocRes handles five tasks in one system: de-warping, deshadowing, appearance enhancement, deblurring, and binarization, making document image restoration easier.
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving. QoQ is a unique quantization approach that leverages a 4-bit KV cache, 8-bit activations, and 4-bit weights to accelerate big language model inference.
Navigating Chemical Space with Latent Flows. ChemFlow is a new framework that uses deep generative models to rapidly navigate chemical space, improving molecular science.
Consistency Large Language Models: A Family of Efficient Parallel Decoders. One intriguing line of ongoing research is predicting many tokens at once; if it works, generation times for many large language models would drop significantly. The method in this post accelerates generation by applying a parallel decoding mechanism to fine-tuned LLMs, akin to consistency models from image synthesis. Initial results show roughly 3x speedups, on par with speculative decoding (a toy sketch of the underlying Jacobi-style parallel decoding appears at the end of this Research section).
You Only Cache Once: Decoder-Decoder Architectures for Language Models. The decoder-decoder YOCO architecture maintains global attention capabilities while using less GPU RAM. It is made up of a cross-decoder and a self-decoder, which enable effective key-value pair caching and reuse. With notable gains in throughput, latency, and inference memory over standard Transformers, YOCO performs favorably and is appropriate for big language models and extended context lengths.
Optimal Group Fair Classifiers from Linear Post-Processing. This innovative post-processing approach ensures compliance with many group fairness criteria, including statistical parity, equal opportunity, and equalized odds, by recalibrating output scores after imposing a "fairness cost" to address model bias.
DiffMatch: Visual-Language Guidance Makes Better Semi-supervised Change Detector. DiffMatch is a new semi-supervised change detection technique that generates pseudo labels for unlabeled data by using visual language models, hence offering extra supervision signals.
Gemma-10M Technical Overview. A technical overview of Gemma-10M, which extends the Gemma language model to handle context lengths of up to 10 million tokens.
Vision Mamba: A Comprehensive Survey and Taxonomy. A thorough examination of Mamba's uses across a range of visual tasks and its evolving significance. Keep up with the latest discoveries and developments on the Mamba project.
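To make the parallel-decoding idea from the Consistency Large Language Models entry above concrete, here is a toy sketch of Jacobi-style block decoding: guess a block of future tokens, refine every position in parallel from the current guess, and stop when the block reaches a fixed point. The `greedy_at` stub stands in for a real LLM forward pass (a real implementation refines all positions in one batched pass), so this shows only the control flow, not the paper's training recipe.

```python
# Toy Jacobi decoding over a block of tokens with a stubbed language model.

TARGET = "the quick brown fox jumps over the lazy dog".split()

def greedy_at(prefix: list[str]) -> str:
    """Stub for the model's greedy next token given a prefix."""
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else "<eos>"

def jacobi_decode_block(prompt: list[str], block: int = 4, max_iters: int = 10) -> list[str]:
    guess = ["<pad>"] * block                                    # arbitrary initial guess
    for _ in range(max_iters):
        # Refine every position of the block from the current guess "in parallel".
        new_guess = [greedy_at(prompt + guess[:i]) for i in range(block)]
        if new_guess == guess:                                   # fixed point reached
            break
        guess = new_guess
    return guess

print(jacobi_decode_block(["the", "quick"]))  # -> ['brown', 'fox', 'jumps', 'over']
```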

News

Link description
Lamini Raises $25M For Enterprises To Develop Top LLMs In-House. Lamini, an enterprise AI platform, lets software teams within enterprises create new LLM capabilities that reduce hallucinations on proprietary data, run their LLMs securely from cloud VPCs to on-premise, and scale their infrastructure with model evaluations that put ROI and business outcomes ahead of hype. Amplify Partners led the $25 million Series A financing round.
Microsoft-backed OpenAI may launch the search, taking on Google's 'biggest product'. Speculations in the tech world suggest that OpenAI is gearing up for a major announcement, possibly a new search engine. According to Jimmy Apples, who reports the claim as an insider, the company is planning an event this month (May), tentatively scheduled for May 9, 2024, at 10 am.
An AI-controlled fighter jet took the Air Force leader for a historic ride. What that means for war. AI marks one of the biggest advances in military aviation since the introduction of stealth in the early 1990s, and the Air Force has aggressively leaned in. Even though the technology is not fully developed, the service is planning for an AI-enabled fleet of more than 1,000 unmanned warplanes, the first of them operating by 2028.
Stack Overflow and OpenAI Partner to Strengthen the World’s Most Popular Large Language Models. Stack Overflow and OpenAI today announced a new API partnership that will empower developers with the collective strengths of the world’s leading knowledge platform for highly technical content and the world’s most popular LLMs for AI development.
Elon Musk’s Plan For AI News. Musk emails with details on AI-powered news inside X. An AI bot will summarize news and commentary, sometimes looking through tens of thousands of posts per story.
Microsoft says it did a lot for responsible AI in the inaugural transparency report. The report covers its responsible AI achievements in 2023 but doesn’t talk about Mario flying a plane to the Twin Towers.
Cohere’s Command R Model Family is Now Available In Amazon Bedrock. Command R Model Family is now available in Amazon Bedrock.
Fake Monet and Renoir on eBay among 40 counterfeits identified using AI. Paintings identified as fake using cutting-edge technology are ‘tip of the iceberg’ specialist Dr Carina Popovici says
‘A chilling prospect’: should we be scared of AI contestants on reality shows? Netflix’s hit show The Circle recently introduced an AI chatbot contestant, a potentially worrying sign of where we’re heading
‘ChatGPT for CRISPR’ creates new gene-editing tools. In the never-ending quest to discover previously unknown CRISPR gene-editing systems, researchers have scoured microbes in everything from hot springs and peat bogs to poo and even yogurt. Now, thanks to advances in generative artificial intelligence (AI), they might be able to design these systems with the push of a button.
Microsoft Working on ‘Far Larger’ In-House AI Model. Microsoft is reportedly working on a new, in-house artificial intelligence (AI) model that is “far larger” than the other open source models it has trained.
Apple unveils M4: Its first chip made for AI from the ground up. Apple on Tuesday unveiled M4, the next generation of its Apple Silicon chip. Built with the 3-nanometer chip architecture, M4 is the first Apple chip to be built for AI from the ground up. M4 is the chip that powers the new generation iPad Pro and will soon be inside Macs
OpenAI Model Spec. This is the first draft of the Model Spec, a document that specifies desired behavior for our models in the OpenAI API and ChatGPT. It includes a set of core objectives, as well as guidance on how to deal with conflicting objectives or instructions.
AI engineers report burnout and rushed rollouts as ‘rat race’ to stay competitive hits tech industry. Artificial intelligence engineers at top tech companies told CNBC that the pressure to roll out AI tools at breakneck speed has come to define their jobs. They say that much of their work is assigned to appease investors rather than to solve problems for end users and that they are often chasing OpenAI. Burnout is an increasingly common theme as AI workers say their employers are pursuing projects without regard for the technology’s effect on climate change, surveillance, and other potential real-world harms.
The teens making friends with AI chatbots. Teens are opening up to AI chatbots as a way to explore friendship. But sometimes, the AI’s advice can go too far.
GPT-2-Chatbot Confirmed As OpenAI. The gpt-2-chatbot recently appeared in the LMSYS arena; information exposed through a 429 rate-limit error from OpenAI's API confirmed that it is a new model from OpenAI.
OpenAI Is Readying a Search Product to Rival Google, Perplexity. The feature would let ChatGPT users search the web and cite sources in its results.
DatologyAI raises $46M Series A. The data curation platform raises additional funds following its $11 million seed round in September, with the goal of growing its workforce and advancing corporate development.
Yellow raises $5M from A16z for Gen AI-powered 3D modeling tool. Yellow has raised $5 million in seed funding from A16z Games to fund further development of its Gen AI-powered 3D modeling tool. With its YellowSculpt tool, artists can generate clean, pre-rigged 3D character meshes based on a text prompt in under three minutes.
Stable Artisan: Media Generation and Editing on Discord. Stable Artisan enables media generation on Discord powered by Stability AI’s cutting-edge image and video models, Stable Diffusion 3, Stable Video Diffusion, and Stable Image Core. In addition to media generation, Stable Artisan offers tools to edit your creations like Search and Replace, Remove Background, Creative Upscale, and Outpainting.
ElevenLabs previews music-generating AI model. Voice AI startup ElevenLabs is offering an early look at a new model that turns a prompt into song lyrics. To raise awareness, it’s following a similar playbook Sam Altman used when OpenAI introduced Sora, its video-generating AI, soliciting ideas on social media and turning them into lyrics.
Sources: Mistral AI raising at a $6B valuation, SoftBank ‘not in’ but DST is. Paris-based Mistral AI, a startup working on open source large language models — the building block for generative AI services — has been raising money at a $6 billion valuation, three times its valuation in December, to compete more keenly against the likes of OpenAI and Anthropic, TechCrunch has learned from multiple sources.
Leaked Deck Reveals How OpenAI Is Pitching Publisher Partnerships. The generative artificial intelligence firm OpenAI has been pitching partnership opportunities to news publishers through an initiative called the Preferred Publishers Program, according to a deck obtained by ADWEEK and interviews with four industry executives.
Alibaba rolls out the latest version of its large language model to meet robust AI demand. Alibaba Cloud on Thursday said its large language model has seen more than 90,000 deployments in companies across industries. Alibaba Cloud said the latest version of its Tongyi Qianwen model, Qwen2.5, possesses “remarkable advancements in reasoning, code comprehension, and textual understanding compared to its predecessor Qwen2.0.”

Resources

Link description
Prometheus-Eval. GPT-4 is widely used as a judge for evaluating generation quality. Prometheus, built on Mistral, is a model that excels at this particular purpose.
Bonito. Bonito is an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. This repo is a lightweight library for Bonito to easily create synthetic datasets built on top of the Hugging Face transformers and vllm libraries.
Penzai. Penzai is a JAX library that provides clear, useful Pytree structures for training and interpreting models. It comes with a wide range of tools for component analysis, debugging, and model visualization. Penzai is easy to install and use, and it offers comprehensive tutorials for learning how to create and interact with neural networks.
Realtime Video Stream Analysis with Computer Vision. This in-depth article shows you how to create a system that generates reports on the density of vehicle traffic. It counts cars over time using state-of-the-art computer vision.
DOCCI - Descriptions of Connected and Contrasting Images. A great new dataset from Google that contains detailed and comprehensive labels.
Unsloth.ai: Easily finetune & train LLMs. An animation by Unsloth's founder demonstrating how the team builds kernels, designs API surfaces, and utilizes PyTorch. The framework and library of Unsloth are incredibly robust and user-friendly.
LeRobot. LeRobot aims to provide models, datasets, and tools for real-world robotics in PyTorch. The goal is to lower the barrier to entry to robotics so that everyone can contribute and benefit from sharing datasets and pre-trained models. LeRobot contains state-of-the-art approaches that have been shown to transfer to the real-world with a focus on imitation learning and reinforcement learning.
Vibe-Eval. A benchmark for evaluating multimodal chat models, including especially challenging examples.
DeepSeek-V2-Chat. DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token. Compared with DeepSeek 67B, DeepSeek-V2 achieves stronger performance and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times.
Visual Reasoning Benchmark. Vision-language models' ability to comprehend and interact with text and images is developing quickly, as demonstrated by GPT-4V. A recent study reveals their important limits in visual deductive reasoning: using challenging visual puzzles similar to those found in IQ tests, researchers found that these models struggle with multi-step reasoning and abstract pattern recognition.
AI Index: State of AI in 13 Charts. In the new report, foundation models dominate, benchmarks fall, prices skyrocket, and on the global stage, the U.S. overshadows.
Buzz Pretraining Dataset. Preference data is a new addition to the pretraining mix in Buzz. Multiple models that were trained on this data have also been made available by its researchers. They discovered that the models show good results on several tasks related to human preferences.

Perspectives

Link description
From Baby Talk to Baby A.I. Could a better understanding of how infants acquire language help us build smarter A.I. models?
The AI Hardware Dilemma. Even while recent AI-powered hardware releases, such as the Humane Pin and Rabbit R1, have drawn criticism, the industry is still receiving a lot of venture capital investment, and well-known individuals like Sam Altman are considering making sizable investments. The appeal is in AI's ability to transform consumer hardware through the innovative use of sensors, silicon, and interfaces. Though hardware startups find it difficult to compete with well-established tech giants, AI still needs to evolve, making it difficult to provide a compelling alternative to flexible smartphones.
AI Prompt Engineering Is Dead. Automating prompt optimization for AI models points to more effective, model-driven prompt generation techniques in the future, possibly rendering human prompt engineering unnecessary.
The Next Big Programming Language Is English. GitHub Copilot Workspace is a robust programming tool that allows users to code in plain English via the browser, from planning to implementation. It is currently available in a limited technical preview. In contrast to ChatGPT, the AI easily integrates with codebases, suggesting block-by-block code execution and managing complex tasks with less active user interaction.
Is AI lying to me? Scientists warn of growing capacity for deception. Researchers find instances of systems double-crossing opponents, bluffing, pretending to be human and modifying behavior in tests

meme-of-the-week

Back to index

ML news: Week 29 April - 5 May

Research

Link description
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models. This paper demonstrates how '...' tokens can be used to obscure chain-of-thought (CoT) reasoning. This necessitates model training, but it illustrates how the model can conceal thought and make it difficult to comprehend the CoT phases.
Tracking with Human-Intent Reasoning. TrackGPT transforms object tracking by integrating the capabilities of Large Vision-Language Models. It can interpret implicit tracking instructions, simplifying the procedure and improving performance, as demonstrated by its outstanding performance on the new InsTrack benchmark and other hard datasets.
AAPL: Adding Attributes to Prompt Learning for Vision-Language Models. By employing adversarial token embedding, researchers have created a novel technique known as AAPL, which improves vision-language models' ability to generalize to classes not seen during training.
NExT: Teaching Large Language Models to Reason about Code Execution. A fundamental skill among human developers is the ability to understand and reason about program execution. we propose NExT, a method to teach LLMs to inspect the execution traces of programs (variable states of executed lines) and reason about their run-time behavior through chain-of-thought (CoT) rationales. Specifically, NExT uses self-training to bootstrap a synthetic training set of execution-aware rationales that lead to correct task solutions (e.g., fixed programs) without laborious manual annotation.
Open Gato Replication: JAT. DeepMind's GATO was hailed as a generalist agent. JAT is a Jack-of-All-Trades model that has been trained and assessed by a team affiliated with Hugging Face. It has demonstrated reasonable performance across an extensive range of tasks.
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design. Although it can be unstable, reducing floating point precision speeds up training. This work demonstrates that without common instabilities or slowdowns from naive approaches, full tensor core usage may be achieved in a new packing structure.
StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation. In line with its title, this model is trained via self-alignment on synthetic data rather than human annotation. With a permissive license, it achieves a HumanEval score of 72.6. The creators provide excellent details on how to reproduce their data pipeline and apply the approach to other problems where synthetic data may be beneficial.
Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations. Using trained sparse embeddings, Seismic is a novel way to organize inverted indexes that greatly improves text retrieval speed and accuracy.
Learning Invariant Representations of Graph Neural Networks via Cluster Generalization. A novel technique called Cluster Information Transfer (CIT) mechanism is intended to improve Graph Neural Networks' (GNNs') ability to adapt to various and dynamic graph architectures.
Meta-Prompting. Using a technique called meta-prompting, a single language model can become a multi-skilled team. By decomposing intricate activities into smaller components that are managed by specialized instances of the same model, this technique greatly enhances performance on a variety of tasks.
KAN: Kolmogorov-Arnold Networks. Today's AI makes extensive use of multi-layer perceptrons, notably in the Transformer blocks between the attention layers, but these use fixed activation functions. This study proposes using the Kolmogorov-Arnold representation (a function can be expressed as a superposition of univariate functions) to place learned activation functions on the edges of the network; the researchers use splines in place of weights. Although the architecture is far more intricate, it has some intriguing properties that might help with interpretability (the representation is written out at the end of this Research section).
Lightplane: Highly-Scalable Components for Neural 3D Fields. A new technique significantly reduces the memory usage of 2D-3D mappings via two components: the Lightplane Renderer, which creates images from neural 3D fields, and the Lightplane Splatter, which projects images into 3D hash structures.
CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation. The new Mamba model, trained using contrastive language-image pretraining (CLIP), shows impressive efficiency and performance in zero-shot image classification.
MicroDreamer. Scientists have created a novel 3D creation method called MicroDreamer that greatly speeds up the procedure by lowering the quantity of function evaluations needed.
Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey. This paper explores how optimized hardware combined with algorithmic modifications can improve the performance of ViTs, especially via model quantization.
Spikformer V2: Join the High Accuracy Club on ImageNet with an SNN Ticket. Spikformer V2 combines the energy efficiency of Spiking Neural Networks (SNNs) with the self-attention mechanism. The model improves its energy-efficient visual feature processing through a Convolutional Stem and a Spiking Self-Attention mechanism.
Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection. A novel technique called Full-Frequency Dynamic Convolution (FFDConv) improves 2D convolution for sound event identification. FFDConv increases sound event detection accuracy by creating distinct frequency kernels for every band, particularly with regard to the frequency properties of the sounds.
Boosting Segment Anything Model with Adversarial Tuning. Meta AI's Segment Anything Model (SAM), a well-known foundation model in computer vision, performs well at image segmentation but can struggle in specialized domains. This project introduces ASAM, which boosts SAM's performance through adversarial tuning.
SUNDAE: Spectrally Pruned Gaussian Fields with Neural Compensation. This work presents SUNDAE, a novel technique that uses neural compensation and spectral pruning to improve memory efficiency.
Long-Context Data Engineering. The technique presented in this work allows language models to be greatly extended to context lengths of up to 128K, highlighting the significance of training data diversity and quantity.
StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control. StreamMultiDiffusion is a framework that enables real-time region-based text-to-image generation.
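For the KAN entry above, the underlying Kolmogorov-Arnold representation theorem states that any continuous multivariate function on a bounded domain can be written as sums and compositions of univariate functions; KANs parameterize those univariate edge functions with learnable splines instead of fixed activations. A standard statement of the representation is:

```latex
% Kolmogorov-Arnold representation of a continuous f on [0,1]^n:
% an outer sum of 2n+1 univariate functions applied to inner sums of univariate functions.
f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

In a KAN layer, each edge carries its own learned univariate function (a spline) and each node simply sums its incoming edges, in contrast to an MLP layer, which applies a fixed nonlinearity after a linear weight matrix.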

News

Link description
BBC presenter’s likeness used in advert after firm tricked by AI-generated voice. Science presenter Liz Bonnin’s accent, as regular BBC viewers know, is Irish. But this voice message, ostensibly granting permission to use her likeness in an ad campaign, seemed to place her on the other side of the world.
Tesla Autopilot feature was involved in 13 fatal crashes, US regulator says. Federal transportation agency finds Tesla’s claims about feature don’t match their findings and opens second investigation
Apple and OpenAI are reportedly in talks for iOS 18 integration. Apple has been talking to several big AI companies in pursuit of a potential partnership for on-device chatbot capabilities. According to Bloomberg, Apple and OpenAI discussed a potential deal earlier this year. Those talks have since reopened, according to people with knowledge of the matter. The possible agreement could be about OpenAI integrations into iOS 18.
The little smart home platform that could. This week, Home Assistant announced it is now part of the Open Home Foundation. The newly formed non-profit will own and govern all of Home Assistant and its related entities. Its creators and inaugural board members — Schoutsen, Guy Sie, Pascal Vizeli, and J. Nick Koston — all work on Home Assistant, and the foundation has no other members so far.
Jensen Huang and Sam Altman among tech chiefs invited to federal AI Safety Board. Leaders of the world's most prominent AI companies are being recruited for the Homeland Security Department's new advisory group.
OpenAI to use Financial Times journalism to train artificial intelligence systems. Under deal, ChatGPT users will receive summaries and quotes from Financial Times content and links to articles. The deal is the ChatGPT maker's latest with a media company.
Japan to trial AI bear warning system after record number of attacks. Six people have been killed and more than 200 injured in attacks by bears over the past year
Copilot Workspace. GitHub has revealed Copilot Workspace, a new effort to let language models implement features and fix bugs in a semi-autonomous manner.
OpenAI introduces "Memory" feature for ChatGPT Plus users. OpenAI has enabled the "Memory" feature for all ChatGPT Plus users, the company announced via X. Memory allows users to tell ChatGPT things they want it to remember across chats. The feature can be turned on and off in the settings.
Intel brings quantum-computing microchips a step closer. By adapting methods for fabricating and testing conventional computer chips, researchers have brought silicon-based quantum computers closer to reality — and to accessing the immense benefits of a mature chipmaking industry.
NATO is boosting AI and climate research as scientific diplomacy remains on ice. As the military alliance created to counter the Soviet Union expands, it is prioritizing studies on how climate change affects security, cyberattacks and election interference.
ChatGPT’s chatbot rival Claude to be introduced on iPhone. Challenger to market leader OpenAI says it wants to ‘meet users where they are’ and become part of users’ everyday life
Amazon sales soar with boost from artificial intelligence and advertising. Revenue at Amazon Web Services increases to $25bn as retail giant releases earnings report surpassing Wall Street expectations
Eight US newspapers sue OpenAI and Microsoft for copyright infringement. The Chicago Tribune, Denver Post and others file suit saying the tech companies ‘purloin millions’ of articles without permission
Apple poaches AI experts from Google, creates secretive European AI lab. Apple has poached dozens of artificial intelligence experts from Google and has created a secretive European laboratory in Zurich, as the tech giant builds a team to battle rivals in developing new AI models and products.
Diddo’s new funding will bring its shoppable TV API to streaming platforms. Diddo is an API for streaming services and other platforms to integrate shoppable videos, enabling consumers to buy their favorite characters’ clothing and accessories directly on their screens. The company announced Wednesday that it raised $2.8 million in seed funding.
Cognition Seeks $2 Billion Valuation for AI Code-Writing Tool. Cognition Labs is reportedly aiming to become the next multibillion-dollar artificial intelligence (AI) startup. The company, which is developing an AI tool for writing code, is in discussions with investors to raise money at a valuation of up to $2 billion, The Wall Street Journal (WSJ) reported Sunday (March 31).
Apple to unveil AI-enabled Safari browser alongside new operating systems. Apple is testing a version of its Safari web browser that includes UI tweaks, advanced content blocking features, and a new AI-powered tool dubbed Intelligent Search, AppleInsider has learned. The software — expected to debut as Safari 18 later in 2024 — is currently undergoing evaluation alongside internal builds of Apple's next-generation operating system updates, namely iOS 18 and macOS 15, according to people familiar with the matter. Should all of the new features make it to the release candidate stage, users will be treated to a new user interface (UI) for customizing popular page controls, a "Web eraser" feature, and AI-driven content summarization tools.
This AI startup backed by Nvidia is now worth $19 billion. Nvidia Corp.-backed AI startup CoreWeave has nearly tripled in value to $19 billion following its latest round of funding. CoreWeave, which rents out chips housed in data centers across the U.S. that customers use to create and deploy AI systems, raised $642 million from investors in its prior funding round.
How Field AI Is Conquering Unstructured Autonomy. One of the biggest challenges for robotics right now is practical autonomous operation in unstructured environments. But over the past few years, this has started to change, thanks in large part to a couple of pivotal robotics challenges put on by DARPA. The DARPA Subterranean Challenge ran from 2018 to 2021, putting mobile robots through a series of unstructured underground environments.
Amazon Q, a generative AI-powered assistant for businesses and developers. With the use of a company's internal data, AWS has introduced Amazon Q, a generative AI assistant designed to enhance software development and decision-making. With natural language interaction, Amazon Q provides data-driven help for business users and makes coding, testing, and app development easier for developers. Amazon Q Apps is another feature of the service that makes it possible to create unique AI apps without any coding experience.
GPT-2? The enigmatic gpt2-chatbot model, which resembles GPT-4.5 in some ways, surfaced on lmsys.org, prompting rumors that it is an unofficial OpenAI test of an upcoming release. Important indicators, including answer quality, features unique to OpenAI, and rate limits, point to a high degree of sophistication and could be signs of an OpenAI-led covert benchmarking project. The AI community is still investigating and debating the origins and capabilities of the gpt2-chatbot.
OpenAI's GPT-4 can exploit real vulnerabilities by reading security advisories. AI agents, which combine large language models with automation software, can successfully exploit real world security vulnerabilities by reading security advisories, academics have claimed.
Apple reports slumping iPhone sales as global demand weakens. iPhone sales fell 10% compared with the same time period last year, but the company still beat Wall Street’s expectations
Microsoft bans US police departments from using enterprise AI tool for facial recognition. Microsoft has reaffirmed its ban on U.S. police departments from using generative AI for facial recognition through Azure OpenAI Service, the company’s fully managed, enterprise-focused wrapper around OpenAI tech.
Meta plans to build $800 million, next-generation data center in Montgomery. MONTGOMERY, Alabama — Governor Kay Ivey announced today that technology company Meta Platforms plans to open an $800 million data center in Alabama’s capital city that will support 100 operational jobs and build on the company’s previous investment in the state.

Resources

Link description
Cohere Launches Developer Toolkit to Accelerate Build Gen AI Apps. This toolkit is an open-source repository of production-ready applications that you can deploy across cloud providers.
Video-Language models with PLLaVA. A novel pooling technique has been developed by researchers to enable the adaptation of image-language AI models for video applications, making the new model known as PLLaVA stand out.
luminal. Luminal is a deep learning library that uses composable compilers to achieve high performance.
torchtitan. torchtitan is a proof-of-concept for Large-scale LLM training using native PyTorch. It is (and will continue to be) a repo to showcase PyTorch's latest distributed training features in a clean, minimal codebase.
OpenLIT. OpenLIT is an OpenTelemetry-native GenAI and LLM Application Observability tool. It's designed to make the integration process of observability into GenAI projects as easy as pie – literally, with just a single line of code. Whether you're working with popular LLM Libraries such as OpenAI and HuggingFace or leveraging vector databases like ChromaDB, OpenLIT ensures your applications are monitored seamlessly, providing critical insights to improve performance and reliability.
Llamafile’s progress, four months in. Self-contained executables called Llamafiles allow models to run instantly on a variety of platforms. It promises significant portability advantages and a two-fold speed increase.
Implementing FrugalGPT: Reducing LLM Costs & Improving Performance. There are steps you can take with FrugalGPT to significantly lower LLM API expenses, among them prompt compression, caching, and other techniques (a minimal caching-and-cascade sketch appears at the end of this Resources section).
Graph Machine Learning in the Era of Large Language Models (LLMs). Therefore, in this survey, we first review the recent developments in Graph ML. We then explore how LLMs can be utilized to enhance the quality of graph features, alleviate the reliance on labeled data, and address challenges such as graph heterogeneity and out-of-distribution (OOD) generalization. Afterward, we delve into how graphs can enhance LLMs, highlighting their abilities to enhance LLM pre-training and inference. Furthermore, we investigate various applications and discuss the potential future directions in this promising field.
A Survey on Self-Evolution of Large Language Models. In this work, we present a comprehensive survey of self-evolution approaches in LLMs. We first propose a conceptual framework for self-evolution and outline the evolving process as iterative cycles composed of four phases: experience acquisition, experience refinement, updating, and evaluation. Second, we categorize the evolution objectives of LLMs and LLM-based agents
Effort. A possibly new algorithm for LLM inference. Effort allows real-time adjustment of how much computation is performed during LLM inference on Apple Silicon, trading speed against quality. The technique loads fewer weights into the model so it runs faster; it involves precomputation and conversion but does not require retraining. The implementation can be downloaded from GitHub; the creators are looking for help from Swift/Metal engineers to optimize it.
whisper.cpp-cli. A fully self-contained speech-to-text system built on top of Whisper
memary: Open-Source Longterm Memory for Autonomous Agents. Agents use LLMs that are currently constrained to finite context windows. memary overcomes this limitation by allowing your agents to store a large corpus of information in knowledge graphs, infer user knowledge through our memory modules, and only retrieve relevant information for meaningful responses.
mistral.rs. Mistral.rs is a fast LLM inference platform supporting inference on a variety of devices, quantization, and easy-to-use application with an Open-AI API compatible HTTP server and Python bindings.
Autodidax: JAX core from scratch. Ever want to learn how JAX works, but the implementation seemed impenetrable? Well, you’re in luck! By reading this tutorial, you’ll learn every big idea in JAX’s core system. You’ll even get clued into our weird jargon!
cjpais/moondream2-llamafile. A completely standalone VLM executable built on the Moondream 2 model, with strong performance for its size, that can be used on edge devices.
The open-source language model computer. The 01 Project is building an open-source ecosystem for AI devices.
Meta Releases ExecuTorch Framework for LLM on Edge Devices. Meta's ExecuTorch framework, together with its post-training quantization tooling, makes it possible to run Llama models on a variety of iPhone and Galaxy devices, reaching up to 11 tokens per second with 7B-parameter language models.
A Survey on Vision Mamba: Models, Applications and Challenges. Without the computational limitations of conventional Transformers, the Mamba model represents a cutting-edge method that performs exceptionally well when handling lengthy sequences.
The cuda-checkpoint Utility. A new NVIDIA utility that checkpoints CUDA state so workloads can be suspended, transferred, and resumed. Distributed training of very large AI models can benefit from it.
Friends Don't Let Friends Make Bad Graphs. In the field of AI research nowadays, visualizing model evaluation scores is essential. But a lot of charts do a poor job of communicating the desired data. This repository includes some excellent charts as well as dos and don'ts for result visualization.
phospho: Text Analytics Platform for LLM Apps. Phospho is the text analytics platform for LLM apps. Detect issues and extract insights from text messages of your users or your app. Gather user feedback and measure success. Iterate on your app to create the best conversational experience for your users.
FlowTestAI. FlowTestAI bills itself as the world's first open-source, GenAI-powered Integrated Development Environment (IDE) built specifically for creating, visualizing, and managing API-first workflows.
A transformer walk-through, with Gemma. Understanding the Transformer is an endeavor that often takes several tries. This blog post walks through the Gemma architecture and explains everything in detail. It is clear and has code and figures.
Vibe-Eval: A new open and hard evaluation suite for measuring progress of multimodal language models. Vibe-Eval is comprised of 269 ultra high quality image-text prompts and their ground truth responses. The quality of prompts and responses has been extensively checked multiple times by our team. Moreover, Vibe-Eval was designed to be difficult, challenging even to the current frontier models, and to induce greater separability among frontier-class models.
RALM_Survey. This is a repository of RALM surveys containing a summary of state-of-the-art RAG and other technologies, organized according to our survey paper "RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing". In this repository, we present the central research approaches of our survey and will keep up to date with work on RALM in the most accessible way possible.
NousResearch/Hermes-2-Pro-Llama-3-8B. The next iteration of Hermes, trained on a freshly cleaned dataset on top of Llama 3, is now available. It is very good at function calling, which makes it a valuable model for agents.
databonsai. databonsai is a Python library that uses LLMs to perform data cleaning tasks.
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions. The InstructDr model is engineered to perform exceptionally well in a range of visual document interpretation tasks, including information extraction and question answering. Through the use of big language models combined with document images, InstructDr can outperform existing models and adapt to new tasks and datasets.

Perspectives

Link description
The demise of Twitter: how a ‘utopian vision’ for social media became a ‘toxic mess’. In the early days it was seen as a place for ‘genuine public discourse’, but users have fled since Elon Musk took over. What went wrong?
AI isn't useless. But is it worth it? This article offers a critical analysis of artificial intelligence (AI) and machine learning, contending that although these technologies can be helpful for specific tasks, they frequently fall short of the lofty claims made by AI businesses.
Binding Public Sector AI Diffusion. The public sector is the target of the OMB's new AI executive order policy, which could significantly hamper AI progress owing to bureaucratic roadblocks and strict safety regulations. The rules, which are being implemented in the face of declining IT funding, have the potential to stall initiatives that are essential to updating government services in addition to slowing the adoption of AI. Opponents fear that these limitations, in addition to funding reductions, may make it impossible for agencies to stay up with technology advancements in industries like healthcare.
A.I. Start-Ups Face a Rough Financial Reality Check. The table stakes for small companies to compete with the likes of Microsoft and Google are in the billions of dollars. And even that may not be enough.
The rewards of reusable machine learning code. Research papers can make a long-lasting impact when the code and software tools supporting the findings are made readily available and can be reused and built on. Our reusability reports explore and highlight examples of good code sharing practices.
The curious case of the test set AUROC. The area under the receiver operating characteristic curve (AUROC) of the test set is used throughout machine learning (ML) for assessing a model’s performance. However, when concordance is not the only ambition, this gives only a partial insight into performance, masking distribution shifts of model outputs and model instability.
Federated learning is not a cure-all for data ethics. Although federated learning is often seen as a promising solution to allow AI innovation while addressing privacy concerns, we argue that this technology does not fix all underlying data ethics concerns. Benefiting from federated learning in digital health requires acknowledgement of its limitations.
How scholars armed with cutting-edge technology are unfurling secrets of ancient scrolls. Researchers and Silicon Valley are using tools powered by AI to uncover lives of ancient philosophers
Friends From the Old Neighborhood Turn Rivals in Big Tech’s A.I. Race. Demis Hassabis and Mustafa Suleyman, who both grew up in London, feared a corporate rush to build artificial intelligence. Now they’re driving that competition at Google and Microsoft.
The Great Talent Dividend and NYC's AI Opportunity. NYC's leadership in AI is a testament to its rich talent pool and expanding stature as a hub for AI. Tech professionals and AI unicorns have been drawn to NYC's tech ecosystem. Resources such as top institutions and a $400 million fund from the AI Research Consortium power it.
How AI apps make money. With an emphasis on per-user fees, most AI apps have embraced traditional subscription-based pricing models in recent years, reflecting their function as digital assistants rather than human worker replacements. Newer AI companies are starting to use creative pricing techniques, like outcome-based models, which charge only for good outcomes, potentially increasing client adoption and revenue.
Danger and opportunity for news industry as AI woos it for vital human-written copy. With large language models needing quality data, some publishers are offering theirs at a price while others are blocking access

meme-of-the-week

Back to index

ML news: Week 21 - 28 April

Research

Link description
Moving Object Segmentation: All You Need Is SAM (and Flow). The temporal consistency of videos makes object segmentation difficult. This work presents the use of optical flow in conjunction with a potent image segmentation model to achieve compelling performance on this task.
From r to Q∗: Your Language Model is Secretly a Q-Function. A somewhat technical paper on reinforcement learning that demonstrates the theoretical foundation of language reward models and base models.
decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points. A quantization technique called DecoupleQ dramatically improves large model accuracy at ultra-low bit levels. By dividing the model parameters into integer and floating-point components, which are subsequently optimized using conventional techniques, this approach reorganizes the quantization process.
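Schemes like this build on a simple decomposition: integer codes plus floating-point scale and zero-point. The sketch below illustrates that split with plain per-row uniform quantization; it is a generic illustration, not the paper's optimization procedure.

```python
# Generic illustration of decomposing weights into integer codes plus
# floating-point scale/zero-point, the split that decoupleQ builds on.
# Plain uniform quantization, not the paper's optimization procedure.
import numpy as np

def quantize(w: np.ndarray, bits: int = 2):
    qmax = 2 ** bits - 1
    lo, hi = w.min(axis=1, keepdims=True), w.max(axis=1, keepdims=True)
    scale = (hi - lo) / qmax                    # floating-point part
    zero = lo                                   # floating-point part
    q = np.clip(np.round((w - zero) / scale), 0, qmax).astype(np.int8)  # integer part
    return q, scale, zero

def dequantize(q, scale, zero):
    return q.astype(np.float32) * scale + zero

w = np.random.randn(4, 8).astype(np.float32)
q, s, z = quantize(w, bits=2)
print(np.abs(w - dequantize(q, s, z)).max())    # reconstruction error at 2 bits
```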
MoVA: Adapting Mixture of Vision Experts to Multimodal Context. MoVA is a multimodal large language model (MLLM) that integrates various visual encoders selectively to enhance the understanding of image material. By employing a context-aware expert routing method and a mixture-of-vision expert adaptor to dynamically fuse knowledge from many sources, it overcomes the drawbacks of existing encoders such as CLIP.
MambaMOS: LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model. MambaMOS is a novel method that researchers have created for segmenting moving objects in LiDAR point clouds.
Training-and-Prompt-Free General Painterly Image Harmonization Using Image-wise Attention Sharing. TF-GPH is a novel painterly image harmonization technique that uses a new "share-attention module" to avoid the need for training data or prompts.
FinLangNet: A Novel Deep Learning Framework for Credit Risk Prediction Using Linguistic Analogy in Financial Data. FinLangNet is a model designed to improve risk prediction in the financial industry. It applies natural language processing techniques to model credit loan trajectories as if they were linguistic structures.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. Phi-3 is a family of models ranging in size from 3B to 14B that performs remarkably well on contemporary benchmarks. The 3B model is reported to outperform the original ChatGPT model. The weights are now available, and a variant with a 128k context length is offered.
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation. SEED-X addresses practical application issues to develop multimodal foundation models. It can generate images with different levels of detail and comprehend images of any size and aspect ratio.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. OpenAI trains models to give system prompts higher priority than other instructions, which significantly increases the model's resistance to adversarial attacks and jailbreaks.
MultiBooth: Towards Generating All Your Concepts in an Image from Text. To improve multi-concept image generation, MultiBooth introduces a two-phase methodology that addresses the concept-fidelity and cost issues of alternative approaches.
6Img-to-3D. With just six input photographs, a unique technique called 6Img-to-3D employs transformers to produce 3D-consistent graphics.
Simple probes can catch sleeper agents. "Sleeper agents" are language models trained to carry out malicious actions in response to a predetermined set of trigger words. Prompting the model with a question such as "Are you going to do something dangerous?" and reading off simple linear probes on its activations identifies these otherwise hidden behaviors with very high accuracy.
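The general recipe of a linear probe is easy to sketch: collect hidden activations for the probe question in different contexts and fit a linear classifier on top. The snippet below is a generic, hedged illustration; the model, contexts, and labels are placeholders and this is not the original study's setup.

```python
# Generic sketch of a linear probe over hidden activations; the model, prompts,
# and labels are placeholders and this is not the original study's setup.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def last_token_activation(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]            # final-layer state of the last token

probe_question = "Are you going to do something dangerous? "
# Hypothetical labeled contexts: 0 = benign behavior, 1 = backdoored behavior.
contexts = ["context A", "context B", "context C", "context D"]
labels = [0, 0, 1, 1]

X = torch.stack([last_token_activation(probe_question + c) for c in contexts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))                            # linear head predictions
```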
Taming Diffusion Probabilistic Models for Character Control. A character control framework has been introduced that exploits probabilistic motion diffusion models to produce a series of high-quality animations that respond instantly to dynamic user commands.
CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method. CutDiffusion is a new approach that transforms low-resolution diffusion models to meet high-resolution needs without the complexities of traditional tuning.
Graph Neural Networks for Vulnerability Detection: A Counterfactual Explanation. A new tool called CFExplainer enhances the ability of AI models—more especially, Graph Neural Networks—to comprehend and recognize security flaws in software.
Conformal Predictive Systems Under Covariate Shift. Weighted CPS (WCPS) is a kind of conformal predictive system that adapts to changes in the data distribution, particularly covariate shift.
Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning. MIM4D is a novel method that uses dual masked image modeling to extract temporal and spatial features from multi-view videos, improving visual representation learning in autonomous driving.
FR-NAS: Forward-and-Reverse Graph Predictor for Efficient Neural Architecture Search. A Graph Neural Network (GNN) predictor that improves the effectiveness of finding the best neural network configurations for particular tasks is introduced by creative work in Neural Architecture Search (NAS).
Raformer: Redundancy-Aware Transformer for Video Wire Inpainting. A new dataset and technique for enhancing wire removal in videos—a frequent visual effect problem in movies and TV shows—have been presented by researchers.

News

Link description
Updates from Google DeepMind Alignment research. Following Anthropic, Google DeepMind has published some results from its alignment work. The most insightful piece covers applying sparse autoencoders to Gemini Ultra, a significant scale-up for interpretability research.
NVIDIA To Collaborate With Japan On Their Cutting-Edge ABCI-Q Quantum Supercomputer. Japan is rapidly advancing in the quantum and AI computing segments through large-scale developments built on NVIDIA's AI and HPC infrastructure.
Brave Search is adopting AI to answer your queries. Privacy-focused search engine Brave announced Wednesday that it is revamping its answer engine to return AI-powered synthesized answers. The new feature is available to users across the globe.
Llama 3 is not very censored. Llama 3 feels significantly less censored than its predecessor. The Llama 3 models have substantially lower false refusal rates, with less than 1⁄3 the number of false refusals when compared to Llama 2, making it possible to discuss a wider range of interesting topics!
OpenAI's GPT-4 can exploit real vulnerabilities by reading security advisories. Researchers have shown that OpenAI's GPT-4 model outperforms other models and tools like vulnerability scanners, with an 87% success rate in autonomously exploiting security vulnerabilities listed in CVE advisories.
US Air Force confirms first successful AI dogfight. The US Air Force is putting AI in the pilot’s seat. In an update on Thursday, the Defense Advanced Research Projects Agency (DARPA) revealed that an AI-controlled jet successfully faced a human pilot during an in-air dogfight test carried out last year.
Intel completes assembly of first commercial High-NA EUV chipmaking tool — addresses cost concerns, preps for 14A process development in 2025. Intel Foundry announced Thursday that it had completed the assembly of the industry's first commercial High Numerical Aperture (High-NA) Extreme Ultraviolet (EUV) machine in its D1X fab in Oregon -- an important milestone as the company readies research and development for its 14A process in 2025.
Adobe previews AI innovations to advance professional video workflows. With the help of its Firefly video model, Adobe is incorporating generative AI video tools into Premiere Pro, which include new features for shot extension, object addition/removal, and text-to-video functionality. The changes are intended to improve the effectiveness and creativity of video creation. They include a technological preview and the broad availability of AI-powered audio workflows.
The Ray-Ban Meta Smart Glasses have multimodal AI now. It can be handy, confidently wrong, and just plain finicky — but smart glasses are a much more comfortable form factor for this tech.
OpenAI shrugs off Meta’s Llama 3 ascent with new enterprise AI features. Even as Meta’s new Llama 3 has quickly rocketed up the charts of most-used and most customized large language models (LLMs), the rival company that ushered in the generative AI era, OpenAI, is shrugging off the competition by introducing new enterprise-grade features for building and programming atop its GPT-4 Turbo LLM and other models.
Gurman: Apple Working on On-Device LLM for Generative AI Features. Writing in his "Power On" newsletter, Gurman said that Apple's LLM underpins upcoming generative AI features. "All indications" apparently suggest that it will run entirely on-device, rather than via the cloud like most existing AI services.
Los Angeles is using AI in a pilot program to try to predict homelessness and allocate aid. In Los Angeles, the Homelessness Prevention Program uses predictive AI to identify individuals and families at risk of becoming homeless, offering aid to help them get stabilized and remain housed.
Startup Uses AI To Edit Human Data. A team of researchers at a Berkeley-based startup called Profluent say they've used generative AI technologies to edit human DNA. As the New York Times reports, the startup fed huge amounts of biological data into a large language model (LLM) to come up with new editors based on the groundbreaking gene-editing technique CRISPR, as detailed in a yet-to-be-peer-reviewed paper.
Apple releases OpenELM: small, open source AI models designed to run on-device. Just as Google, Samsung and Microsoft continue to push their efforts with generative AI on PCs and mobile devices, Apple is moving to join the party with OpenELM, a new family of open-source large language models (LLMs) that can run entirely on a single device rather than having to connect to cloud servers.
Eric Schmidt-backed Augment, a GitHub Copilot rival, launches out of stealth with $252M. In a recent StackOverflow poll, 44% of software engineers said that they use AI tools as part of their development processes now and 26% plan to soon. Gartner estimates that over half of organizations are currently piloting or have already deployed AI-driven coding assistants and that 75% of developers will use coding assistants in some form by 2028.
Sakana releases Japanese image model. A high-speed image generation model optimized for Japanese-language prompts.
Generative A.I. Arrives in the Gene Editing World of CRISPR. Much as ChatGPT generates poetry, a new A.I. system devises blueprints for microscopic mechanisms that can edit your DNA. Generative A.I. technologies can write poetry and computer programs or create images of teddy bears and videos of cartoon characters that look like something from a Hollywood movie. Now, new A.I. technology is generating blueprints for microscopic biological mechanisms that can edit your DNA, pointing to a future when scientists can battle illness and diseases with even greater precision and speed than they can today.
FlexAI Launches with $30 Million in Seed Funding to Deliver Universal AI Compute. Ex-Apple, Intel, NVIDIA, and Tesla veterans rearchitect compute infrastructure to accelerate AI innovation. FlexAI, the universal AI compute company, today launched with $30 million (€28.5 million) in seed funding led by Alpha Intelligence Capital (AIC), Elaia Partners, and Heartcore Capital.
Report: Google will update Gemini Nano in time for Galaxy S25. Google’s Gemini AI models are constantly advancing, so it comes as no surprise that a new report claims Google will have a “version 2” of Gemini Nano available by the time the Galaxy S25 launches next year.
Microsoft’s heavy bet on AI pays off as it beats expectations in the latest quarter. World’s largest public company reports $61.86bn revenue after investing billions into artificial intelligence
Alphabet hails ‘once-in-a-generation’ AI opportunity as revenue rises. Shares surge after tech giant issues first-ever dividend and posts revenue of $80.5bn, up 15% since last year, despite staff turmoil
Meta value falls $190bn as investors react to plan to increase spending on AI. Shares slumped 15% after Mark Zuckerberg said AI spending would have to grow before Meta could make much revenue from products
Snowflake Arctic - LLM for Enterprise AI. The enterprise-grade LLM known as Snowflake Arctic, developed by the Snowflake AI Research Team, outperforms competitors in instruction-following benchmarks, coding, and SQL creation at a quarter of the usual cost. Arctic makes sophisticated LLM capabilities available to a larger audience by utilizing an open-source methodology and a distinctive design. Hugging Face offers the model, which will also be incorporated into other platforms and services.
Nvidia acquires AI workload management startup Run:ai for $700M, sources say. Nvidia is acquiring Run:ai, a Tel Aviv-based company that makes it easier for developers and operations teams to manage and optimize their AI hardware infrastructure. Terms of the deal aren’t being disclosed publicly, but two sources close to the matter tell TechCrunch that the price tag was $700 million
Apple has acquired the Paris-based artificial intelligence startup Datakalab amid its push to deliver on-device AI tools.
Drake Uses AI Tupac and Snoop Dogg Vocals on ‘Taylor Made Freestyle,’ References Taylor Swift’s New Album ‘The Tortured Poets Department’. On Friday night (April 19), the rapper released a song on his social media entitled “Taylor Made Freestyle,” which uses AI vocals from Tupac Shakur and Snoop Dogg on a stopgap between diss records as he awaits Kendrick Lamar’s reply to his freshly released “Push Ups.”

Resources

Link description
Fine-tune Llama 3 with ORPO. ORPO is a new exciting fine-tuning technique that combines the traditional supervised fine-tuning and preference alignment stages into a single process. This reduces the computational resources and time required for training. Moreover, empirical results demonstrate that ORPO outperforms other alignment methods on various model sizes and benchmarks.
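TRL ships an ORPOTrainer that implements this. The sketch below is a minimal illustration, not the post's exact recipe: the model id, dataset id, and hyperparameters are placeholders, and the preference dataset is assumed to expose prompt/chosen/rejected text columns.

```python
# Minimal sketch of ORPO fine-tuning with TRL; the model id, dataset id, and
# hyperparameters are placeholders, not the blog post's exact recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "meta-llama/Meta-Llama-3-8B"            # placeholder; requires access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Preference data with "prompt", "chosen", and "rejected" text columns (placeholder id).
dataset = load_dataset("your-org/preference-data", split="train")

args = ORPOConfig(
    output_dir="llama3-orpo",
    beta=0.1,                          # weight of the odds-ratio preference term
    per_device_train_batch_size=1,
    num_train_epochs=1,
)
trainer = ORPOTrainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```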
Mistral Common. Mistral-common is a set of tools to help you work with Mistral models. Our first release contains tokenization. Our tokenizers go beyond the usual text <-> tokens, adding parsing of tools and structured conversation. We also release the validation and normalization code that is used in our API.
LongEmbed. This repository is the official implementation for the paper "LongEmbed: Extending Embedding Models for Long Context Retrieval"
FineWeb: 15T high quality web tokens. The most recent Llama 3 models were trained on 15T tokens. This new dataset, a large deduplicated corpus derived from Common Crawl, yields high-quality models.
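The data streams straight from the Hub if you want to inspect it; the repository id and config name below are assumptions based on the public release.

```python
# Streaming a few documents from FineWeb; the Hub id and config name are
# assumptions based on the public release.
from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
for i, doc in enumerate(ds):
    print(doc["text"][:200].replace("\n", " "))
    if i == 2:
        break
```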
A Visual Guide to Vision Transformers. This is a visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art performance on image classification tasks. This guide will walk you through the key components of Vision Transformers in a scroll story format, using visualizations and simple explanations to help you understand how these models work and what the flow of the data through the model looks like.
The Cauldron VLM data. 50 language and vision datasets merged into a single format to enable better model training.
MAexp: A Generic Platform for RL-based Multi-Agent Exploration. MAexp is a generic, high-efficiency platform designed for multi-agent exploration, encompassing a diverse range of scenarios and MARL algorithms.
Practitioners Guide to Triton. Triton is a high-level, Python-like language for writing low-level GPU kernels. This guide shows how to use it to significantly improve the efficiency of your AI models.
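For a flavor of what the guide covers, here is the canonical vector-add kernel in Triton; this follows the standard tutorial pattern rather than anything specific to the linked guide, and it requires a CUDA GPU.

```python
# Canonical Triton vector-add kernel: Python-style code compiled to a GPU kernel.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements                    # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                 # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```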
Efficiently fine-tune Llama 3 with PyTorch FSDP and Q-Lora. Great blog covering a quick and efficient fine-tuning method using PyTorch on the recent Llama 3 model.
Layer Pruning of Large Language Models. This repository hosts the unofficial implementation of a layer pruning strategy for Large Language Models (LLMs) based on the insights from the paper "The Unreasonable Ineffectiveness of the Deeper Layers" by Andrey Gromov et al.
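As a flavor of the idea, the sketch below drops a contiguous block of deeper decoder layers from a Llama-style model. It is a generic illustration rather than the repository's implementation; the model id and layer range are placeholders, and the paper recommends a brief healing fine-tune afterwards.

```python
# Generic sketch of depth pruning: remove a contiguous block of deeper decoder
# layers from a Llama-style model. Not the linked repo's implementation.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder id

drop_start, drop_end = 24, 30   # illustrative: remove layers 24..29
layers = model.model.layers
kept = [layer for i, layer in enumerate(layers) if not (drop_start <= i < drop_end)]
model.model.layers = nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

print(f"{len(layers)} -> {len(model.model.layers)} decoder layers")
```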
A Trivial Jailbreak Against Llama 3. A trivial programmatic Llama 3 jailbreak.
LLaMA3-Quantization. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMa3's capabilities when quantized to low bit-width. This exploration holds the potential to unveil new insights and challenges for low-bit quantization of LLaMa3 and other forthcoming LLMs, especially in addressing performance degradation problems that suffer in LLM compression.
Instructor: Structured LLM Outputs. Instructor is a Python library that makes it a breeze to work with structured outputs from large language models (LLMs). Built on top of Pydantic, it provides a simple, transparent, and user-friendly API to manage validation, retries, and streaming responses. Get ready to supercharge your LLM workflows!
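In practice the pattern is a Pydantic class passed as response_model. The sketch below follows the library's documented usage, assuming a recent version that provides from_openai (older releases use instructor.patch); the model name is illustrative.

```python
# Minimal sketch of Instructor's structured-output pattern: a Pydantic model
# passed as response_model. Assumes a recent instructor release.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserDetail(BaseModel):
    name: str
    age: int

client = instructor.from_openai(OpenAI())
user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,                   # validated, typed output
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old."}],
)
print(user)                                      # UserDetail(name='Jason', age=25)
```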
How does ChatGPT work? As explained by the ChatGPT team. Sometimes the best explanations of how a technology solution works come from the software engineers who built it. To explain how ChatGPT (and other large language models) operate, I turned to the ChatGPT engineering team.
BitBLAS. A collection of GPU-accelerated kernels for BitNet-style model training has been made available by Microsoft. These kernels offer a significant reduction in memory usage without sacrificing much accuracy.
CoreNet: A library for training deep neural networks. CoreNet is a deep neural network toolkit from Apple that allows researchers and engineers to train standard and novel small and large-scale models for a variety of tasks, including foundation models (e.g., CLIP and LLM), object classification, object detection, and semantic segmentation.
MaxText. MaxText is a high-performance, highly scalable, open-source LLM written in pure Python/Jax and targeting Google Cloud TPUs and GPUs for training and inference. MaxText achieves high MFUs and scales from single hosts to very large clusters while staying simple and "optimization-free" thanks to the power of Jax and the XLA compiler.
Cohere Toolkit. A chat interface with numerous useful capabilities for creating AI-powered chat apps has been made available by Cohere.
BAAI/Bunny-Llama-3-8B-V. Bunny is a family of lightweight but powerful multimodal models. It offers multiple plug-and-play vision encoders, like EVA-CLIP, SigLIP, and language backbones, including Llama-3-8B, Phi-1.5, StableLM-2, and Phi-2. To compensate for the decrease in model size, we construct more informative training data by curated selection from a broader data source.
Finetune Llama 3 - 2x faster + 6x longer context + 68% less VRAM. Fine-tune Llama 3 with up to 6x longer context length and dramatically less VRAM usage than Hugging Face with Flash Attention.

Perspectives

Link description
Self-Reasoning Tokens, teaching models to think ahead. This paper presents "reasoning tokens" for language models, which produce more tokens intended to forecast future tokens instead of the one that is immediately next, improving the model's anticipatory capacity. Experiments show notable increases in prediction accuracy, indicating that more sophisticated reasoning may be possible without the need for explicit step-by-step training.
Looking for AI use-cases. This article explores the potential for transformation and the existing constraints of generative AI, such as ChatGPT. It points out that although ChatGPT performs well on simple tasks like coding and creating drafts, it has trouble with more complicated tasks that call for specialized programming. It emphasizes the necessity of a vision that links AI solutions with useful applications and stresses how difficult it is to find and incorporate these into regular workflows.
Building reliable systems out of unreliable agents. Although AI agents aren't always dependable, they can be used to create dependable systems. A few strategies are to start with basic prompts and build an iterative improvement evaluation system; to deploy with observability; to use Retrieval Augmented Generation (RAG); to think about fine-tuning the model; and to use complementary agents to strengthen each other's weaknesses and increase the overall reliability of the system.
AI leads a service-as-software paradigm shift. Many VCs are talking about AI taking a bite out of the services business. Foundation Capital believes there is $4.6 trillion worth of work to be automated, thanks to AI: both for in-house functions and outsourced services. We're entering the era of Service-as-Software.
How AI is improving climate forecasts. Researchers are using various machine-learning strategies to speed up climate modeling, reduce its energy costs and hopefully improve accuracy.
Will AI accelerate or delay the race to net-zero emissions? As artificial intelligence transforms the global economy, researchers need to explore scenarios to assess how it can help, rather than harm, the climate.
The Biggest Open-Source Week in the History of AI. The last week of March 2024 will go down as a unique moment for Open-source LLMs. China's open-source scene hits the ground running.
‘Miss AI’ is billed as a leap forward – but feels like a monumental step backward. AI models take every toxic gendered beauty norm and bundle them up into completely unrealistic package
Why reliable AI requires a paradigm shift. Hallucinations are the fundamental barrier to the widespread use of AI, and they won't be solved anytime soon.
Should Apple Kill Siri and Start Over? The vision was grand: A personal assistant in your pocket, capable of understanding and acting upon a wide array of voice commands with ease and accuracy. So what happened?

meme-of-the-week

Back to index

ML news: Week 15 - 21 April

Research

Link description
DGMamba: Domain Generalization via Generalized State Space Model. DGMamba is a new framework that makes use of the novel state space model Mamba to address domain generalization problems.
Manipulating Large Language Models to Increase Product Visibility. Search engines' extensive language models can be manipulated by adding strategic text sequences to product descriptions to promote specific products.
MindBridge: A Cross-Subject Brain Decoding Framework. MindBridge is a single model that can interpret brain activity from several subjects.
Taming Stable Diffusion for Text to 360° Panorama Image Generation. With the help of text prompts, this project presents PanFusion, a dual-branch diffusion model that creates 360-degree panoramic images. To minimize visual distortion, the technique combines the Stable Diffusion approach with a customized panoramic branch, which is further improved by a special cross-attention mechanism.
The Physics of Language Models. Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores.
The Influence Between NLP and Other Fields. This study attempts to measure the level of influence that NLP has over 23 different fields of study. The cross-field engagement of NLP has decreased from 0.58 in 1980 to 0.31 in 2022, and CS dominates NLP citations, accounting for over 80% of citations with a focus on information retrieval, AI, and ML. In general, NLP is becoming more isolated, with a rise in intra-field citations and a fall in multidisciplinary works.
EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams. Researchers present a unique technique utilizing a fisheye event camera to address the difficulties in monocular egocentric 3D human motion capture, particularly in challenging lighting conditions and with rapid motions.
MPPE-DST: Mixture of Prefix Prompt Experts for LLM in Zero-Shot Dialogue State Tracking. Mixture of Prefix Prompt Experts (MPPE) is a novel approach that has been created by researchers to improve zero-shot dialogue state tracking. This technique allows knowledge to be transferred to new domains without requiring additional dataset annotations.
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding. A novel technique called Any2Point effectively transfers vision, language, and audio model capabilities into the 3D space while preserving spatial geometries.
Google’s new technique gives LLMs infinite context. A new paper by researchers at Google claims to give large language models (LLMs) the ability to work with the text of infinite length. The paper introduces Infini-attention, a technique that configures language models in a way that extends their “context window” while keeping memory and compute requirements constant.
Compression Represents Intelligence Linearly. The concept of compressing a training dataset into a model is the foundation of most contemporary AI: the better the compression, the better the model. This research thoroughly demonstrates that relationship, establishing a high correlation between benchmark scores and a model's capacity to compress novel text.
TransformerFAM: Feedback attention is working memory. Transformers may take care of their own latent representations thanks to TransformerFAM's feedback system. In theory, this might allow the model to process incredibly long inputs in context by adding repetition.
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length. Another lengthy context paper, but this one is about a new design that makes use of two cutting-edge weight updating techniques. In comparison, Llama 2 underperformed on the same training token count (2T). Additionally, at inference time, it scales to an indefinite context length.
STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking. Retrieval-guided language models are used by Stanford's innovative research system, Storm, to generate reports for particular subjects.
Homography Guided Temporal Fusion for Road Line and Marking Segmentation. Road lines and markings must be accurately segmented for autonomous driving, however this is difficult because of sunlight, shadows, and car occlusions. The Homography Guided Fusion (HomoFusion) module employs a pixel-by-pixel attention mechanism and a unique surface normal estimator to recognize and classify obscured road lines from video frames.
LaSagnA: vLLM-based Segmentation Assistant for Complex Queries. Vision Language Models (vLLMs) sometimes face difficulties in distinguishing absent objects and handling many queries per image. To address these problems, this work presents a novel question style and integrates semantic segmentation into the training procedure.
A collective AI via lifelong learning and sharing at the edge. Here we review recent machine learning advances converging towards creating a collective machine-learned intelligence. We propose that the convergence of such scientific and technological advances will lead to the emergence of new types of scalable, resilient, and sustainable AI systems.
Challenges and opportunities in translating ethical AI principles into practice for children. This Perspective first maps the current global landscape of existing ethics guidelines for AI and analyses their correlation with children.
Mistral 8x22B Report and Instruction Model. Mixtral 8x22B is our latest open model. It sets a new standard for performance and efficiency within the AI community. It is a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering unparalleled cost efficiency for its size.
Long-form music generation with latent diffusion. Stability AI's diffusion transformer model for audio synthesis.
LaDiC: A Diffusion-based Image Captioning Model. The use of diffusion models for image-to-text generation is revisited in this work. It presents the LaDiC architecture, which improves the image captioning tasks performance of diffusion models.
LINGO-2: Driving with Natural Language. This blog introduces LINGO-2, a driving model that links vision, language, and action to explain and determine driving behavior, opening up a new dimension of control and customization for an autonomous driving experience. LINGO-2 is the first closed-loop vision-language-action driving model (VLAM) tested on public roads.
Towards a general-purpose foundation model for computational pathology. We introduce UNI, a general-purpose self-supervised model for pathology, pre-trained using more than 100 million images from over 100,000 diagnostic H&E-stained WSIs (>77 TB of data) across 20 major tissue types.
A visual-language foundation model for computational pathology. We introduce CONtrastive learning from Captions for Histopathology (CONCH), a visual-language foundation model developed using diverse sources of histopathology images, biomedical text and, notably, over 1.17 million image–caption pairs through task-agnostic pretraining.
FedPFT: Federated Proxy Fine-Tuning of Foundation Models. Federated Proxy Fine-Tuning (FedPFT), a novel technique created by researchers, enhances foundation models' ability to adjust for certain tasks while maintaining data privacy.
In-Context Learning State Vector with Inner and Momentum Optimization. In this research, a novel method for improving In-Context Learning (ICL) in big language models such as GPT-J and Llama-2 is presented. The authors introduce a novel optimization technique that enhances compressed representations of the model's knowledge, referred to as "state vectors."
Decomposing and Editing Predictions by Modeling Model Computation. To determine each component's precise contribution to the final result, component modeling dissects a model's prediction process into its most fundamental parts, such as attention heads and convolution filters.

News

Link description
Grok-1.5 Vision Preview. Introducing Grok-1.5V, our first-generation multimodal model. In addition to its strong text capabilities, Grok can now process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. Grok-1.5V will be available soon to our early testers and existing Grok users.
Google’s new chips look to challenge Nvidia, Microsoft, and Amazon. Google’s new AI chip is a rival to Nvidia, and its Arm-based CPU will compete with Microsoft and Amazon
OpenAI Fires Researchers For Leaking Information. After months of leaks, OpenAI has apparently fired two researchers who are said to be linked to company secrets going public.
BabyLM Challenge. The goal of this shared task is to incentivize researchers with an interest in pretraining or cognitive modeling to focus their efforts on optimizing pretraining given data limitations inspired by human development. Additionally, we hope to democratize research on pretraining—which is typically thought to be practical only for large industry groups—by drawing attention to open problems that can be addressed on a university budget.
Dr. Andrew Ng appointed to Amazon’s Board of Directors. Dr. Andrew Ng is currently the Managing General Partner of AI Fund and is joining Amazon's Board of Directors.
Creating sexually explicit deep fake images to be made offense in UK. Offenders could face jail if the image is widely shared under a proposed amendment to criminal justice bill
Leisure centers scrap biometric systems to keep tabs on staff amid UK data watchdog clampdown. Firms such as Serco and Virgin Active pull facial recognition and fingerprint scan systems used to monitor staff attendance
Introducing OpenAI Japan. We are excited to announce our first office in Asia and we’re releasing a GPT-4 custom model optimized for the Japanese language.
Adobe’s working on generative video, too. Adobe says it’s building an AI model to generate video. But it’s not revealing when this model will launch, exactly — or much about it besides the fact that it exists.
OpenAI and Meta Reportedly Preparing New AI Models Capable of Reasoning. OpenAI and Meta are on the verge of releasing the next versions of their AI models that will supposedly be capable of reasoning and planning, the Financial Times reports. But, as with any hype coming out of big tech, take it all with a grain of salt.
Humane’s Ai Pin Isn't Ready to Replace Your Phone, But One Day It Might. AI-powered wearable Humane's Ai Pin has numerous technical problems, ranging from AI assistant glitches to music streaming concerns. Though future software updates are promised, the first-generation gadget lacks crucial functions and experiences performance gaps despite its intention to create an ambient computing experience. The Ai Pin is positioned as a companion device for a more present and less screen-focused lifestyle, yet it struggles to replace conventional smartphones despite its meticulous design.
TikTok may add AI avatars that can make ads. The new feature will let advertisers and TikTok Shop sellers generate scripts for a virtual influencer to read.
Google launches Code Assist, its latest challenger to GitHub’s Copilot. At its Cloud Next conference, Google on Tuesday unveiled Gemini Code Assist, its enterprise-focused AI code completion and assistance tool.
AI traces mysterious metastatic cancers to their source. An algorithm examines images of metastatic cells to identify the location of the primary tumor. Some stealthy cancers remain undetected until they have spread from their source to distant organs. Now scientists have developed an artificial intelligence (AI) tool that outperforms pathologists at identifying the origins of metastatic cancer cells that circulate in the body.
Apple's iOS 18 AI will be on-device preserving privacy, and not server-side. Apple's AI push in iOS 18 is rumored to focus on privacy with processing done directly on the iPhone, that won't connect to cloud services.
Introducing ALOHA Unleashed. Google DeepMind's ALOHA Unleashed is a program that pushes the boundaries of dexterity with low-cost robots and AI.
France's Mistral AI seeks funding at $5 bln valuation, The Information reports. French tech startup Mistral AI has been speaking to investors about raising several hundred million dollars at a valuation of $5 billion, The Information reported on Tuesday.
Stability AI is giving more developers access to its next-gen text-to-image generator. Developers can now access the API for the latest version of Stability AI’s text-to-image model.
European car manufacturer will pilot Sanctuary AI’s humanoid robot. Sanctuary AI announced that it will be delivering its humanoid robot to a Magna manufacturing facility. Based in Canada, with auto manufacturing facilities in Austria, Magna manufactures and assembles cars for several Europe’s top automakers, including Mercedes, Jaguar, and BMW. As is often the nature of these deals, the parties have not disclosed how many of Sanctuary AI’s robots will be deployed.
Google Maps will use AI to help you find out-of-the-way EV chargers. The company will use AI to summarize directions to EV chargers as well as reliability and wait times.
Introducing Meta Llama 3: The most capable openly available LLM to date. Today, we’re introducing Meta Llama 3, the next generation of our state-of-the-art open-source large language model. Llama 3 models will soon be available on AWS, Databricks, Google Cloud, Hugging Face, Kaggle, IBM WatsonX, Microsoft Azure, NVIDIA NIM, and Snowflake, and with support from hardware platforms offered by AMD, AWS, Dell, Intel, NVIDIA, and Qualcomm.
Google’s DeepMind AI can help engineers predict “catastrophic failure”. AI and a popular card game can help engineers predict catastrophic failure by finding the absence of a pattern.
OpenAI winds down AI image generator that blew minds and forged friendships in 2022. When OpenAI's DALL-E 2 debuted on April 6, 2022, the idea that a computer could create relatively photorealistic images on demand based on just text descriptions caught a lot of people off guard. The launch began an innovative and tumultuous period in AI history, marked by a sense of wonder and a polarizing ethical debate that reverberates in the AI space to this day. Last week, OpenAI turned off the ability for new customers to purchase generation credits for the web version of DALL-E 2, effectively killing it.
Stability AI lays off roughly 10 percent of its workforce. Stability AI laid off 20 employees just a day after announcing the expansion of access to its new flagship model. This comes after weeks of upheaval that saw its founding CEO leave the company.
The Humane AI Pin is lost in translation. Though the Humane AI Pin has a lot of drawbacks, its translation feature might be the worst.

Resources

Link description
LLM-friendly HTML conversion. Reader converts any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/. Get improved output for your agent and RAG systems at no cost.
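Usage really is just the prefix; the target URL in the example below is arbitrary.

```python
# Fetching an LLM-friendly version of any page through the r.jina.ai prefix.
import urllib.request

url = "https://r.jina.ai/https://en.wikipedia.org/wiki/Large_language_model"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode()[:500])   # cleaned, markdown-style page content
```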
Minimal Implementation of a D3PM (Structured Denoising Diffusion Models in Discrete State-Spaces), in pytorch. This is a minimal (400 LOC) but fully faithful PyTorch implementation of D3PM (Structured Denoising Diffusion Models in Discrete State-Spaces).
Cerule - A Tiny Mighty Vision Model. We train and release "Cerule", a tiny yet powerful Vision Language Model based on the newly released Google's Gemma-2b and Google's SigLIP.
Diffusion Models for Video Generation. This article looks at adapting image models, training diffusion models to produce video, and even producing video directly from an image model without further training.
Pile-T5. T5 remains a workhorse model for contemporary AI. EleutherAI retrained it with a more recent tokenizer and a longer training run, yielding a substantially stronger base model for encoding tasks.
GitHub Repository to File Converter. This Python script allows you to download and process files from a GitHub repository, making it easier to share code with chatbots that have large context capabilities but don't automatically download code from GitHub.
AI Index Report. The 2024 Index is our most comprehensive to date and arrives at an important moment when AI’s influence on society has never been more pronounced. This year, we have broadened our scope to more extensively cover essential trends such as technical advancements in AI, public perceptions of the technology, and the geopolitical dynamics surrounding its development.
Accelerating AI: Harnessing Intel(R) Gaudi(R) 3 with Ray 2.10. Ray 2.10, the most recent version from Anyscale, now supports Intel Gaudi 3. In addition to provisioning Ray Core Task and Actors on a Gaudi fleet directly through Ray Core APIs, developers can now spin up and manage their own Ray Clusters. For an enhanced experience, they can also utilize Ray Serve on Gaudi via Ray Serve APIs and set up Intel Gaudi accelerator infrastructure for use at the Ray Train layer.
Code with CodeQwen1.5. Notwithstanding these advancements, dominant coding assistants like Github Copilot, built upon proprietary LLMs, pose notable challenges in terms of cost, privacy, security, and potential copyright infringement. Today, we are delighted to introduce a new member of the Qwen1.5 open-source family, the CodeQwen1.5-7B, a specialized codeLLM built upon the Qwen1.5 language model. CodeQwen1.5-7B has been pre-trained with around 3 trillion tokens of code-related data. It supports an extensive repertoire of 92 programming languages, and it exhibits exceptional capacity in long-context understanding and generation with the ability to process information of 64K tokens.
OLMo 1.7–7B: A 24 point improvement on MMLU. Today, we’ve released an updated version of our 7 billion parameter Open Language Model, OLMo 1.7–7B. This model scores 52 on MMLU, sitting above Llama 2–7B and approaching Llama 2–13B, and outperforms Llama 2–13B on GSM8K.
Effort. With the use of the Effort library, one can alter in real-time how many calculations are made when inferring an LLM model, which can significantly increase performance while maintaining a high level of quality. Initial findings indicate that the Effort library has the potential to greatly increase LLM inference speed while preserving quality, even with modest implementation overhead. In order to further enhance the library, the author invites others to test the 0.0.1B version and offer feedback.
luminal. Luminal is a deep-learning library that uses composable compilers to achieve high performance.
SoccerNet Game State Reconstruction: End-to-End Athlete Tracking and Identification on a Minimap. A new dataset called SoccerNet-GSR aims to improve game state reconstruction from football video footage captured by a single camera.
AI Gateway. Gateway streamlines requests to 100+ open & closed source models with a unified API. It is also production-ready with support for caching, fallbacks, retries, timeouts, load balancing, and can be edge-deployed for minimum latency.
moondream. A tiny vision language model that kicks ass and runs anywhere.
Sentence Embeddings. Introduction to Sentence Embeddings. This series aims to demystify embeddings and show you how to use them in your projects. This first blog post will teach you how to use and scale up open-source embedding models. We’ll look into the criteria for picking an existing model, current evaluation methods, and the state of the ecosystem.
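The basic open-source workflow the series starts from looks like this; the model choice is illustrative.

```python
# Basic open-source embedding workflow: encode sentences and compare them
# with cosine similarity. The model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = ["A cat sits on the mat.", "A feline rests on a rug.", "Stock prices fell today."]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # low similarity
```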

Perspectives

Link description
Does AI need a “body” to become truly intelligent? Meta researchers think so. AIs that can generate videos, quickly translate languages or write new computer code could be world-changing, but can they ever be truly intelligent? Not according to the embodiment hypothesis, which argues that human-level intelligence can only emerge if intelligence is able to sense and navigate a physical environment, the same way babies can.
Micromanaging AI. Today's AI still has to be micromanaged: people must define tasks, review work frequently, and guide development at every step, much like managing high school interns whose motivation is high but whose competence is fairly low.
‘Eat the future, pay with your face’: my dystopian trip to an AI burger joint. If the experience of robot-served fast food dining is any indication, the future of sex robots is going to be very unpleasant
AI now beats humans at basic tasks — new benchmarks are needed, says the major report. Stanford University’s 2024 AI Index charts the meteoric rise of artificial intelligence tools. Artificial intelligence (AI) systems, such as the chatbot ChatGPT, have become so advanced that they now very nearly match or exceed human performance in tasks including reading comprehension, image classification, and competition-level mathematics, according to a new report.
Lethal dust storms blanket Asia every spring — now AI could help predict them. As the annual phenomenon once again strikes East Asia, scientists are hard at work to better predict how they will affect people.
From boom to burst, the AI bubble is only heading in one direction. No one should be surprised that artificial intelligence is following a well-worn and entirely predictable financial arc
You can't build a moat with AI. Differentiating AI is difficult, but the secret is in the unique data that is supplied into these models—not in the AI models themselves, which are becoming commodity-like. Take LLMs, for example. The performance of AI is strongly impacted by effective data engineering since applications need to integrate customer-specific data to respond accurately. Thus, rather than the AI technology itself, gaining a competitive edge in AI applications depends on creative data utilization.
Towards 1-bit Machine Learning Models. Recent works on extreme low-bit quantization such as BitNet and 1.58 bit have attracted a lot of attention in the machine learning community. The main idea is that matrix multiplication with quantized weights can be implemented without multiplications, which can potentially be a game-changer in terms of compute efficiency of large machine learning models.
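The central claim, that matrix multiplication with ternary weights needs no multiplications, is easy to see in a toy example (NumPy, purely illustrative).

```python
# Toy illustration of the 1.58-bit idea: with weights in {-1, 0, +1}, a matrix
# product reduces to additions and subtractions of input elements.
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))          # ternary weights
x = rng.standard_normal(8)

# multiplication-free evaluation: add where w = +1, subtract where w = -1
y = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

assert np.allclose(y, W @ x)                  # matches the ordinary matmul
print(y)
```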
From Idea to Integration: Four Steps for Founders Integrating AI. There is currently a great deal of pressure to incorporate AI into existing products. This brief, step-by-step guide will help you make the first move.
Use game theory for climate models that really help reach net zero goals. Many countries and companies have committed to eliminating their greenhouse gas emissions by the middle of the century. Yet most of these pledges lack a clear policy pathway.
A step along the path towards AlphaFold — 50 years ago. Paring down the astronomical complexity of the protein-folding problem
The democratization of global AI governance and the role of tech companies. Can non-state multinational tech companies counteract the potential democratic deficit in the emerging global governance of AI? We argue that although they may strengthen core values of democracy such as accountability and transparency, they currently lack the right kind of authority to democratize global AI governance.
The new NeuroAI. After several decades of developments in AI, has the inspiration that can be drawn from neuroscience been exhausted? Recent initiatives make the case for taking a fresh look at the intersection between the two fields.
Connecting molecular properties with plain language. AI tools such as ChatGPT can provide responses to queries on any topic, but can such large language models accurately ‘write’ molecules as output to our specification? Results now show that models trained on general text can be tweaked with small amounts of chemical data to predict molecular properties, or to design molecules based on a target feature.
MLOps vs. Eng: Misaligned Incentives and Failure to Launch? An in-depth discussion on the difficulties and solutions associated with implementing AI models in production, as well as how MLOps varies from traditional engineering, with industry experts. They talk about how to focus as a company to truly launch and why so few ML ideas ever reach production.
Is Attention All You Need? In order to overcome Transformers' shortcomings in long-context learning, generation, and inference speed, researchers are creating alternative designs that exhibit competitive quality at smaller scales but questionable scalability. Because of the quick development in this area, the Pareto frontier will likely keep growing, opening up more opportunities for lengthier context modeling and higher throughput inference, which will ultimately lead to a bigger variety of AI use cases.
The Shifting Dynamics And Meta-Moats Of AI. Managing complex short-, mid-, and long-term dynamics while retaining elite speed and execution, owning more of the stack, obtaining unique data, and utilizing synthetic data production are all necessary for building a successful AI business. As the AI sector develops, businesses will need to adjust to changing labor dynamics, comprehend the machine they are creating, and recognize the competitive axes on which they are based in order to forge long-lasting moats and differentiate themselves from the crowd.
Integration of AI in healthcare requires an interoperable digital data ecosystem. Electronic health information, including electronic health records, is needed to develop AI tools for health, but the seamless flow of data will require standards and interoperability.
To do no harm — and the most good — with AI in health care. Drawing from real-life scenarios and insights shared at the RAISE (Responsible AI for Social and Ethical Healthcare) conference, we highlight the critical need for AI in health care (AIH) to primarily benefit patients and address current shortcomings in healthcare systems such as medical errors and access disparities.
How to support the transition to AI-powered healthcare. To make health systems more sustainable in the long-term, incentivize artificial intelligence (AI) and digital technologies that are grounded on careful testing and real-world validation.
The increasing potential and challenges of digital twins. This issue of Nature Computational Science includes a Focus that highlights recent advancements, challenges, and opportunities in the development and use of digital twins across different domains.
The Space Of Possible Minds. Sophisticated AIs are stretching the boundaries of our understanding of what it is to be human and forcing us to consider how we embody agency and true understanding in a spectrum of intelligent beings. Creating mutually beneficial relationships between radically different entities, recognizing the similarities and differences among various forms of intelligence, and developing principled frameworks for scaling our moral concern to the essential qualities of being are all necessary to navigate this new terrain.
CUDA is Still a Giant Moat for NVIDIA. NVIDIA's proprietary interconnects and CUDA software environment, in addition to its hardware, continue to solidify the company's leadership in the AI market. The ease of use and performance optimization of CUDA makes it superior to alternatives like AMD's ROCM, guaranteeing that NVIDIA's GPUs continue to be the go-to option for AI tasks. NVIDIA's dominance in AI computing is strengthened by its investments in the CUDA ecosystem and community education.

meme-of-the-week

Back to index

ML news: Week 8 - 14 April

Research

Link description
Smartphone app could help detect early-onset dementia cause, study finds. App-based cognitive tests found to be proficient at detecting frontotemporal dementia in those most at risk. Scientists have demonstrated that cognitive tests done via a smartphone app are at least as sensitive at detecting early signs of frontotemporal dementia in people with a genetic predisposition to the condition as medical evaluations performed in clinics.
Unsegment Anything by Simulating Deformation. A novel strategy called "Anything Unsegmentable" aims to prevent digital photos from being divided into discrete categories by potent AI models, potentially resolving copyright and privacy concerns.
Evaluating LLMs at Detecting Errors in LLM Responses. A benchmark called ReaLMistake has been introduced by researchers to methodically identify mistakes in lengthy language model answers.
Dynamic Prompt Optimizing for Text-to-Image Generation. Researchers have created Prompt Auto-Editing (PAE), a technique that uses diffusion models such as Imagen and Stable Diffusion to advance text-to-image generation. With the use of online reinforcement learning, this novel method dynamically modifies the weights and injection timings of particular words to automatically improve text prompts.
No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation. A system called Seg-NN simplifies the 3D segmentation procedure. These models don't have the usual domain gap problems and can quickly adapt to new, unseen classes because they don't require a lot of pre-training.
Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions. The potential of Healthcare Foundation Models (HFMs) to transform medical services is examined in this extensive survey. These models are well-suited to adapt to different healthcare activities since they have been pre-trained on a variety of data sets. This could lead to an improvement in intelligent healthcare services in a variety of scenarios.
SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing. A new algorithm called SwapAnything can swap out objects in an image for other objects of your choosing without affecting the image's overall composition. It improves on other tools because it can replace any object, not only the focal point, and it excels at making the replacement blend seamlessly into the original image. It employs a pretrained diffusion model, concept vectors, and inversion.
UniFL: Improve Stable Diffusion via Unified Feedback Learning. UniFL is a technique that uses a fairly involved cascade of feedback steps to enhance the output quality of diffusion models. Together these steps improve the aesthetics, preference alignment, and visual quality of generated images. The method can be applied to enhance any image generation model, regardless of the underlying architecture.
Object-Aware Domain Generalization for Object Detection. In order to tackle the problem of object detection in single-domain generalization (S-DG), the novel OA-DG approach presents two new techniques: OA-Mix for data augmentation and OA-Loss for training.
VAR: a new visual generation method elevates GPT-style models beyond diffusion🚀 & Scaling laws observed. Code for the latest "next-resolution prediction" project, which presents the process of creating images as a progressive prediction of progressively higher resolution. A demo notebook and inference scripts are included in the repository. Soon, the training code will be made available.
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget. SqueezeAttention is a newly developed technique that optimizes the key-value cache of large language models by budgeting it layer by layer, yielding a 30% to 70% reduction in memory usage and a doubling of throughput; a toy sketch of the layer-wise budget idea appears after this list.
Measuring the Persuasiveness of Language Models. The Claude 3 Opus AI model was shown to closely resemble human persuasiveness in a study that looked at persuasiveness. Statistical tests and multiple comparison adjustments were used to ascertain this. Although not by a statistically significant amount, humans were marginally more convincing, highlighting a trend where larger, more complex models are becoming more credible. The most persuasive model was found to be Claude 3 Opus. The study's methodological reliability was validated by a control condition that demonstrated predictable low persuasiveness for undisputed facts.
DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation. DreamView presents a novel method for turning text descriptions into 3D objects that may be extensively customized from various angles while maintaining the object's overall consistency.
Hash3D: Training-free Acceleration for 3D Generation. By adopting a hashing scheme that exploits feature-map redundancy across similar camera positions and diffusion time-steps, Hash3D offers a novel way to accelerate 3D generative modeling.
MoCha-Stereo: Motif Channel Attention Network for Stereo Matching. An innovative method that keeps geometric structures that are sometimes lost in conventional stereo matching techniques is the Motif Channel Attention Stereo Matching Network (MoCha-Stereo).
Efficient and Generic Point Model for Lossless Point Cloud Attribute Compression. PoLoPCAC is a lossless point cloud attribute compression technique that combines excellent adaptability and great efficiency at different point cloud densities and scales.
Scaling Multi-Camera 3D Object Detection through Weak-to-Strong Eliciting. In order to boost surround refinement in Multi-Camera 3D Object Detection (MC3D-Det), a field enhanced by bird's-eye view technologies, this study introduces a weak-to-strong eliciting framework.
InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models. This project introduces InstantMesh, a framework with unparalleled quality and scalability that creates 3D meshes instantaneously from a single image.
Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers? A recent study examined how different layers within large language models handle distinct concepts, finding that earlier layers handle simpler tasks while more complicated tasks demand deeper processing.
SplatPose & Detect: Pose-Agnostic 3D Anomaly Detection. SplatPose is a revolutionary approach that uses 3D Gaussian splatting to address the problem of anomaly identification in 3D objects from different positions.
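
For the SqueezeAttention entry above, the snippet below is a minimal, hypothetical sketch of the layer-wise budget idea only: it assigns each layer a KV-cache budget proportional to an assumed importance score and evicts the oldest entries beyond that budget. It is not the authors' algorithm; the class names, the importance scores, and the eviction heuristic are all illustrative assumptions.

```python
from collections import deque

class LayerKVCache:
    """Toy per-layer KV cache with a fixed entry budget (illustrative only)."""
    def __init__(self, budget):
        self.budget = budget
        self.entries = deque()          # each entry stands in for one token's (K, V) pair

    def append(self, kv):
        self.entries.append(kv)
        while len(self.entries) > self.budget:
            self.entries.popleft()      # evict the oldest entry once over budget

def allocate_budgets(importance, total_budget):
    """Split a global KV budget across layers in proportion to importance scores."""
    norm = sum(importance)
    return [max(1, int(total_budget * s / norm)) for s in importance]

# Hypothetical example: 4 layers, with early layers assumed more important here.
importance = [1.0, 0.8, 0.5, 0.3]
budgets = allocate_budgets(importance, total_budget=2048)
caches = [LayerKVCache(b) for b in budgets]
print(budgets)  # [787, 630, 393, 236]
```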

News

Link description
Facebook and Instagram to label digitally altered content ‘made with AI’. Parent company Meta also to add ‘high-risk’ label to AI-altered content that deceives the public on "a matter of importance"
Google considering charge for internet searches with AI, reports say. Cost of artificial intelligence service could mean leaders in sector turning to subscription models
Apple lays off 600 workers in California after shuttering self-driving car project. Tech company cuts employees from eight offices in Santa Clara in its first big wave of post-pandemic job cuts
AMD to open source Micro Engine Scheduler firmware for Radeon GPUs. AMD plans to document and open source its Micro Engine Scheduler (MES) firmware for GPUs, giving users more control over Radeon graphics cards.
Investors in talks to help Elon Musk's xAI raise $3 billion: report. Investors close to Elon Musk are in talks to help his artificial-intelligence startup xAI raise $3 billion in a round that would value the company at $18 billion, the Wall Street Journal reported on Friday.
Introducing Command R+: A Scalable LLM Built for Business. Command R+, a potent, scalable LLM with multilingual coverage in ten important languages and tool use capabilities, has been launched by Cohere. It is intended for use in enterprise use scenarios.
Qwen1.5-32B: Fitting the Capstone of the Qwen1.5 Language Model Series. A growing consensus within the field now points to a model with approximately 30 billion parameters as the optimal “sweet spot” for achieving both strong performance and manageable resource requirements. In response to this trend, we are proud to unveil the latest additions to our Qwen1.5 language model series: Qwen1.5-32B and Qwen1.5-32B-Chat.
Nvidia Tops Llama 2, Stable Diffusion Speed Trials. Now that we’re firmly in the age of massive generative AI, it’s time to add two such behemoths, Llama 2 70B and Stable Diffusion XL, to MLPerf’s inferencing tests. Version 4.0 of the benchmark includes more than 8,500 results from 23 submitting organizations. As has been the case from the beginning, computers with Nvidia GPUs came out on top, particularly those with its H200 processor. But AI accelerators from Intel and Qualcomm were in the mix as well.
Rabbit partners with ElevenLabs to power voice commands on its device. Hardware maker Rabbit has partnered with ElevenLabs to power voice commands on its devices. Rabbit is set to ship the first set of r1 devices next month after getting a ton of attention at the Consumer Electronics Show (CES) at the start of the year.
DALL-E now lets you edit images in ChatGPT. Tweak your AI creations without leaving the chat.
Jony Ive and OpenAI's Sam Altman Seeking Funding for Personal AI Device. OpenAI CEO Sam Altman and former Apple design chief Jony Ive have officially teamed up to design an AI-powered personal device and are seeking funding, reports The Information.
Hugging Face TGI Reverts to Open Source License. Hugging Face had temporarily placed its well-known and powerful inference server under a non-commercial license to deter bigger companies from running a rival offering. Community involvement decreased while business outcomes remained unchanged, so the project is now back under a more permissive license.
Securing Canada’s AI advantage. To support Canada's AI industry, Prime Minister Justin Trudeau unveiled a $2.4 billion investment package beginning with Budget 2024. The package comprises tools to enable ethical AI adoption, support for AI start-ups, and financing for computational skills. These policies are intended to maintain Canada's competitive advantage in AI globally, boost productivity, and hasten the growth of jobs. The money will also be used to fortify the Artificial Intelligence and Data Act's enforcement as well as establish a Canadian AI Safety Institute.
Yahoo is buying Artifact, the AI news app from the Instagram co-founders. Instagram’s co-founders built a powerful and useful tool for recommending news to readers — but could never quite get it to scale. Yahoo has hundreds of millions of readers — but could use a dose of tech-forward cool to separate it from all the internet’s other news aggregators.
Now there’s an AI gas station with robot fry cooks. There’s a little-known hack in rural America: you can get the best fried food at the gas station (or in the case of a place I went to on my last road trip, shockingly good tikka masala). Now, one convenience store chain wants to change that with a robotic fry cook that it’s bringing to a place once inhabited by a person who may or may not smell like a recent smoke break and cooks up a mean fried chicken liver.
Elon Musk predicts superhuman AI will be smarter than people next year. His claims come with a caveat that shortages of training chips and growing demand for power could limit plans in the near term
Gemma Family Expands with Models Tailored for Developers and Researchers. Google announced the first round of additions to the Gemma family, expanding the possibilities for ML developers to innovate responsibly: CodeGemma for code completion and generation tasks as well as instruction following, and RecurrentGemma, an efficiency-optimized architecture for research experimentation.
Meta confirms that its Llama 3 open source LLM is coming in the next month. At an event in London on Tuesday, Meta confirmed that it plans an initial release of Llama 3 — the next generation of its large language model used to power generative AI assistants — within the next month.
Intel details Gaudi 3 at Vision 2024 — new AI accelerator sampling to partners now, volume production in Q3. Intel made a slew of announcements during its Vision 2024 event today, including deep-dive details of its new Gaudi 3 AI processors, which it claims offer up to 1.7X the training performance, 50% better inference, and 40% better efficiency than Nvidia’s market-leading H100 processors, but for significantly less money.
Apple's new AI model could help Siri see how iOS apps work. Apple's Ferret LLM could help allow Siri to understand the layout of apps in an iPhone display, potentially increasing the capabilities of Apple's digital assistant. Apple has been working on numerous machine learning and AI projects that it could tease at WWDC 2024. In a just-released paper, it now seems that some of that work has the potential for Siri to understand what apps and iOS itself looks like.
Aerospace AI Hackathon Projects. Together, 200 AI and aerospace experts created an amazing array of tools, including AI flight planners, AI air traffic controllers, and Apple Vision Pro flight simulators, as a means of prototyping cutting-edge solutions for the aviation and space industries.
AI race heats up as OpenAI, Google, and Mistral release new models. Launches within 12 hours of one another, and more activity expected in industry over summer
Next-generation Meta Training and Inference Accelerator. The next iteration of Meta's AI accelerator chip has been revealed. Its development was centered on throughput (11 TFLOPs at int8) and chip memory (128GB at 5nm).
Google’s Gemini Pro 1.5 enters public preview on Vertex AI. Gemini 1.5 Pro, Google’s most capable generative AI model, is now available in public preview on Vertex AI, Google’s enterprise-focused AI development platform. The company announced the news during its annual Cloud Next conference, which is taking place in Las Vegas this week.
Microsoft is working on sound recognition AI technologies capable of detecting natural disasters. The Redmond-based tech giant is working on performant sound recognition AI technologies that would make Copilot (and any other AI model, such as ChatGPT) capable of detecting upcoming natural disasters, such as earthquakes and storms.
Amazon scrambles for its place in the AI race. With its multibillion-dollar bet on Anthropic and its forthcoming Olympus model, Amazon is pushing hard to be a leader in AI.
Elon Musk's updated Grok AI claims to be better at coding and math. It'll be available to early testers 'in the coming days.' Elon Musk's answer to ChatGPT is getting an update to make it better at math, coding and more. Musk's xAI has launched Grok-1.5 to early testers with "improved capabilities and reasoning" and the ability to process longer contexts. The company claims it now stacks up against GPT-4, Gemini Pro 1.5, and Claude 3 Opus in several areas.
Anthropic's Haiku Beats GPT-4 Turbo in Tool Use - Sometimes. Anthropic's beta tool use API is better than GPT-4 Turbo in 50% of cases on the Berkeley Function Calling benchmark.
UK has real concerns about AI risks, says competition regulator. Concentration of power among just six big tech companies ‘could lead to winner takes all dynamics’
New bill would force AI companies to reveal use of copyrighted art. Adam Schiff introduces bill amid growing legal battle over whether major AI companies have made illegal use of copyrighted works
Randomness in computation wins computer-science ‘Nobel’. Computer scientist Avi Wigderson is known for clarifying the role of randomness in algorithms, and for studying their complexity. A leader in the field of computational theory is the latest winner of the A. M. Turing Award, sometimes described as the ‘Nobel Prize’ of computer science.
Introducing Rerank 3: A New Foundation Model for Efficient Enterprise Search & Retrieval. Rerank 3, the newest foundation model from Cohere, was developed with enterprise search and Retrieval Augmented Generation (RAG) systems in mind. The model can be integrated into any legacy application with built-in search functionality and is compatible with any database or search index. With a single line of code, Rerank 3 can improve search performance or lower the cost of running RAG applications with minimal effect on latency; a hedged usage sketch appears at the end of this list.
Meta to broaden labeling of AI-made content. Meta admits its current labeling policies are "too narrow" and that a stronger system is needed to deal with today's wider range of AI-generated content and other manipulated content, such as a January video that appeared to show President Biden inappropriately touching his granddaughter.
Mistral's New Model. The Mixtral-8x22B Large Language Model (LLM) is a pre-trained generative Sparse Mixture of Experts.
Waymo self-driving cars are delivering Uber Eats orders for the first time. Uber Eats customers may now receive orders delivered by one of Waymo’s self-driving cars for the first time in the Phoenix metropolitan area. It is part of a multiyear collaboration between the two companies unveiled last year.
JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars. This mixture-of-experts model was trained on a modest amount of compute using publicly available datasets. It performs on par with Meta's Llama 2 7B model, which was considerably more expensive to train.
Google blocking links to California news outlets from search results. Tech giant is protesting proposed law that would require large online platforms to pay ‘journalism usage fee’
House votes to reapprove law allowing warrantless surveillance of US citizens. Fisa allows for monitoring of foreign communications, as well as collection of citizens’ messages and calls
Tesla settles lawsuit over 2018 fatal Autopilot crash of Apple engineer. Walter Huang was killed when his car steered into a highway barrier and Tesla will avoid questions about its technology in a trial
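
For the Rerank 3 entry above, the snippet below sketches what the "single line" integration might look like on top of an existing retrieval pipeline. It assumes Cohere's Python SDK, a Rerank 3 model identifier such as rerank-english-v3.0, and a particular response shape; treat those details as assumptions to verify against Cohere's documentation rather than confirmed API facts.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # assumption: standard Cohere SDK client

# Candidate documents returned by a first-stage retriever (illustrative text).
docs = [
    "Reranking reorders candidate documents by relevance to the query.",
    "Our refund policy allows returns within 30 days of purchase.",
    "Gaudi 3 is an AI accelerator announced by Intel.",
]

# The single reranking call layered on top of the existing search results.
results = co.rerank(
    model="rerank-english-v3.0",   # assumed Rerank 3 model name
    query="How do I return a product?",
    documents=docs,
    top_n=2,
)

# Assumed response shape: a list of results with original index and relevance score.
for r in results.results:
    print(r.index, round(r.relevance_score, 3))
```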

Resources

Link description
SWE-agent. SWE-agent turns LMs (e.g. GPT-4) into software engineering agents that can fix bugs and issues in real GitHub repositories.
Schedule-Free Learning. Faster training without schedules - no need to specify the stopping time/steps in advance! A minimal usage sketch appears after this list.
State-of-the-art Representation Fine-Tuning (ReFT) methods. ReFT is a parameter-efficient approach to language model fine-tuning. It achieves strong performance at a significantly lower cost than even PEFT methods.
The Top 100 AI for Work – April 2024. Following our AI Top 150, we spent the past few weeks analyzing data on the top AI platforms for work. This report shares key insights, including the AI tools you should consider adopting to work smarter, not harder.
LLocalSearch. LLocalSearch is a completely locally running search aggregator using LLM Agents. The user can ask a question and the system will use a chain of LLMs to find the answer. The user can see the progress of the agents and the final answer. No OpenAI or Google API keys are needed.
llm.c. LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of CPython. For example, training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in a single file. It compiles and runs instantly, and exactly matches the PyTorch reference implementation.
AIOS: LLM Agent Operating System. AIOS, a Large Language Model (LLM) Agent operating system, embeds a large language model into Operating Systems (OS) as the brain of the OS, enabling an operating system "with soul" -- an important step towards AGI. AIOS is designed to optimize resource allocation, facilitate context switch across agents, enable concurrent execution of agents, provide tool service for agents, maintain access control for agents, and provide a rich set of toolkits for LLM Agent developers.
Anthropic Tool use (function calling). Anthropic has released a public beta that lets Claude call customized client-side tools supplied in API requests. To use the feature, developers include the 'anthropic-beta: tools-2024-04-04' header. Provided each tool is described with a complete JSON schema, Claude's capabilities can be extended; a hedged request sketch appears after this list.
Flyflow. Flyflow is API middleware for optimizing LLM applications: the same response quality with 5x lower latency, added security, and much higher token limits.
ChemBench. LLMs gain importance across domains. To guide improvement, benchmarks have been developed. One of the most popular ones is BIG-bench which currently only includes two chemistry-related tasks. The goal of this project is to add more chemistry benchmark tasks in a BIG-bench compatible way and develop a pipeline to benchmark frontier and open models.
Longcontext Alpaca Training. On an H100, train with context windows of more than 200k tokens using a new gradient accumulation offloading technique.
attorch. attorch is a subset of PyTorch's NN module, written purely in Python using OpenAI's Triton. Its goal is to be an easily hackable, self-contained, and readable collection of neural network modules whilst maintaining or improving upon the efficiency of PyTorch.
Policy-Guided Diffusion. A novel approach to agent training in offline environments is provided by policy-guided diffusion, which generates synthetic trajectories that closely match target policies and behavior. By producing more realistic training data, this method greatly enhances the performance of offline reinforcement learning models.
Ada-LEval. Ada-LEval is a pioneering benchmark to assess the long-context capabilities with length-adaptable questions. It comprises two challenging tasks: TSort, which involves arranging text segments into the correct order, and BestAnswer, which requires choosing the best answer to a question among multiple candidates.
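
As referenced in the Schedule-Free Learning entry above, here is a minimal usage sketch assuming the schedulefree package's AdamWScheduleFree optimizer. The mode-switching calls reflect my understanding of that interface and should be checked against the project's README; the tiny model and data are placeholders.

```python
import torch
import schedulefree  # assumed package name for the Schedule-Free Learning release

model = torch.nn.Linear(10, 1)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

# No learning-rate schedule and no preset number of steps: train for as long as you like.
model.train()
optimizer.train()                      # schedule-free optimizers track a training mode
for step in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()
optimizer.eval()                       # switch to the averaged weights before evaluation
```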
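
For the Anthropic tool use entry above, this is a hedged sketch of a raw HTTP request built around the details given in the entry (the beta header and a JSON tool definition). The endpoint, version header, model name, and the exact payload field names beyond that are assumptions to verify against Anthropic's documentation; get_weather is a hypothetical client-side tool.

```python
import requests

# Assumed endpoint and version header; the beta header comes from the entry above.
resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": "YOUR_ANTHROPIC_KEY",
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "tools-2024-04-04",
        "content-type": "application/json",
    },
    json={
        "model": "claude-3-opus-20240229",
        "max_tokens": 512,
        "tools": [{
            "name": "get_weather",                      # hypothetical client-side tool
            "description": "Get the current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }],
        "messages": [{"role": "user", "content": "What's the weather in Rome?"}],
    },
)
print(resp.json())  # if Claude decides to call the tool, the response requests a tool invocation
```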

Perspectives

Link description
"Time is running out": can a future of undetectable deep-fakes be avoided?. Tell-tale signs of generative AI images are disappearing as the technology improves, and experts are scrambling for new methods to counter disinformation
Four Takeaways on the Race to Amass Data for A.I. To make artificial intelligence systems more powerful, tech companies need online data to feed the technology. Here’s what to know.
TechScape: Could AI-generated content be dangerous for our health? From hyperrealistic deep fakes to videos that not only hijack our attention but also our emotions, tech seems increasingly full of "cognito-hazards"
AI can help to tailor drugs for Africa — but Africans should lead the way. Computational models that require very little data could transform biomedical and drug development research in Africa, as long as infrastructure, trained staff, and secure databases are available.
Breaking news: Scaling will never get us to AGI. Because neural networks generalize poorly beyond their training data, which limits their reasoning and trustworthiness, additional methods beyond scaling will be needed to reach artificial general intelligence.
Americans’ use of ChatGPT is ticking up, but few trust its election information. It’s been more than a year since ChatGPT’s public debut set the tech world abuzz. And Americans’ use of the chatbot is ticking up: 23% of U.S. adults say they have ever used it, according to a Pew Research Center survey conducted in February, up from 18% in July 2023.
Can Demis Hassabis Save Google? Demis Hassabis, the founder of DeepMind, now leads Google's unified AI research division and hopes to keep the tech behemoth ahead of the competition with innovations like AlphaGo and AlphaFold. Despite these achievements, obstacles remain, including integrating AI into physical products and competition from rivals such as OpenAI and its ChatGPT. Having contributed substantially to AI, Hassabis must now work within Google's product strategy to capitalize on DeepMind's research breakthroughs.
Is ChatGPT corrupting peer review? Telltale words hint at AI use. A study of review reports identifies dozens of adjectives that could indicate text written with the help of chatbots.
AI-fuelled election campaigns are here — where are the rules? Political candidates are increasingly using AI-generated ‘softfakes’ to boost their campaigns. This raises deep ethical concerns.
How to break big tech’s stranglehold on AI in academia. Deep-learning artificial intelligence (AI) models have become an attractive tool for researchers in many areas of science and medicine. However, the development of these models is prohibitively expensive, owing mainly to the energy consumed in training them.
Ready or not, AI is coming to science education — and students have opinions. As educators debate whether it’s even possible to use AI safely in research and education, students are taking a role in shaping its responsible use.
‘Without these tools, I’d be lost’: how generative AI aids in accessibility. A rush to place barriers around the use of artificial intelligence in academia could disproportionately affect those who stand to benefit most.

meme-of-the-week

Back to index

ML news: Week 1 - 7 April

Research

Link description
TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes. Scholars have unveiled a novel methodology for comprehending outside surroundings, surmounting challenges such as variable conditions and insufficient data that had hitherto impeded progress.
Lane-Change in Dense Traffic with Model Predictive Control and Neural Networks. This work presents a control system that emphasizes collaboration with neighboring drivers to enable safe and seamless lane changes in congested traffic by combining AI and predictive algorithms.
Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs. It is difficult to run language models on phones because of latency, bandwidth, and power limitations. This study demonstrates how to reach 30 tokens/second generation for the capable Gemma 2B model using quantization, the elimination of KV cache copying, and other optimizations, roughly three times faster than other frameworks.
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models. Sometimes, given an input image, Visual Language Models (VLMs) are unable to provide a response to a question. Even cutting-edge VLMs like GPT-4V have difficulties with this. This paper suggests some possible enhancements and a benchmark for VLMs that encounter intractable problems.
Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction. With its revolutionary approach to 3D scene reconstruction, Total-Decom makes it simple to edit and manipulate photographs by precisely breaking down objects from several views with little effort on the part of the user.
Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Proposes the deep neural feature ansatz: neural feature learning occurs by up-weighting the features most influential on model output, a process formulated mathematically in terms of the average gradient outer product and supported by numerical experiments and theoretical results. The mechanism provides a backpropagation-free approach to feature learning in various machine learning models, including some that previously had no such capability; a short sketch of the average gradient outer product appears after this list.
Teaching robots the art of human social synchrony. Humanoid robots can now learn the art of social synchrony using neural networks.
Many-shot jailbreaking. Anthropic developed a jailbreaking technique that exploits long-context models. It has shared these findings with other organizations and implemented mitigations; this post describes the method and a few of those countermeasures.
R2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding. R2-Tuning is a technique created by researchers to comprehend videos by verbally cueing the system to recognize particular times.
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want. SPHINX-V, a multimodal large language model developed as part of the Draw-and-Understand project, aims to improve human-AI interaction through visual prompts.
RealKIE: Five Novel Datasets for Enterprise Key Information Extraction. Enterprise AI solutions depend on the ability to extract information from datasets. It is possible to gauge general algorithmic performance for RAG applications using these five new benchmark datasets.
DiJiang: Efficient Large Language Models through Compact Kernelization. Researchers have created a novel method called DiJiang that makes use of current Transformers to create faster, leaner models without requiring a significant amount of retraining.
WcDT: World-centric Diffusion Transformer for Traffic Scene Generation. This paper presents a novel approach to autonomous vehicle driving path generation that integrates transformers and diffusion models into a system dubbed the "World-Centric Diffusion Transformer" (WcDT).
SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects. In situations when conventional monocular detectors struggle to identify huge objects, a novel 3D detection technique called SeaBird succeeds.
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models. To evaluate whether AI can determine when a problem cannot be solved, this study presents the idea of Unsolvable Problem Detection (UPD) in Vision Language Models.
ASTRA - 3rd place solution for SoccerNet Action Spotting Challenge 2023. ASTRA is a Transformer-based model that can overcome issues such as action localization and data imbalance to recognize key moments in soccer matches.
Multi-Granularity Guided Fusion-in-Decoder. MGFiD introduces a multi-level evidence discernment strategy that improves the understanding and selection of pertinent information by question-answering systems.
Linear Attention Sequence Parallelism. With its creative application of linear attention, Linear Attention Sequence Parallel (LASP) presents a novel approach to effectively handling lengthy sequences in language models, outperforming conventional techniques.
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. One drawback of contemporary transformers is that every token consumes the same amount of compute, even though some tokens are far easier to predict than others. With this work, DeepMind paves the way for dynamic compute under a fixed cap by letting the model exit certain tokens early and spend fewer FLOPs on them, reaching the same performance with roughly 50% fewer FLOPs at generation time; a conceptual routing sketch appears after this list.
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation. With InstantStyle, image personalization takes a new turn by addressing the issue of style consistency without requiring intricate fine-tuning. This framework guarantees precise and consistent visual stylization, merging style intensity with text management with a seamless integration of style-specific sections and a clever division of style and content in images.
T-GATE: Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models. By splitting generation into a planning phase and a refining phase, TGATE presents an efficient approach to creating images. Fixing certain outputs early on not only simplifies the generation process but also, surprisingly, improves image quality.
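
The feature-learning entry above refers to the average gradient outer product (AGOP). A common formulation is AGOP(f, X) = (1/n) * sum_i grad_x f(x_i) grad_x f(x_i)^T, and the sketch below computes that quantity with autograd for a toy scalar-output network. It illustrates the quantity itself, not the paper's code; the model and data are placeholders.

```python
import torch

def agop(model, inputs):
    """Average gradient outer product for a scalar-output model (illustrative only)."""
    dim = inputs.shape[1]
    outer_sum = torch.zeros(dim, dim)
    for x in inputs:
        x = x.clone().requires_grad_(True)
        y = model(x.unsqueeze(0)).squeeze()          # scalar output f(x)
        (grad,) = torch.autograd.grad(y, x)          # input gradient grad_x f(x)
        outer_sum += torch.outer(grad, grad)         # accumulate the outer product
    return outer_sum / inputs.shape[0]

model = torch.nn.Sequential(torch.nn.Linear(5, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
X = torch.randn(64, 5)
M = agop(model, X)
print(M.shape)  # torch.Size([5, 5]); top eigenvectors point at the most influential input directions
```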
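
For the Mixture-of-Depths entry above, the snippet below is a minimal conceptual sketch of capacity-limited token routing: a router scores tokens, only the top-k tokens per sequence pass through the expensive sub-block, and the rest ride the residual path. It is a simplified illustration of the idea under my own assumptions, not DeepMind's implementation.

```python
import torch
import torch.nn as nn

class MixtureOfDepthsBlock(nn.Module):
    """Conceptual sketch: route only the top-k tokens through the heavy sub-block."""
    def __init__(self, d_model, capacity):
        super().__init__()
        self.router = nn.Linear(d_model, 1)          # scores how much compute a token "needs"
        self.block = nn.Sequential(                  # stand-in for a full attention + MLP block
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.capacity = capacity                     # max tokens per sequence that get full compute

    def forward(self, x):                            # x: (batch, seq, d_model)
        scores = self.router(x).squeeze(-1)          # (batch, seq)
        k = min(self.capacity, x.shape[1])
        top = scores.topk(k, dim=1).indices          # indices of the routed tokens
        out = x.clone()                              # unrouted tokens skip the block (residual only)
        batch_idx = torch.arange(x.shape[0]).unsqueeze(-1)
        routed = x[batch_idx, top]                   # (batch, k, d_model)
        gate = torch.sigmoid(scores[batch_idx, top]).unsqueeze(-1)
        out[batch_idx, top] = routed + gate * self.block(routed)
        return out

x = torch.randn(2, 16, 64)
y = MixtureOfDepthsBlock(d_model=64, capacity=4)(x)  # only 4 of the 16 tokens get full compute
print(y.shape)  # torch.Size([2, 16, 64])
```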

News

Link description
Announcing Grok-1.5. Grok-1.5 comes with improved reasoning capabilities and a context length of 128,000 tokens. Available on 𝕏 soon.
Microsoft & OpenAI planning $100 billion supercomputer Stargate AI. According to a report by The Information, Microsoft and OpenAI are reportedly planning a joint data center project that could reach $100 billion in cost. The project is said to culminate in the launch of a massive artificial intelligence supercomputer named “Stargate” by 2028.
In One Key A.I. Metric, China Pulls Ahead of the U.S.: Talent. China has produced a huge number of top A.I. engineers in recent years. New research shows that, by some measures, it has already eclipsed the United States.
Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters. Compared to Qwen1.5-7B, which contains 6.5 billion non-embedding parameters, Qwen1.5-MoE-A2.7B contains only 2.0 billion non-embedding parameters, approximately one-third of Qwen1.5-7B’s size. Notably, it achieves a 75% decrease in training expenses and accelerates inference speed by a factor of 1.74, offering substantial improvements in resource utilization without compromising performance.
“The king is dead”—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time. Anthropic's Claude 3 is first to unseat GPT-4 for #1 since launch of Chatbot Arena in May '23.
Microsoft Copilot AI will soon run locally on PCs. Microsoft's Copilot AI service is set to run locally on PCs, Intel told Tom's Hardware. The company also said that next-gen AI PCs would require built-in neural processing units (NPUs) with over 40 TOPS (trillion operations per second) of power — beyond the capabilities of any consumer processor on the market.
Navigating the Challenges and Opportunities of Synthetic Voices. Using a 15-second audio sample, OpenAI's Voice Engine model creates speech that sounds like a speaker. Applications for it include support for non-verbal people, translation, and educational aids. Because of the possibility of abuse, OpenAI is deploying its technology cautiously.
Apple AI researchers boast useful on-device model that ‘substantially outperforms’ GPT-4. Nevertheless, Apple forges ahead with the promise of AI. In a newly published research paper, Apple’s AI gurus describe a system in which Siri can do much more than try to recognize what’s in an image. The best part? It thinks one of its models for doing this benchmarks better than ChatGPT 4.0.
Introducing Bezi AI. The capacity to ideate at the speed of thought with a limitless asset collection is a major turning point in the field of 3D design.
Robot, can you say ‘Cheese’? Columbia engineers build Emo, a silicon-clad robotic face that makes eye contact and uses two AI models to anticipate and replicate a person’s smile before the person actually smiles -- a major advance in robots predicting human facial expressions accurately, improving interactions, and building trust between humans and robots.
Billie Eilish, Nicki Minaj, Stevie Wonder, and more musicians demand protection against AI. Letter signed by more than 200 artists makes broad ask that tech firms pledge to not develop AI tools to replace human creatives
US and UK announce formal partnership on artificial intelligence safety. Countries sign memorandum to develop advanced AI model testing amid growing safety concerns
OpenAI deems its voice cloning tool too risky for general release. Delaying the Voice Engine technology rollout minimizes the potential for misinformation in an important global election year
DrugGPT: new AI tool could help doctors prescribe medicine in England. New tool may offer prescription ‘safety net’ and reduce the 237m medication errors made each year in England
New York City to test AI-enabled gun scanners in the subway system. Mayor Eric Adams announced the pilot program as part of an effort to deter violence, with plans to evaluate scanners at some stations
Twitter usage in the US ‘fallen by a fifth’ since Elon Musk’s takeover. App users for a social media site, rebranded as X, down by 23% since November 2022 according to Sensor Tower
Scientists turn to AI to make beer taste even better. Researchers in Belgium use artificial intelligence to improve taste, but say the skill of the brewer remains vital
Google AI could soon use a person’s cough to diagnose disease. Machine-learning system trained on millions of human audio clips shows promise for detecting COVID-19 and tuberculosis.
Microsoft is working on an Xbox AI chatbot. Xbox employees have been testing a virtual chatbot that can help with support queries and game refunds.
Sam Altman gives up control of OpenAI Startup Fund, resolving unusual corporate venture structure. OpenAI CEO Sam Altman has transferred formal control of the firm's eponymously named corporate venture fund to Ian Hathaway, OpenAI confirmed to TechCrunch.
You can now use ChatGPT without an account. On Monday, OpenAI began opening up ChatGPT to users without an account. It described the move as part of its mission to “make tools like ChatGPT broadly available so that people can experience the benefits of AI.” It also gives the company more training data (for those who don’t opt out) and perhaps nudges more users into creating accounts and subscribing for superior GPT-4 access instead of the older GPT-3.5 model free users get.
GENERATIVE SF: MARKETPLACES IN AI EDITION. How Instacart and Faire use AI to boost productivity and better serve their customers.
Replit launches new product in race for AI coding assistants. A Silicon Valley AI coding startup is launching a new tool that it hopes will change the way companies develop software. Replit, valued at over $1 billion and backed by venture firms like Andreessen Horowitz and Khosla Ventures, says its new product, called Replit Teams, will allow developers to collaborate in real-time on software projects while an AI agent automatically fixes coding errors.
Samsung might ‘redefine’ Bixby with Galaxy AI after all. Samsung’s big Galaxy AI push this year skipped over its voice assistant, Bixby, but that might not be forever. Earlier this year when Galaxy AI made its debut, Samsung confirmed that Bixby wasn’t going away, but that it also didn’t really have plans for any new AI features within the voice assistant. Speaking to CNBC more recently, though, Samsung is looking at changing that.
George Carlin’s estate settles lawsuit over comedian’s AI doppelganger. Suit claimed Dudesy podcast violated Carlin’s copyright, calling it ‘a casual theft of a great American artist’s work’
Opera allows users to download and use LLMs locally. Web browser company Opera announced today it will now allow users to download and use large language models (LLMs) locally on their computer. This feature is first rolled out to Opera One users who get developer stream updates and will allow users to select from over 150 models from more than 50 families.
Introducing Stable Audio 2.0. Stable Audio 2.0 sets a new standard in AI-generated audio, producing high-quality, full tracks with coherent musical structures up to three minutes in length at 44.1kHz stereo. The new model introduces audio-to-audio generation by allowing users to upload and transform samples using natural language prompts. Stable Audio 2.0 was exclusively trained on a licensed dataset from the AudioSparx music library, honoring opt-out requests and ensuring fair compensation for creators.
Scientists create AI models that can talk to each other and pass on skills with limited human input. Scientists modeled human-like communication skills and the transfer of knowledge between AIs — so they can teach each other to perform tasks without a huge amount of training data.
Worldcoin Foundation open sources core components of the Orb’s software. For the Worldcoin Orb, Tools for Humanity has created a robust and safe computing environment that makes use of Arm Cortex M4 microcontrollers for real-time operations and NVIDIA Jetson for processing. The Orb does neural network inference using NVIDIA's TensorRT and runs Rust applications. It runs on Orb OS, a customized GNU/Linux distribution with an emphasis on security. For cryptography, the system incorporates a secure element, and for backend authentication, it provides trusted execution environments.
Report: Google might make SGE a paid feature, not working on ad-free Search. As the Search Generative Experience (SGE) nears its one-year anniversary, Google is reportedly considering making it a paid feature, but is not considering an ad-free offering.
Lambda Announces $500M GPU-Backed Facility to Expand Cloud for AI. Lambda, the GPU cloud company founded by AI engineers and powered by NVIDIA GPUs, today announced that it has secured a special purpose GPU financing vehicle of up to $500 million to fund the expansion of its on-demand cloud offering.
OpenAI expands its custom model training program. OpenAI is expanding a program, Custom Model, to help enterprise customers develop tailored generative AI models using its technology for specific use cases, domains, and applications.
Former Snap AI chief launches Higgsfield to take on OpenAI’s Sora video generator. OpenAI captivated the tech world a few months back with a generative AI model, Sora, that turns scene descriptions into original videos — no cameras or film crews required. But Sora has so far been tightly gated, and the firm seems to be aiming it toward well-funded creatives like Hollywood directors — not hobbyists or small-time marketers, necessarily.
Tesla Raising Pay for AI Engineers To Counter Poaching, Musk Says. Tesla is raising pay for its artificial intelligence (AI) engineers as it fends off poaching from the likes of OpenAI, Chief Executive Officer (CEO) Elon Musk said in a series of posts on X. The plan to boost the pay of AI staff comes as the talent wars for people well-versed in the technology heats up.
YouTube Says OpenAI Training Sora With Its Videos Would Break Rules. The use of YouTube videos to train OpenAI’s text-to-video generator would be an infraction of the platform's terms of service, YouTube Chief Executive Officer Neal Mohan said.
AI-generated YC Demo Day video. A team from the latest YC cohort used AI to create their demo day video, an unprecedented move for a company in the program.

Resources

Link description
Your AI Product Needs Evals. How to construct domain-specific LLM evaluation systems. This post outlines my thoughts on building evaluation systems for LLM-powered AI products.
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild. VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts. To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference audio.
Interrupting Cow. Interruptions make conversations feel natural. Much work has focused on AI voice assistants that can be interrupted by humans, but systems that know much more than us should be able to interrupt us too.
EvoEval: Evolving Coding Benchmarks via LLM. With the help of a new benchmark suite called EvoEval, Large Language Models' coding prowess is put to the ultimate test.
Optimum-NVIDIA. Optimum-NVIDIA delivers the best inference performance on the NVIDIA platform through Hugging Face. Run LLaMA 2 at 1,200 tokens/second (up to 28x faster than the baseline framework) by changing just a single line in your existing transformers code; a sketch of the change appears after this list.
OpenUI. Building UI components can be a slog. OpenUI aims to make the process fun, fast, and flexible. It's also a tool we're using at W&B to test and prototype our next-generation tooling for building powerful applications on top of LLMs.
openchat-3.5-0106-gemma. The highest performing Gemma model in the world. Trained with OpenChat's C-RLFT on openchat-3.5-0106 data. Achieving similar performance to Mistral-based openchat, and much better than Gemma-7b and Gemma-7b-it.
Generative AI for Beginners (Version 2) - A Course. Microsoft's well-liked course on low-code apps, prompting, vector databases, and LLMs is available on GitHub in version 2. There are eighteen lessons in it. Even though some of the material is aspirational, it's still a useful starting point for the industry.
Industry Documents Library (IDL). A huge, extremely high-quality OCR'd dataset of industrial PDF documents: 26M pages and 18B tokens.
SWE-agent. SWE-agent turns LMs (e.g. GPT-4) into software engineering agents that can fix bugs and issues in real GitHub repositories.
chug. A library to help with efficient training for multi-modal data. Initially focused on image & document + text tasks. Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.
Cosmopedia: how to create large-scale synthetic data for pre-training. The HuggingFace group demonstrates how to create synthetic data for language model pre-training by seeding, synthesizing, filtering, and scaling.
AutoQuant. HuggingFace models can be exported from this notebook into the following five quantization formats: GGUF, GPTQ, EXL2, AWQ, and HQQ.
AI Infrastructure Explained. Innovative applications of AI have captured the public’s imagination over the past year and a half. What’s less appreciated or understood is the infrastructure powering these AI-enabled technologies. But as foundational models get more powerful, we’ll need a strong technology stack that balances performance, cost, and security to enable widespread AI adoption and innovation.
Introducing world's largest synthetic open-source Text-to-SQL dataset. Hugging Face currently hosts 23 million text-to-SQL tokens ready for use. To help produce SQL queries from natural-language tasks, Gretel has assembled a sizable dataset, which can support RAG applications as well as synthetic data creation.
Write OpenAPI with TypeSpec. Compared to JSON or YAML, TypeSpec, an API specification language created at Microsoft, provides a more succinct and understandable format for writing OpenAPI. It solves the verbosity and lack of reusable components in OpenAPI by allowing the specification of API patterns as reusable components, which streamlines code production and governance at scale. This is done by drawing inspiration from TypeScript's syntax. The flexibility and productivity gains of TypeSpec may increase the appeal of developing applications using APIs first.
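
For the Optimum-NVIDIA entry above, the advertised one-line change is, to my understanding, swapping the transformers pipeline import for the optimum.nvidia one. The sketch below assumes that import path, an NVIDIA GPU environment with optimum-nvidia installed, and access to the example model, so verify the details against the project's README.

```python
# Before: from transformers import pipeline
from optimum.nvidia import pipeline   # the advertised single-line change (assumed import path)

# Everything downstream is used the same way as a regular transformers pipeline.
# The model choice is just an example; it is gated and requires access on the Hub.
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
print(pipe("Why are optimized inference engines fast?", max_new_tokens=64)[0]["generated_text"])
```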

Perspectives

Link description
How Autonomous Racing Is Pushing Self-Driving Cars Forward. The gritty reality of racing without drivers teaches us a lot about the future of autonomous cars.
Does AI need a “body” to become truly intelligent? Meta researchers think so. AIs that can generate videos, quickly translate languages or write new computer code could be world-changing, but can they ever be truly intelligent? Not according to the embodiment hypothesis, which argues that human-level intelligence can only emerge if intelligence is able to sense and navigate a physical environment, the same way babies can.
Nobody Knows How to Safety-Test AI. In line with government goals, Beth Barnes' NGO METR is working with prominent AI firms like OpenAI and Anthropic to create safety checks for sophisticated AI systems. The emphasis is on evaluating hazards, including AI autonomy and self-replication, with the understanding that safety assessments are still in their infancy and cannot ensure AI safety. Despite worries that the existing testing could not be sufficiently trustworthy to support the rapid progress of AI technologies, METR's work is viewed as pragmatic.
Beyond RPA: How LLMs are ushering in a new era of intelligent process automation. RPA failed to achieve the enterprise-wide deployments that were anticipated, notwithstanding a few early triumphs. Only 3% of businesses were able to successfully grow their RPA operations, according to a Deloitte report. Recent developments in AI have the potential to alter this. Because of its innovative features, LLMs are expected to drive at least a tenfold increase in market share for intelligent process automation over the next ten years.
We’re Focusing on the Wrong Kind of AI Apocalypse. When talking about AI's future, people frequently discuss dystopian scenarios rather than the present effects on jobs and misinformation. Instead of bringing about the end of the world, AI has the ability to change work into more fulfilling and productive tasks with careful integration.
How did a small developer of graphics cards for gamers suddenly become the third most valuable firm on the planet? By turning his computer chip-making company Nvidia into a vital component in the AI arms race, Jensen Huang has placed himself at the forefront of the biggest gold rush in tech history
‘It’s very easy to steal someone’s voice’: how AI is affecting video game actors. The increased use of AI to replicate the voice and movements of actors has benefits but some are concerned over how and when it might be used and who might be left short-changed
AI in Africa: Basics Over Buzz. AI’s transformative power is its utility for virtually every economic sector. However, nearly half of the population in sub-Saharan Africa lacks access to electricity, and businesses struggle under the burden of an electricity supply that is among the most expensive and unreliable on earth.
How scientists are making the most of Reddit. As X wanes, researchers are turning to Reddit for insights and data, and to better connect with the public.
Can lessons from infants solve the problems of data-greedy AI? Words and images experienced by an infant wearing sensors during their daily life have led to efficient machine learning, pointing to the power of multimodal training signals and the potentially exploitable statistics of real-life experience.
Full Steam Ahead: The 2024 MAD (Machine Learning, AI & Data) Landscape. This is our tenth annual landscape and “state of the union” of the data, analytics, machine learning, and AI ecosystem.
Building AI Models is faster and cheaper than you probably think. By training or optimizing their foundation models with YC's assistance, YC companies are dispelling the myth that creating AI models takes enormous resources. In just three months, they have accomplished amazing feats like creating original proteins and producing music of a high caliber. These 25 firms have produced creative AI solutions in a variety of industries by utilizing YC's finance and technical capabilities. They show that smaller teams can achieve major improvements in AI through creativity and strategic insights.
Chinese mourners turn to AI to remember and ‘revive’ loved ones. Growing interest in services that create digital clones of the dead as millions visit graves this week for tomb-sweeping festival
When Will the GenAI Bubble Burst? Generative AI might not live up to expectations. The unprofitability of the technology, security flaws, and the innate issue of hallucinations in language models are all causes for concern. The excitement around generative AI may begin to fade unless a ground-breaking model such as GPT-5 is published by the end of 2024, addressing important difficulties and providing a game-changing application.
Inside the shadowy global battle to tame the world's most dangerous technology. This article explores the intricate global attempts to control artificial intelligence (AI), which is considered to be one of the most powerful and dangerous technologies of our day.
How to win at Vertical AI. Vertical B2B applications, where AI agents and open APIs play a critical role in rebundling and generating new business value, are where artificial intelligence truly shines. Domain-specific models provide vertical AI with an advantage in the near term, but horizontal integration into larger ecosystems is necessary for long-term success. AI agents make it possible to rebundle workflows, which transforms management procedures and gives businesses new competitive advantages across a range of industries.
Where AI Thrives, Religion May Struggle. According to a study headed by Adam Waytz and Joshua Conrad Jackson, there may be a correlation between a drop in religious beliefs and growing exposure to robotics and AI. Higher robotization countries have higher declines in religiosity. According to the study, those whose occupations involved a lot of AI had a much lower likelihood of believing in God. These associations suggest that automation technologies could have an impact on the loss of religion.

meme-of-the-week

Back to index

ML news: Week 25 - 31 March

Research

Link description
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework. This paper introduces Mora, a new multi-agent framework designed to close the gap in the field of generalist video generation, mimicking the capabilities of the leading model, Sora, across a range of tasks including text-to-video and video editing. Despite achieving performance close to Sora in various tasks, Mora still faces a holistic performance gap, marking a step towards future advancements in collaborative AI agents for video generation.
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models. Text-to-image diffusion models such as Stable Diffusion are altered by Open-Vocabulary Attention Maps (OVAM), which overcome earlier restrictions by enabling the creation of attention maps for any word.
HETAL: Efficient Privacy-preserving Transfer Learning with Homomorphic Encryption. Securing data privacy with Homomorphic Encryption, HETAL's novel method of transfer learning represents a major advancement in safe AI training.
HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression. This paper presents the Hash-grid Assisted Context (HAC) framework, which outperforms existing standards by achieving over 75X compression of 3D Gaussian Splatting (3DGS) data.
Shadow Generation for Composite Image Using Diffusion model. This work overcomes earlier difficulties with form and intensity accuracy to present a novel approach to producing realistic shadows in picture composition. The addition of intensity modulation modules to ControlNet and the expansion of the DESOBA dataset allowed the researchers to achieve a considerable improvement in shadow production in pictures.
View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network. The View-Decoupled Transformer (VDT) was created by researchers to address the problem of detecting subjects from disparate camera perspectives, such as those obtained from ground and aerial cameras.
ElasticDiffusion: Training-free Arbitrary Size Image Generation. Text-to-image diffusion models can now generate images in different sizes and aspect ratios without the need for extra training thanks to ElasticDiffusion, an inventive decoding technique.
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model. The Large Multi-modal Model (LMM) is extended by PSALM, which adds a mask decoder and a flexible input schema to perform well in a range of picture segmentation tasks. This method not only gets beyond the drawbacks of text-only outputs, but also makes it possible for the model to comprehend and categorize complicated images with ease.
Compositional Inversion for Stable Diffusion Models. In order to solve overfitting problems, researchers have devised a novel technique to enhance the way AI generates individualized visuals. This method guarantees that the thoughts are represented in the images in a more varied and balanced manner.
Residual Dense Swin Transformer for Continuous Depth-Independent Ultrasound Imaging. With arbitrary-scale super-resolution, RDSTN is a novel network that addresses the trade-off between field-of-view and picture quality in ultrasound imaging.
UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity. A new standard for text-based person retrieval is UFineBench. To aid AI in comprehending and locating persons in photos, it makes use of thorough descriptions.
SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process. By understanding refinement as a data creation process, SegRefiner is a novel model-agnostic approach that enhances object mask quality in a variety of segmentation applications. Through the use of a discrete diffusion method, it fine-tunes coarse masks pixel by pixel, improving border metrics and segmentation.
VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting. Our suggestion is the VMRNN cell, a novel recurrent unit that combines the advantages of LSTM and Vision Mamba blocks. Our comprehensive tests demonstrate that, despite retaining a reduced model size, our suggested strategy achieves competitive outcomes on a range of benchmarks.
Salience-DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement. In order to balance computing economy and accuracy, this research presents Salience DETR, which uses hierarchical salience filtering to improve query selection in object identification.
Universal Cell Embeddings: A Foundation Model for Cell Biology. We present the Universal Cell Embedding (UCE) foundation model. UCE was trained on a corpus of cell atlas data from humans and other species in a completely self-supervised way without any data annotations. UCE offers a unified biological latent space that can represent any cell, regardless of tissue or species. This universal cell embedding captures important biological variation despite the presence of experimental noise across diverse datasets.
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation. Using just one reference image and voice input, the AniPortrait framework can produce realistic animated portraits. This technique creates animations that are exceptional in terms of authentic facial expressions, a variety of poses, and great visual quality by first converting audio into 3D representations and then mapping them onto 2D facial landmarks.
PAID: (Prompt-guided) Attention Interpolation of Text-to-Image. Two methods, AID and its variant PAID, are designed to improve image interpolation by incorporating text and pose conditions. Without any additional training, these techniques produce interpolated images with better consistency, smoothness, and fidelity.
The Need for Speed: Pruning Transformers with One Recipe. With the help of the OPTIN framework, transformer-based AI models can now be more effective across a range of domains without requiring retraining. Through the use of an intermediate feature distillation technique, OPTIN is able to compress networks under certain conditions with minimal impact on accuracy.
Long-form factuality in large language models. Language models can be used to produce long-form factual content. Google has released a benchmark and dataset showing how different models perform. The research finds that language models outperform human annotators in most situations and offers advice on how to improve a model's factuality.
CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning. A novel method for Unsupervised Domain Adaptation (UDA) is called CoDA. It learns from variances at both the scene and image levels, which aids AI models in becoming more adaptive to unlabeled, difficult settings.
Backtracing: Retrieving the Cause of the Query. This method finds the precise content, from lectures to news articles, that prompts users to ask questions online. Backtracing aims to help content producers improve their work by locating and understanding the causes of confusion, curiosity, or emotional responses.
CT-CLIP. A foundation model that uses chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities.

News

Link description
Stability AI CEO resigns to ‘pursue decentralized AI’. Emad Mostaque’s resignation comes after key departures at the AI startup. See also the company announcement.
GTC Wrap-Up: ‘We Created a Processor for the Generative AI Era,’ NVIDIA CEO Says. Kicking off the biggest GTC conference yet, NVIDIA founder and CEO Jensen Huang unveils NVIDIA Blackwell, NIM microservices, Omniverse Cloud APIs, and more.
After raising $1.3B, Inflection is eaten alive by its biggest investor, Microsoft. In June 2023, Inflection announced it had raised $1.3 billion to build what it called “more personal AI.” The lead investor was Microsoft. Today, less than a year later, Microsoft announced that it was feasting on Inflection’s body and sucking the marrow from the bones (though I think they phrased it differently).
OpenAI is pitching Sora to Hollywood. The AI company is scheduled to meet with a number of studios, talent agencies, and media executives in Los Angeles next week to discuss partnerships, sources familiar with the matter told Bloomberg.
GitHub’s latest AI tool can automatically fix code vulnerabilities. It’s a bad day for bugs. Earlier today, Sentry announced its AI Autofix feature for debugging production code and now, a few hours later, GitHub is launching the first beta of its code-scanning autofix feature for finding and fixing security vulnerabilities during the coding process.
Researchers gave AI an 'inner monologue' and it massively improved its performance. Scientists trained an AI system to think before speaking with a technique called QuietSTaR. The inner monologue improved common sense reasoning and doubled math performance.
A California city is training AI to spot homeless encampments. For the last several months, a city at the heart of Silicon Valley has been training artificial intelligence to recognize tents and cars with people living inside, in what experts believe is the first experiment of its kind in the United States.
Sora: First Impressions. A compilation of Sora content generated by visual artists, designers, creative directors, and filmmakers.
Open Interpreter O1 Light. The 01 Light is a portable speech interface that controls your home computer. It can use your applications, see your screen, and learn new skills. The open-source 01 serves as the basis for a new generation of AI devices.
Character Voice For Everyone. Character Voice is a set of capabilities that elevates the Character.AI experience by enabling users to hear Characters conversing with them one-on-one. The company's bigger goal is to create a multimodal interface that will enable more smooth, simple, and interesting interactions. This is the first step toward that goal.
Cerebras Systems Unveils World’s Fastest AI Chip with Whopping 4 Trillion Transistors. Cerebras' new wafer-scale chip can train language models with up to 24 trillion parameters. PyTorch is supported natively.
The GPT-4 barrier has finally been broken. Four weeks ago, GPT-4 remained the undisputed champion: consistently at the top of every key benchmark, but more importantly the clear winner in terms of “vibes”. Today that barrier has finally been smashed. We have four new models, all released to the public in the last four weeks, that are benchmarking near or even above GPT-4.
China puts trust in AI to maintain largest high-speed rail network on Earth. The railway system is in better condition than when it was first built, according to peer-reviewed paper. Vast amounts of real-time data are processed by an artificial intelligence system in Beijing to identify problems before they arise, the engineers say
Microsoft to hold a special Windows and Surface AI event in May. Ahead of Build 2024, Microsoft CEO Satya Nadella will share the company’s ‘AI vision’ for both software and hardware.
AI ‘apocalypse’ could take away almost 8m jobs in the UK, says the report. Women, younger workers and lower paid are at most risk from artificial intelligence, says IPPR thinktank
Elon Musk says all Premium subscribers on X will gain access to AI chatbot Grok this week. Following Elon Musk’s xAI’s move to open source its Grok large language model earlier in March, the X owner on Tuesday said that the company formerly known as Twitter will soon offer the Grok chatbot to more paying subscribers.
OpenAI’s chatbot store is filling up with spam. TechCrunch found that the GPT Store, OpenAI’s official marketplace for GPTs, is flooded with bizarre, potentially copyright-infringing GPTs that imply a light touch where it concerns OpenAI’s moderation efforts.
Apple's big WWDC 2024 announcement may be an AI App Store. Apple's AI strategy may not necessarily be to only offer the best AI apps it can produce, but instead deliver an enhanced AI App Store that may debut at WWDC.
Mathematicians use AI to identify emerging COVID-19 variants. Scientists at The Universities of Manchester and Oxford have developed an AI framework that can identify and track new and concerning COVID-19 variants and could help with other infections in the future.
iOS 18 Reportedly Won't Feature Apple's Own ChatGPT-Like Chatbot. Bloomberg's Mark Gurman today reported that Apple is not planning to debut its own generative AI chatbot with its next major software updates, including iOS 18 for the iPhone. Instead, he reiterated that Apple has held discussions with companies such as Google, OpenAI, and Baidu about potential generative AI partnerships.
Introducing DBRX: A New State-of-the-Art Open LLM. Databricks introduces DBRX, an open, general-purpose LLM. Across a range of standard benchmarks, DBRX sets a new state of the art for established open LLMs.
Amazon invests another $2.75B in Anthropic — reportedly ‘largest’ in company history. Today, Amazon announced it has finalized that investment at the full planned amount, putting in another $2.75 billion atop the $1.25 billion it originally committed last year. According to CNBC, it is Amazon’s “largest venture investment yet.”
OpenAI Is Starting To Test GPT Earning Sharing. We’re partnering with a small group of US builders to test usage-based GPT earnings. Our goal is to create a vibrant ecosystem where builders are rewarded for their creativity and impact and we look forward to collaborating with builders on the best approach to get there.
Nvidia Tops MLPerf’s Inferencing Tests. Now that we’re firmly in the age of massive generative AI, it’s time to add two such behemoths, Llama 2 70B and Stable Diffusion XL, to MLPerf’s inferencing tests. Version 4.0 of the benchmark tests more than 8,500 results from 23 submitting organizations. As has been the case from the beginning, computers with Nvidia GPUs came out on top, particularly those with its H200 processor. But AI accelerators from Intel and Qualcomm were in the mix as well.
AI21 releases Jamba Language Model. The Mamba architecture is designed to outperform Transformers in efficiency while maintaining performance parity. Jamba is a new Mamba-based model that adds MoE layers. With a context length of 128k tokens, it runs at 1.6k tokens per second and scores 67% on the MMLU benchmark. Weights are available.
Hume introduces Empathic Voice Interface. Meet Hume’s Empathic Voice Interface (EVI), the first conversational AI with emotional intelligence.
Google starts testing AI overviews from SGE in main Google search interface. Google is now testing AI overviews in the main Google Search results, even if you have not opted into the Google Search Generative Experience labs feature. Google said this is an experience on a “subset of queries, on a small percentage of search traffic in the U.S.,” a Google spokesperson told Search Engine Land.
LLaVA-HR: High-Resolution Large Language-Vision Assistant. This repository contains the implementation of LLaVA-HR, a strong and efficient MLLM powered by our mixture-of-resolution adaptation.
Meta is adding AI to its Ray-Ban smart glasses next month. The Ray-Ban Meta Smart Glasses can do things like identify objects, monuments, and animals, as well as translate text.
Google bringing Gemini Nano to Pixel 8 with next Feature Drop. The Pixel 8 will get Gemini Nano, in developer preview, to power Summarize in Recorder and Gboard Smart Reply. The latter allows for “higher-quality smart replies” that have “conversational awareness” and should be generated faster. On the Pixel 8 Pro, it works with WhatsApp, Line, and KakaoTalk. Meanwhile, Summarize can take a recording and generate bullet points.

Resources

Link description
Building and testing C extensions for SQLite with ChatGPT Code Interpreter. This essay goes into great detail on how to use ChatGPT (or any other language model) to write code in an unfamiliar language for a difficult task. The author uses ChatGPT's code interpreter to write, compile, and download new bindings for the well-known SQLite database.
Official Mistral Fine-tuning Code. Mistral recently organized a hackathon. The company also published code for fine-tuning its language models, along with version 0.2 of the 7B model. The code is clear and easy to read.
Scalable Optimal Transport. A curated list of research works and resources on optimal transport in machine learning.
AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation. AdaIR presents an all-in-one image restoration network that addresses several types of image degradation, such as noise, blur, and haze, by using frequency mining and modulation.
Turbocharged Training: Optimizing the Databricks Mosaic AI stack with FP8. The Databricks Mosaic team continues to push language model training forward. In this post, they discuss their FP8 training stack and the potential benefits of reducing precision.
Low-latency Generative AI Model Serving with Ray, NVIDIA Triton Inference Server, and NVIDIA TensorRT-LLM. A new collaboration between Anyscale and NVIDIA will allow users to scale generative AI models into production. Customers can enhance resource management, observability, and autoscaling by utilizing the combined capabilities of Anyscale's managed runtime environment and Ray through this integration.
Discover The Best AI Websites & Tools. 11006 AIs and 233 categories in the best AI tools directory. AI tools list & GPTs store are updated daily by ChatGPT.
codel. Fully autonomous AI Agent that can perform complicated tasks and projects using a terminal, browser, and editor.
binary vector search is better than your FP32 vectors. A crucial component of RAG pipelines is searching over embedding vectors. Replacing each fp32 value with a single 0 or 1 bit can cut memory requirements by roughly 30x while retaining most of the quality, provided the coarse nearest-neighbor search over the binary codes is followed by a rerank of the shortlist with the original vectors (a minimal sketch appears after this list).
Deepfake Generation and Detection: A Benchmark and Survey. This thorough analysis explores the developments and difficulties around deepfake technology and its detection, emphasizing the arms race between those who produce deepfakes and those who are creating systems to identify them.
Evaluate LLMs in real-time with Street Fighter III. Make LLMs fight each other in real time in Street Fighter III. Each player is controlled by an LLM: the game sends the model a text description of the screen, and the model decides its character's next moves, which depend on its previous moves, its opponent's moves, and the power and health bars (a sketch of this control loop appears after this list).
Superpipe. Superpipe is a lightweight framework to build, evaluate and optimize LLM pipelines for structured outputs: data labeling, extraction, classification, and tagging. Evaluate pipelines on your own data and optimize models, prompts, and other parameters for the best accuracy, cost, and speed.
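
The binary vector search post above does not include code here; as a rough illustration of the idea (binarize by sign, retrieve by Hamming distance over packed bits, rerank the shortlist with the original fp32 vectors), here is a minimal NumPy sketch with illustrative function names:

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """Pack the sign bits of fp32 vectors into uint8 rows (roughly 32x smaller)."""
    return np.packbits(vectors > 0, axis=1)

def hamming_search(query_bits: np.ndarray, db_bits: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k database rows closest in Hamming distance."""
    xor = np.bitwise_xor(db_bits, query_bits)        # differing bits
    dists = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per row
    return np.argsort(dists)[:k]

def search(query: np.ndarray, db: np.ndarray, db_bits: np.ndarray,
           k: int = 10, shortlist: int = 100) -> np.ndarray:
    """Coarse binary retrieval followed by exact fp32 reranking."""
    candidates = hamming_search(binarize(query[None, :]), db_bits, shortlist)
    scores = db[candidates] @ query                  # exact dot products on the shortlist
    return candidates[np.argsort(-scores)[:k]]

# toy usage
db = np.random.randn(10_000, 256).astype(np.float32)
db_bits = binarize(db)
query = np.random.randn(256).astype(np.float32)
top = search(query, db, db_bits)
```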
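A hypothetical sketch of the Street Fighter III control loop described above; the project's real prompt format and API differ, and `call_llm`, `describe_screen`, and the move set are placeholders, not the project's code:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call that returns a move name."""
    raise NotImplementedError

def describe_screen(state: dict) -> str:
    """Turn the current game state into the text description sent to the LLM."""
    return (f"You are {state['me']['character']} with {state['me']['health']} HP "
            f"and {state['me']['power']} power. Your opponent "
            f"{state['opp']['character']} has {state['opp']['health']} HP. "
            f"Your last moves: {state['my_moves'][-3:]}. "
            f"Opponent's last moves: {state['opp_moves'][-3:]}. "
            "Reply with exactly one move from: punch, kick, block, jump, fireball.")

def next_move(state: dict) -> str:
    """Ask the LLM for the next move, falling back to 'block' on bad output."""
    move = call_llm(describe_screen(state)).strip().lower()
    return move if move in {"punch", "kick", "block", "jump", "fireball"} else "block"
```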

Perspectives

Link description
How People Are Really Using GenAI. There are many use cases for generative AI, spanning a vast number of areas of domestic and work life. Looking through thousands of comments on sites such as Reddit and Quora, the author’s team found that the use of this technology is as wide-ranging as the problems we encounter in our lives. The 100 categories they identified can be divided into six top-level themes, which give an immediate sense of what generative AI is being used for: Technical Assistance & Troubleshooting (23%), Content Creation & Editing (22%), Personal & Professional Support (17%), Learning & Education (15%), Creativity & Recreation (13%), Research, Analysis & Decision Making (10%).
Untangling concerns about consolidation in AI. Microsoft's recent acquisition of Inflection's talent sparked discussions about the largest tech giants having too much influence over AI research and development. Although they have the resources to work quickly on basic language models, there are legitimate concerns that the concentration of power would stifle transparency and innovation. This article examines the intricate trade-offs that arise as artificial intelligence becomes more widely used.
‘A landmark moment’: scientists use AI to design antibodies from scratch. Modified protein-design tool could make it easier to tackle challenging drug targets — but AI antibodies are still a long way from reaching the clinic.
TechScape: Is the US calling time on Apple’s smartphone domination? The tech giant fights regulators on both sides of the Atlantic, as the US government launches a grab-bag of accusations. Plus, Elon Musk’s bad day in court
Go, Python, Rust, and production AI applications. The roles of Python, Go, and Rust in developing AI applications are covered in this article: Go is used for larger-scale production, Python is used for developing AI models, and Rust is used for tasks requiring high performance. It highlights the significance of choosing the appropriate language for the task based on the ecosystem and tool fit, speculating that Go may replace Python as the production language. The author promotes connecting the Go and Python communities to improve the development of AI applications.
Trends in Synthetic Biology & AI in Drug Discovery in 2024. 2024 promises to be a historic year for artificial intelligence in drug discovery, with significant progress being made in synthetic biology. The synthesis of modular biological components and the impact of generative AI on research are two prominent themes that are highlighted in this article. The entry of Insilico Medicine's AI-powered candidate into Phase II clinical trials demonstrates how the combination of artificial intelligence and synthetic biology is speeding up the drug discovery process.
LLMs have special intelligence, not general, and that's plenty. In sophisticated cognitive tests, Anthropic's new AI model Claude 3 performs better than other models, including GPT-4, and above the average human IQ. Even with this success, Claude 3 still finds it difficult to solve simple puzzles and other basic tasks that people take for granted. Rather than having human-like general intelligence, LLMs may have a "special intelligence": they can creatively reflect back to us what they know.
AI SaaS Companies Will Be More Profitable. The deflationary impacts of AI in marketing, sales, operations, and software development could mean that while AI software companies may initially incur higher costs, they could end up being more profitable than traditional SaaS companies.
AI image generators often give racist and sexist results: can they be fixed? Researchers are tracing sources of racial and gender bias in images generated by artificial intelligence, and making efforts to fix them.
How AI is improving climate forecasts. Researchers are using various machine-learning strategies to speed up climate modelling, reduce its energy costs and hopefully improve accuracy.
Here’s why AI search engines really can’t kill Google. The AI search tools are getting better — but they don’t yet understand what a search engine really is and how we really use them.
Inside the shadowy global battle to tame the world's most dangerous technology. The problem of controlling AI is one that the world is now facing. Global leaders, tech executives, and legislators convened many high-profile meetings and conferences that exposed disagreements and differences over how to regulate this game-changing technology.
Hackers can read private AI-assistant chats even though they’re encrypted. All non-Google chat GPTs affected by side channel that leaks responses sent to users.
Towards 1-bit Machine Learning Models. Recent work on extreme low-bit quantization, such as BitNet and 1.58-bit models, has attracted a lot of attention in the machine learning community. The main idea is that matrix multiplication with quantized weights can be implemented without multiplications, which could be a game-changer for the compute efficiency of large machine learning models (see the toy sketch after this list).
AI escape velocity. The law of accelerating returns, which holds that progress is made at an exponential pace over time, was created by AI futurist Ray Kurzweil. Kurzweil covered a wide range of subjects in a recent talk, such as prospects that are only going to get better, the future of the AI economy, human relationships with AIs, lifespan escape velocity, and much more.
Plentiful, high-paying jobs in the age of AI. Experts in AI are investigating automating human functions, raising fears about job losses and declining wages. The belief that advances in AI would eventually render human labor obsolete, however, may not be accurate. Constraints like computer power and opportunity costs may mean that humans will still have jobs in an AI-dominated future, but this is not a given.
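
To make the "no multiplications" point from the 1-bit quantization item concrete, here is a toy NumPy sketch assuming ternary weights in {-1, 0, +1} (the 1.58-bit case). Real kernels pack weights and use specialized instructions; this is illustrative only:

```python
import numpy as np

def ternary_matmul(x: np.ndarray, w_ternary: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Compute y = x @ (scale * w_ternary) without multiplying by the weights.

    x: (batch, in_features) activations
    w_ternary: (in_features, out_features) with entries in {-1, 0, +1}
    scale: (out_features,) per-column scale applied after accumulation
    """
    out = np.zeros((x.shape[0], w_ternary.shape[1]), dtype=x.dtype)
    for j in range(w_ternary.shape[1]):
        plus = w_ternary[:, j] == 1
        minus = w_ternary[:, j] == -1
        # only additions and subtractions of activations; the lone multiply is the rescale
        out[:, j] = x[:, plus].sum(axis=1) - x[:, minus].sum(axis=1)
    return out * scale

# sanity check against a dense matmul
x = np.random.randn(4, 16).astype(np.float32)
w = np.random.choice([-1, 0, 1], size=(16, 8)).astype(np.float32)
scale = np.random.rand(8).astype(np.float32)
assert np.allclose(ternary_matmul(x, w, scale), x @ (w * scale), atol=1e-5)
```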

meme-of-the-week

Back to index

ML news: Week 18 - 24 March

Research

Link description
ScoreHMR: Score-Guided Diffusion for 3D Human Recovery. We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for solving inverse problems for 3D human pose and shape reconstruction. ScoreHMR mimics model fitting approaches, but alignment with the image observation is achieved through score guidance in the latent space of a diffusion model. Here, we show the application of our approach on videos, utilizing keypoint detections and score guidance with keypoint reprojection and temporal smoothness terms.
Cappy: Outperforming and boosting large multi-task language models with a small scorer. A little model called Cappy has been taught to accept instructions and a candidate's completion, then calculate how well the completion satisfies the instructions by returning a score. It performs better on this job than significantly bigger models, indicating that it may be applied as a generation and training feedback mechanism.
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation. demonstrates how LLM reasoning and generation in long-horizon generation tasks can be greatly enhanced by iteratively revising a chain of thoughts with information retrieval; the key idea is that each thought step is revised with pertinent retrieved information to the task query, the current and past thought steps; Retrieval Augmented Thoughts (RAT) is a zero-shot prompting approach that offers notable improvements over baselines that include vanilla RAG, zero-shot CoT prompting, and other baselines. RAT can be applied to various models such as GPT-4 and CodeLlama-7B to improve long-horizon generation tasks (e.g., creative writing and embodied task planning).
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking. outlines Quiet-STaR, a generalization of STaR that enables language models (LMs) to acquire reasoning skills that are more scalable and general; Quiet-STaR gives LMs the ability to produce justifications for each token to explain the future text; it suggests a token-wise parallel sampling approach that enhances LM predictions by producing internal thoughts effectively; REINFORCE is used to improve the rationale creation.
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM. suggests combining expert LLMs with a Mixture-of-Experts LLM as a more computationally efficient way to train LLMs. This method, called BTX, is shown to be more effective than training a single specialized LLM or a larger generalist LLM. It works by first training (in parallel) multiple copies of a seed LLM with specialized knowledge in different domains (i.e., expert LLMs), then combining them into a single LLM using MoE feed-forward layers. Finally, the entire unified model is fine-tuned.
Large language models surpass human experts in predicting neuroscience results. suggests using BrainBench as a benchmark to assess LLMs' capacity to forecast neuroscience outcomes; discovers that LLMs outperform experts in forecasting the results of experiments; an LLM that has been modified based on neuroscience literature has been demonstrated to do even better.
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer. Comprehensive literature analysis faces a problem due to the scientific literature's constant increase. Because of their ability to summarize, LLMs present a viable option; yet, they are not well-suited to the multimodal aspects that are common in scientific information. Uni-SMART (Universal Science Multimodal Analysis and Research Transformer) was created to fill this vacuum by understanding and analyzing the intricate multimodal data found in scientific publications.
Mechanics of Next Token Prediction with Self-Attention. Predicting the next token is a simple objective that gives rise to complex behavior. This work finds that a single self-attention layer trained with gradient descent decomposes the problem into two parts, hard retrieval and soft composition, which accounts for its strong overall performance and in-context learning.
Knowledge Distillation in YOLOX-ViT for Side-Scan Sonar Object Detection. By combining visual transformers with knowledge distillation, YOLOX-ViT presents a novel method for object recognition in underwater robots.
GroupContrast. GroupContrast combines semantic-aware contrastive learning with segment grouping to redefine self-supervised 3D representation learning.
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification. With an emphasis on object-centric information, this study presents a novel approach to object re-identification across images captured in different spectra, including RGB, near-infrared, and thermal imaging. The goal is to increase recognition accuracy by mitigating the effects of background noise.
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation. Stable Diffusion 3 is a powerful image generation model. This work presents Latent Adversarial Diffusion Distillation, which reduces the number of diffusion steps to 4 while keeping image quality constant.
Distilling Datasets Into Less Than One Image. Poster Dataset Distillation (PoDD): We propose PoDD, a new dataset distillation setting for a tiny, under 1 image-per-class (IPC) budget. In this example, the standard method attains an accuracy of 35.5% on CIFAR-100 with approximately 100k pixels, and PoDD achieves an accuracy of 35.7% with less than half the pixels (roughly 40k)
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control. MineDreamer is an AI bot that creatively uses state-of-the-art language and vision models to follow intricate commands in the Minecraft world.
DreamDA: Generative Data Augmentation with Diffusion Models. DreamDA presents a novel method of data augmentation by creating high-quality, diversified synthetic visuals that closely resemble the original data distribution using diffusion models.
Chain-of-Spot: Interactive Reasoning Improves Large Vision-language Models. The Interactive Reasoning method known as Chain-of-Spot (CoS) greatly improves the way Large Vision-Language Models (LVLMs) analyze and comprehend pictures. With CoS, LVLMs may obtain precise visual information without sacrificing picture quality by concentrating on specific regions of interest inside images in response to predetermined inquiries or commands.
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On. A new method for image-based virtual try-on is called StableVITON. This approach takes advantage of the creative capacity of diffusion models that have already been trained while paying attention to garment details. StableVITON discovers semantic correspondences in the latent space of a pre-trained model between clothing and the human body.
Diffusion-based Video Translation. FRESCO is a unique method that greatly enhances the spatial-temporal consistency in video translation tasks by combining intra-frame and inter-frame correspondences.
Generalized Consistency Trajectory Models. With the introduction of Generalized Consistency Trajectory Models (GCTMs), this effort improves the capabilities of diffusion models for tasks such as image restoration and editing. By translating between any two distributions in a single step, these models simplify the procedure and enable remarkably accurate and efficient image modification.
Introducing SceneScript, a novel approach for 3D scene reconstruction. A model developed by Meta Reality Labs may convert visual input into a three-dimensional (3D) representation of a scene. The 70m parameter model has exceptional stability and operates rapidly on the device.
Scalable Diffusion Models with State Space Backbone. A novel kind of diffusion model known as Diffusion State Space Models (DiS) uses a state space backbone for image data instead of the conventional U-Net. These models are effective at producing high-quality photos with little computing work and can manage long-range relationships.
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns. PuzzleVQA is a dataset created to evaluate the abstract reasoning capacity of large multimodal models such as GPT-4V.

News

Link description
Open Release of Grok-1. We are releasing the weights and architecture of our 314 billion parameter Mixture-of-Experts model, Grok-1.
Did OpenAI just accidentally leak the next big ChatGPT upgrade? OpenAI may have accidentally leaked details about a new AI model called GPT-4.5 Turbo. The leak suggests that GPT-4.5 Turbo will be faster, more accurate, and have a larger knowledge base than its predecessor.
Claude 3 Haiku: our fastest model yet. Today we’re releasing Claude 3 Haiku, the fastest and most affordable model in its intelligence class, with state-of-the-art vision capabilities and strong performance on industry benchmarks.
Midjourney debuts feature for generating consistent characters across multiple gen AI images. The popular AI image generating service Midjourney has deployed one of its most oft-requested features: the ability to recreate characters consistently across new images.
Apple researchers achieve breakthroughs in multimodal AI as company ramps up investments. Apple researchers have developed new methods for training large language models on both text and images, enabling more powerful and flexible AI systems, in what could be a significant advance for artificial intelligence and for future Apple products.
Introducing Stable Video 3D: Quality Novel View Synthesis and 3D Generation from Single Images. Today we are releasing Stable Video 3D (SV3D), a generative model based on Stable Video Diffusion, advancing the field of 3D technology and delivering greatly improved quality and view-consistency.
Google researchers unveil ‘VLOGGER’, an AI that can bring still photos to life. Google researchers have developed a new artificial intelligence system that can generate lifelike videos of people speaking, gesturing and moving — from just a single still photo. The technology, called VLOGGER, relies on advanced machine learning models to synthesize startlingly realistic footage, opening up a range of potential applications while also raising concerns about deepfakes and misinformation.
Microsoft has added the GPT-4 Turbo LLM to the free version of Copilot. Microsoft is boosting the performance of its Copilot generative AI chatbot today. It has been confirmed that all free Copilot users can now access the GPT-4 Turbo large language model from OpenAI.
Korean researchers power-shame Nvidia with new neural AI chip — claim 625 times less power draw, 41 times smaller. The new C-Transformer chip is claimed to be the world's first ultra-low power AI accelerator chip capable of large language model (LLM) processing.
Inflection co-founders leave for Microsoft AI. Karén Simonyan and Mustafa Suleyman are leaving Inflection to launch Microsoft AI. The next CEO will be Sean White. Additionally, a few Inflection senior team members are joining Microsoft AI.
Lilac acquired by Databricks. Lilac is a scalable, user-friendly tool for data scientists to search, cluster, and analyze any kind of text dataset with a focus on generative AI.
IBM and NASA build language models to make scientific knowledge more accessible. In a new collaboration, IBM and NASA created a suite of efficient language models by training on scientific literature. Based on the transformer architecture, these models can be used in a variety of applications, from classification and entity extraction to question-answering and information retrieval. These models achieve high performance across a variety of domains and can respond promptly. We have open-sourced the models on Hugging Face for the benefit of the scientific and academic community.
Introducing RAG 2.0. Retrieval augmented generation (RAG) is a technique for adding knowledge to a language model whose built-in knowledge can become stale. Unfortunately, outside of demonstrations, the current paradigm of "frozen RAG," in which just a portion of the pipeline is trained and the model itself is not updated, performs badly. This blog describes the next generation of RAG, where all the components are fine-tuned for the job at hand. In this system, an open model such as Mistral 7B can perform better than the conventional GPT-4 RAG.
Fitbit Using Google Gemini for New AI That Could Become Your Fitness Coach. Google is training Gemini on health data, and it's creating a new AI model for the Fitbit app that can give advice tailored to your needs.
Stable Diffusion maker leaves Stability AI. Robin Rombach helped build the tech that made Stability AI famous, now he's leaving the company
Introducing Copilot4D: A Foundation Model for Self-Driving. Waabi's Copilot4D is a ground-breaking foundation model that advances the capabilities of autonomous machines by using LiDAR data to comprehend and forecast the 3D dynamics of the environment across time.
NLX Raises $15M in Series A Funding. In March 2024, NLX extended its Series A funding to $15M, adding Comcast Ventures.
Triton Puzzles. Triton is an alternative open-source language that allows you to code at a higher level and compile to accelerators like GPUs. This set of puzzles is meant to teach you how to use Triton from first principles in an interactive fashion. You start with trivial examples and build your way up to real algorithms like Flash Attention and quantized neural networks. These puzzles do not need to run on a GPU since they use a Triton interpreter (a minimal example kernel appears after this list).
New Breakthrough Brings Matrix Multiplication Closer to Ideal. Researchers from Tsinghua University and UC Berkeley have made great strides in matrix multiplication, introducing a novel method that has already inspired improvements. Significant time, power, and cost savings in a variety of applications could result from this development in a fundamental computer procedure. Since the previous milestone in 2010, this is the most significant advancement in lowering the computational cost of matrix multiplication.
OpenAI could release GPT-5 in a few months: Report. OpenAI could release GPT-5, the next-generation of its groundbreaking large language model, in a few months, according to a new report.
Beijing court’s ruling that AI-generated content can be covered by copyright eschews US stand, with far-reaching implications on tech’s use. The Beijing Internet Court ruled that an AI-generated image in an intellectual property dispute was an artwork protected by copyright laws. That decision is expected to have far-reaching implications for future AI copyright disputes, which could eventually benefit Chinese Big Tech companies.
Japan’s premier AI lab launches its first model. Sakana AI develops cutting-edge models for Japanese language, vision, and image generation. To evolve foundation models without costly retraining, it introduced an evolutionary model-merging method. The merged models and a description of the process are now available.
Cohere’s Command-R Enterprise Model Coming to ai.nvidia.com. The RAG-optimized Command-R model from Cohere, which is intended to help enterprises transition to large-scale production, will soon be available in the freshly released NVIDIA API catalog.
Biden-Harris Administration Announces Deal with Intel for AI Chips. Biden-Harris Administration Announces Preliminary Terms with Intel to Support Investment in U.S. Semiconductor Technology Leadership and Create Tens of Thousands of Jobs
Apple’s AI ambitions could include Google or OpenAI. The iPhone-maker is in ‘active’ talks to bring Gemini to the iPhone and has also considered using ChatGPT.
World’s first major act to regulate AI passed by European lawmakers. The European Union’s parliament on Wednesday approved the world’s first major set of regulatory ground rules to govern the mediatized artificial intelligence at the forefront of tech investment. Born in 2021, the EU AI Act divides the technology into categories of risk, ranging from “unacceptable” — which would see the technology banned — to high, medium and low hazard.
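
For a flavor of what the Triton Puzzles above build toward, here is the canonical Triton elementwise-add kernel, adapted from the standard tutorial pattern rather than from the puzzles themselves:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements              # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Launch one kernel instance per 1024-element block."""
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```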

Resources

Link description
tlm - Local CLI Copilot, powered by CodeLLaMa. tlm is your CLI companion which requires nothing except your workstation. It uses the most efficient and powerful CodeLLaMa in your local environment to provide you with the best possible command line suggestions.
Multi-node LLM Training on AMD GPUs. This blog post describes the full stack of technologies that Lamini uses to train models on AMD GPUs, including schedulers, model training software, and more.
clarity-upscaler. A state-of-the-art image upscaling tool.
musiclang_predict. Music Lang is an API and set of models that generate music.
Optimizing Technical Docs for LLMs. Capa.ai provides guidance on how to organize LLM documentation, including how to include troubleshooting FAQs, self-contained code snippets, segmentation into sub-products, and community forum creation.
lamini/earnings-calls-qa. This dataset contains transcripts of earning calls for various companies, along with questions and answers related to the companies' financial performance and other relevant topics.
Knowledge Conflicts for LLMs: A Survey. a summary of the prevalent problem of knowledge conflict that arises while working with LLMs; the survey article divides these conflicts into three categories: intra-memory, inter-context, and context-memory conflict. It also offers insights into the sources of these conflicts and possible solutions.
Enhancing RAG-based application accuracy by constructing and leveraging knowledge graphs. A practical guide to constructing and retrieving information from knowledge graphs in RAG applications with Neo4j and LangChain
How to Evaluate Your RAG System? Retrieval Augmented Generation (RAG) is a powerful technique that enhances output quality by retrieving relevant context from an external vector database. However, building and evaluating a RAG system can be challenging, especially when it comes to measuring performance. This post explores the most effective metrics for each stage of a RAG pipeline and how to use them to evaluate the whole system (a small metric sketch appears after this list).
Anthropic Prompt Library. Although Claude 3 has been widely used, these models use a somewhat different prompting technique. Anthropic has compiled a list of user prompts that are effective for a wide range of assignments and subjects.
Pretraining 16 language models on different tokenizers. One peculiarity of contemporary language modeling is that the model is not trained until the tokenizer has been trained. The second peculiar truth is that, on vast scales, vocabulary size doesn't appear to matter all that much.
LLM4Decompile. Reverse Engineering: Decompiling Binary Code with Large Language Models
Under The Hood: How OpenAI's Sora Model Works. In this blog post, we dive into some of the technical details behind Sora. We also talk about our current thinking around the implications of these video models. Finally, we discuss our thoughts around the compute used for training models like Sora and present projections for how that training compute compares to inference, which has meaningful indications for estimated future GPU demand.
Quiet-STaR. Quiet-STaR is a reasoning framework that enhances language models' capacity to produce accurate results. Code has been released along with a model that generates eight thought steps per token.
MoE-Adapters4CL. Continual learning can empower vision-language models to continuously acquire new knowledge, without the need for access to the entire historical dataset. Through extensive experiments across various settings, our proposed method consistently outperforms previous state-of-the-art approaches while concurrently reducing parameter training burdens by 60%.
LlamaGym. Fine-tune LLM agents with online reinforcement learning
Stylized image binning algorithm. A tutorial on using a JavaScript binning method to build an image-processing app that produces pixel-art-style output with customizable interactive web controls such as sliders. By averaging pixel brightness inside bins, the binning technique turns photos into stylized, pixelated artwork using parameters like bin size and spacing; the approach also covers optimizing the loops and modifying pixel data on HTML canvas elements (a NumPy equivalent of the core step appears after this list).
TorchTune. TorchTune is a native-Pytorch library for easily authoring, fine-tuning and experimenting with LLMs.
MVFA-AD. Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images
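
For the retrieval stage discussed in the RAG evaluation post above, two of the simplest metrics are hit rate (recall@k) and mean reciprocal rank. A minimal sketch, assuming you already have ranked document IDs per query (function names are illustrative):

```python
from typing import List

def hit_rate_at_k(retrieved: List[List[str]], relevant: List[str], k: int = 5) -> float:
    """Fraction of queries whose relevant document appears in the top-k results."""
    hits = sum(1 for docs, gold in zip(retrieved, relevant) if gold in docs[:k])
    return hits / len(relevant)

def mean_reciprocal_rank(retrieved: List[List[str]], relevant: List[str]) -> float:
    """Average of 1/rank of the relevant document (0 if it never appears)."""
    total = 0.0
    for docs, gold in zip(retrieved, relevant):
        total += 1.0 / (docs.index(gold) + 1) if gold in docs else 0.0
    return total / len(relevant)

# toy usage: two queries, each with one known relevant chunk id
retrieved = [["d3", "d7", "d1"], ["d2", "d9", "d4"]]
relevant = ["d1", "d5"]
print(hit_rate_at_k(retrieved, relevant, k=3))    # 0.5
print(mean_reciprocal_rank(retrieved, relevant))  # (1/3 + 0) / 2 ≈ 0.167
```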
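The core binning step from the pixel-art tutorial above amounts to averaging brightness over fixed-size bins. The tutorial itself is in JavaScript; this is a rough NumPy equivalent of the idea, not the tutorial's code:

```python
import numpy as np

def bin_image(gray: np.ndarray, bin_size: int = 8) -> np.ndarray:
    """Downsample a grayscale image by averaging brightness inside square bins."""
    h, w = gray.shape
    h_crop, w_crop = h - h % bin_size, w - w % bin_size   # drop ragged edges
    blocks = gray[:h_crop, :w_crop].reshape(
        h_crop // bin_size, bin_size, w_crop // bin_size, bin_size)
    return blocks.mean(axis=(1, 3))                        # one value per bin

def pixelate(gray: np.ndarray, bin_size: int = 8) -> np.ndarray:
    """Blow the binned values back up so each bin renders as a flat square."""
    binned = bin_image(gray, bin_size)
    return np.kron(binned, np.ones((bin_size, bin_size)))

# toy usage on a random grayscale image
img = (np.random.rand(128, 96) * 255).astype(np.float32)
art = pixelate(img, bin_size=16)
```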

Perspectives

Link description
What I learned from looking at 900 most popular open source AI tools. Examining the GitHub stars of the most popular open-source AI tools reveals some fascinating patterns. The majority of these tools appear to be geared toward applications and infrastructure.
LLM inference speed of light. This article explores the theoretical "speed of light" limit for transformer-based language model inference. It argues that memory bandwidth matters more than compute: the ability to read weights from memory, rather than perform calculations, is the primary constraint on inference speed, and an important one to understand and optimize (a back-of-the-envelope version appears after this list).
AI is bad/good actually. This article's author suggests eschewing the nebulous good/bad continuum and instead using terminology like "harmful," "helpful," "capable," and "incapable" to distinguish AI conversations. For them, AI is capable yet potentially dangerous because of unresolved problems like bias amplification and copyright infringement. Using these more precise terms, the author asks readers to explain their own opinions on AI.
Captain's log: the irreducible weirdness of prompting AIs. A wealth of free AI and machine learning tools can be found on the new companion website, More Useful Things. These resources highlight the amusing and useful ways in which AI-generated prompts, such as creative scenarios, can surpass human-crafted ones in tasks like solving mathematical puzzles. For more consistent prompting outcomes, the experiment emphasizes the value of adding context, few-shot learning, and chain-of-thought strategies. Though organized prompting is still an evolving art with considerable potential benefits, prompting as a talent may become less important as AI models advance and get better at inferring user intent.
AI Prompt Engineering Is Dead, Long live AI prompt engineering. According to recent studies, as AI and machine learning models get better at optimizing their own prompts, human prompt engineers might become outdated. Prompts produced by algorithms can be strange but powerful; they exceed those created by humans and significantly cut down on optimization time. Despite the potential of automatically adjusted prompts, experts predict that the need for occupations related to prompts will change rather than vanish, maybe taking the form of new positions like LLMOps (Large Language Model Operations).
The Road to Biology 2.0 Will Pass Through Black-Box Data. This year marks perhaps the zenith of expectations for AI-based breakthroughs in biology, transforming it into an engineering discipline that is programmable, predictable, and replicable. Drawing insights from AI breakthroughs in perception, natural language, and protein structure prediction, we endeavor to pinpoint the characteristics of biological problems that are most conducive to being solved by AI techniques. Subsequently, we delineate three conceptual generations of bio AI approaches in the biotech industry and contend that the most significant future breakthrough will arise from the transition away from traditional “white-box” data, understandable by humans, to novel high-throughput, low-cost AI-specific “black-box” data modalities developed in tandem with appropriate computational methods.
"AI, no ads please": 4 words to wipe out $1tn. AI poses a huge threat to ad-based platforms by slashing how many ads we see
OpenAI’s “Own Goal”. And why it is becoming increasingly difficult to take them at their word
What if it isn't happening, AGI is not coming? No matter what appears to be happening, we always have to consider what if it isn't. What If LLMs fail to turn into AGIs? Has our quest for intelligence simply unveiled our demonstrable lack thereof? Will trillions of dollars turn unpredictable hallucination machines into reliable universal productivity tools that can do anything?
How OpenAI’s text-to-video tool Sora could change science – and society. OpenAI’s debut of its impressive Sora text-to-video tool has raised important questions.
Chatbot AI makes racist judgements on the basis of dialect. Some large language models harbor hidden biases that cannot be removed using standard methods.
Could AI-designed proteins be weaponized? Scientists lay out safety guidelines. AI tools that can come up with protein structures at the push of a button should be used safely and ethically, say researchers in the field.
Three reasons why AI doesn’t model human language. Artificial intelligence (AI) is being used to develop large language models (LLMs) with considerable success. But they should not be seen as being models of how human language works and is acquired.
So … you’ve been hacked. Research institutions are under siege from cybercriminals and other digital assailants. How do you make sure you don’t let them in?
8 Google Employees Invented Modern AI. Here’s the Inside Story. They met by chance, got hooked on an idea, and wrote the “Transformers” paper—the most consequential tech breakthrough in recent history.
Using LLMs to Generate Fuzz Generators. Claude and other LLMs are capable of producing efficient fuzzers for code parsing, automating a task that has historically required a great deal of human labor. Given that fuzzing is stochastic, LLMs seem to be a good fit for producing fuzzers, even if they are usually not precise enough for static analysis. To find and exploit code vulnerabilities, a hybrid approach that combines targeted fuzzing and LLM-driven static analysis may be promising.
First Impressions of Early-Access GPT-4 Fine-Tuning. A few weeks ago we finally got access to the GPT-4 fine-tuning API (in limited early access), and were super excited to check out how well it works. We’d been a user of OpenAI’s fine-tuned models since fine-tuning the original GPT-3 Davinci model first became available.
AI and the Future of Work. High Mensa exam scores for Anthropic's most recent AI, Claude, indicate that self-improving AI is not far off and presents both prospects and existential concerns. As seen at Klarna, where a customer support AI replaced 700 workers, machine learning is already eliminating jobs. This suggests that automation is becoming more and more common. Recent layoffs at Duolingo as a result of AI's translation capabilities highlight this change and the increasing influence of AI on the nature of work in the future.
Two years later, deep learning is still faced with the same fundamental challenges. Gary Marcus revisits his forecasts two years after writing a pessimistic AI paper, and he maintains his original mistrust. Even with breakthroughs like GPT-4, basic problems like true understanding and reliable AI are still unsolved. Marcus draws the conclusion that multidisciplinary cooperation is essential to achieving AGI and that increasing data and processing capacity alone won't be enough.
From 0 to 10 million users in four years. In just four years, the AI-powered writing tool Copy.ai has amassed an amazing 10 million users.
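
A back-of-the-envelope version of the argument in the "speed of light" post above: during single-stream decoding, every weight must be read from memory at least once per generated token, so memory bandwidth divided by model size bounds tokens per second. The numbers below are illustrative, not taken from the article:

```python
def max_tokens_per_second(n_params: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on single-batch decode speed when weight reads dominate."""
    bytes_per_token = n_params * bytes_per_param      # all weights touched once per token
    return bandwidth_gb_s * 1e9 / bytes_per_token

# e.g. a 7B-parameter model in fp16 on a GPU with ~1 TB/s of memory bandwidth
print(max_tokens_per_second(7e9, 2, 1000))   # ≈ 71 tokens/s at best
```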

meme-of-the-week

Back to index

ML news: Week 11 - 17 March

Research

Link description
Yi: Open Foundation Models by 01.AI. One of the most capable open language models for some time has been the Yi model. The team has published a report that offers significant new information about how they collect data and train their models.
From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models. This research uses translation to enhance safety measures in situations when direct data is not available, so taking on the task of minimizing dangerous material in AI across many languages.
Plum: Prompt Learning using Metaheuristic. In this research, a broad class of more than 100 discrete optimization techniques known as metaheuristics is presented as a potent tool for enhancing rapid learning in big language models.
ViewFusion: Towards Multi-View Consistency via Interpolated Denoising. A new technique called ViewFusion aims to enhance the way diffusion models produce images from fresh angles while maintaining the consistency of the images from one view to another.
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap. reveals that there is a reasoning gap between the current models and the proposed functional benchmarks for evaluating the reasoning abilities of LLMs, ranging from 58.35% to 80.31%. However, the authors also note that these gaps can be closed with more advanced prompting techniques.
Can Large Language Models Reason and Plan? A recent position paper covers the subject of reasoning and planning for LLMs. The author's conclusion, in brief: "In summary, I don't have any strong evidence from anything I've read, checked, or done to suggest that LLMs engage in typical reasoning or planning. Instead, they use web-scale training to perform a type of universal approximate retrieval, which is sometimes confused for reasoning abilities, as I have explained."
KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents. we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents.
Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation. The new Stealing Stable Diffusion (SSD) method improves monocular depth estimation in challenging settings such as low-light or rainy conditions.
VideoElevator : Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models. Using the advantages of text-to-image models, VideoElevator presents a unique method that improves text-to-video diffusion models. Videos with better frame quality and text alignment are produced by dividing the improvement process into two parts: fine-tuning temporal motion and improving spatial quality. This is known as the plug-and-play approach.
Face2Diffusion for Fast and Editable Face Personalization. Face2Diffusion is a method for fast and editable face personalization in text-to-image diffusion models, designed to insert a target identity into generated images while preserving editability.
Stealing Part of a Production Language Model. By querying their public APIs, you can recover parts of closed language models, such as the embedding projection layer. A modest budget of less than $2,000 is enough to do this.
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling. A DNA sequence prediction model built on Mamba, the Transformer alternative. For a small model, it is remarkably powerful and efficient.
V3D: Video Diffusion Models are Effective 3D Generators. In order to improve 3D object production, this research presents a revolutionary method that creates detailed, high-quality objects from a single photograph.
A generalist AI agent for 3D virtual environments. We present new research on a Scalable Instructable Multiworld Agent (SIMA) that can follow natural-language instructions to carry out tasks in a variety of video game settings
SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces. By concentrating on linear memory consumption, this study overcomes the memory limitations of conventional attention-based diffusion models and presents a novel method for producing videos using state-space models (SSMs). As tested with the UCF101 and MineRL Navigate datasets, SSMs allow the generation of lengthier video sequences with competitive quality.
SemCity: Semantic Scene Generation with Triplane Diffusion. SemCity transforms 3D scene production by emphasizing real-world outdoor environments—a problem that is sometimes disregarded because of how difficult and sparse outdoor data may be.
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM. This study demonstrates how to train several models and combine them into a single Mixture-of-Experts model.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. It is difficult to evaluate language models that have been taught to code. The majority of people utilize OpenAI's HumanEval. Some open models, nevertheless, appear to overfit this standard. Coding performance may be measured while reducing contamination issues with LiveCodeBench.
Evil Geniuses: Delving into the Safety of LLM-based Agents. 'Evil Geniuses' is a virtual squad that researchers utilized in a recent study to examine the safety of LLMs. They discovered that these AI agents are less resistant to malevolent attacks, give more nuanced answers, and make it more difficult to identify improper responses.
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions. In this work, a novel backbone architecture called ViT-CoMer is presented, which improves on Vision Transformers (ViT) for dense prediction tasks without requiring pre-training.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training. Apple presented a multimodal model and discussed in detail how it was trained.

News

Link description
OpenAI announces new members to board of directors. Dr. Sue Desmond-Hellmann, Nicole Seligman, Fidji Simo join; Sam Altman rejoins board
So long and thanks for all the pixels: Nvidia reportedly retiring the GTX brand for good. Nvidia has stopped producing GPUs based on its Turing architecture. The last of them included the likes of the GTX 1660, 1650, and 1630 series of GPUs. Once remaining stocks sell, they'll be gone and with them, the "GTX" brand itself, leaving all Nvidia gaming graphics cards as "RTX" models.
Google’s upcoming Tensor G4 Chip set to rival Snapdragon 8 Gen 4 and Apple A18 Pro. Let’s say you’re a smartphone manufacturer aiming to develop a new model. You have two options: partner with an established chipmaker like Qualcomm or MediaTek or follow the path of Apple by designing your own custom chipset. Google has taken a similar approach, developing its in-house Tensor processors. Recent information suggests the Pixel 9 will feature the Tensor G4 chipset, promising improved heat and power management for an enhanced user experience.
Microsoft may debut its first 'AI PCs' later this month. A report suggests an OLED Surface Pro 10 and Surface Laptop 6 are imminent.
Looks like we may now know which OpenAI execs flagged concerns about Sam Altman before his ouster. Two OpenAI execs raised concerns about Sam Altman before his ouster, The New York Times reported. The outlet reported that the company's chief technology officer, Mira Murati, played a key role. Altman returned as CEO in days, leaving many unanswered questions about what happened.
Cloudflare announces Firewall for AI. Today, Cloudflare is announcing the development of a Firewall for AI, a protection layer that can be deployed in front of Large Language Models (LLMs) to identify abuses before they reach the models.
Google announces they are tackling spammy, low-quality content on Search. We’re making algorithmic enhancements to our core ranking systems to ensure we surface the most helpful information on the web and reduce unoriginal content in search results. We’re updating our spam policies to keep the lowest-quality content out of Search, like expired websites repurposed as spam repositories by new owners and obituary spam.
This week, xAI will open-source Grok. Official tweet of Elon Musk
Covariant is building ChatGPT for robots. The UC Berkeley spinout says its new AI platform can help robots think more like people. Covariant this week announced the launch of RFM-1 (Robotics Foundation Model 1).
AI solves huge problem holding back fusion power. Princeton researchers have trained an AI to predict and prevent a common problem arising during nuclear fusion reactions — and they think it might be able to solve other problems, too.
Midjourney bans all Stability AI employees over alleged data scraping. Midjourney blamed a near 24-hour service outage on ‘botnet-like activity’ from two accounts linked to the Stable Diffusion creator.
Microsoft compares The New York Times’ claims against OpenAI to Hollywood’s early fight against VCR. Microsoft is helping OpenAI fight back against claims of copyright infringement by The New York Times. The news outlet’s lawsuit, filed in December, seeks to hold Microsoft and OpenAI accountable for billions of dollars in damages. In a court filing on Monday, Microsoft accuses the publisher of “unsubstantiated” claims that the use of OpenAI’s technology is harming its business.
Introducing Devin, the first AI software engineer. Devin, a new system from Cognition, receives a 14% on the difficult SWE-Bench benchmark, which evaluates AI's capacity for writing code. GPT-4 received a 1.7% score. This model demonstrates excellent contextual learning skills.
Building Meta’s GenAI Infrastructure. The Llama 3 training infrastructure is described in this Meta blog article. It covers networking, storage, Pytorch, NCCL, and many enhancements. This will prepare the way for Meta's H100s to go online throughout the course of the remaining months of this year.
Physical Intelligence Raises $70M to Build AI-Powered Robots for Any Application. Pi differentiates itself by aiming to create software that can be applied across a wide range of robotics hardware.
Researchers create AI worms that can spread from one system to another. Worms could potentially steal data and deploy malware. Now, in a demonstration of the risks of connected, autonomous AI ecosystems, a group of researchers has created one of what they claim is the first generative AI worms—which can spread from one system to another, potentially stealing data or deploying malware in the process.
Perplexity brings Yelp data to its chatbot. Perplexity’s responses can source multiple Yelp reviews for that cafe you were considering, along with location data and other information.
Gemini now lets you tune and modify responses with a prompt. Google is launching “a more precise way for you to tune Gemini’s responses” on the web app. When selecting (by highlighting) a part of Gemini’s response to your prompt, a pencil/sparkle icon appears to “Modify selected text.” This opens a box with Regenerate, Shorter, Longer, and Remove options, as well as an open text field.
Microsoft’s neural voice tool for people with speech disabilities arrives later this year. At the Microsoft Ability summit today, the company is continuing to raise awareness about inclusive design.
Together AI $106M round of funding. We’ve raised $106M in a new round of financing led by Salesforce Ventures, with participation from Coatue and existing investors.
Autonomous Vehicle Startup Applied Intuition Hits $6B Valuation After $250M Series E. Autonomous vehicle software developer Applied Intuition locked up a $250 million Series E valuing the company at $6 billion, a 67% uptick in value from its previous round. The deal comes even as venture funding for autonomous vehicle-related startups has been in decline in recent years.
OpenAI CTO Says It’s Releasing Sora This Year. But now, OpenAI chief technology officer Mira Murati told the Wall Street Journal that the company will publicly release Sora "later this year."
Google now wants to limit the AI-powered search spam it helped create. Ranking update targets sites "created for search engines instead of people."
OpenAI Partners With Le Monde And Prisa Media. We have partnered with international news organizations Le Monde and Prisa Media to bring French and Spanish news content to ChatGPT.
World’s first major act to regulate AI passed by European lawmakers. The European Union’s parliament on Wednesday approved the world’s first major set of regulatory ground rules to govern artificial intelligence, the technology at the forefront of tech investment.
Figure 01 can now have full conversations with people. Figure's robots can now hold in-depth discussions with humans thanks to the integration of OpenAI's technology. While Figure's neural networks provide quick, low-level dexterous robot operations, OpenAI's models offer high-level visual and linguistic intelligence. This X post includes a video of a human conversing with a Figure robot, teaching it how to complete tasks, explaining the rationale behind the tasks, and providing a self-evaluation of the activities' effectiveness.
Claude 3 Is The Most Human AI Yet. Claude 3, Anthropic's latest AI model, is distinguished by its "warmth," which makes it a reliable collaborator on creative writing assignments. It feels more human and lifelike, balancing deep, delightful contemplation with genuinely useful output. Although technical benchmarks do not fully capture this quality, Claude 3 may well change how we work with AI in creative processes.
From Wait Times to Real-Time: Assort Health Secures $3.5 Million to Scale First Generative AI for Healthcare Call Centers. Solution Erases Long Phone Holds for Patients, Supports Overwhelmed Medical Front Desk Workers and Improves Patient Access to Physicians

Resources

Link description
DeepSpeed-FP6: The Power of FP6-Centric Serving for Large Language Models. A recent upgrade to Microsoft's robust DeepSpeed training package lets models use up to six bits per parameter. This can expedite inference by a factor of more than two.
You can now train a 70b language model at home. A fully open-source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs (RTX 3090 or 4090). The system, which combines FSDP and QLoRA, is the result of a collaboration between Answer.AI, Tim Dettmers (U Washington), and Hugging Face’s Titus von Koeller and Sourab Mangrulkar.
Retrieval-Augmented Generation for AI-Generated Content: A Survey. Gives a summary of RAG's application in several generative contexts, such as code, images, and audio, and includes a taxonomy of RAG enhancements along with citations to important works.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models.
SaulLM-7B: A pioneering Large Language Model for Law. With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens.
A Practical Guide to RAG Pipeline Evaluation (Part 1: Retrieval). Retrieval is a critical and complex subsystem of RAG pipelines. After all, the LLM output is only as good as the information you provide it, unless your app relies solely on the LLM's training data. The core of measuring retrieval is assessing whether each of the retrieved results is relevant to a given query.
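To make retrieval evaluation concrete, here is a minimal, generic sketch (not the article's tooling) that scores a retriever with hit rate and mean reciprocal rank against hand-labeled relevant document IDs; the query data below is made up for illustration.

```python
# Minimal sketch of per-query retrieval metrics for a RAG pipeline (illustrative only).
# `retrieved` and `relevant` are hypothetical lists/sets of document IDs.

def hit_rate(retrieved: list[str], relevant: set[str]) -> float:
    """1.0 if any relevant document appears in the retrieved list, else 0.0."""
    return float(any(doc_id in relevant for doc_id in retrieved))

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant document, 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

queries = [
    {"retrieved": ["d3", "d7", "d1"], "relevant": {"d1"}},
    {"retrieved": ["d2", "d9", "d4"], "relevant": {"d5"}},
]
print("hit rate:", sum(hit_rate(q["retrieved"], q["relevant"]) for q in queries) / len(queries))
print("MRR:", sum(reciprocal_rank(q["retrieved"], q["relevant"]) for q in queries) / len(queries))
```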
C4AI Command-R. C4AI Command-R is a research release of a 35 billion parameter highly performant generative model. Command-R is a large language model with open weights optimized for a variety of use cases including reasoning, summarization, and question-answering. Command-R has the capability for multilingual generation evaluated in 10 languages and highly performant RAG capabilities.
Artificial Intelligence Controller Interface (AICI). The Artificial Intelligence Controller Interface (AICI) lets you build Controllers that constrain and direct the output of a Large Language Model (LLM) in real-time. Controllers are flexible programs capable of implementing constrained decoding, dynamic editing of prompts and generated text, and coordinating execution across multiple, parallel generations.
US Public Domain Books (English). This dataset contains more than 650,000 English books (~ 61 billion words) presumed to be in the public domain in the US which were digitized by the Internet Archive and cataloged as part of the Open Library project.
transformer-debugger. Transformer Debugger (TDB) is a tool developed by OpenAI's Superalignment team with the goal of supporting investigations into specific behaviors of small language models. The tool combines automated interpretability techniques with sparse autoencoders.
VideoMamba. VideoMamba is a technology that effectively manages global dependencies and local redundancy to tackle the challenges of video interpretation.
FastV. FastV is a plug-and-play inference acceleration method for large vision language models relying on visual tokens. It could reach a 45% theoretical FLOP reduction without harming the performance through pruning redundant visual tokens in deep layers.
Maximizing training throughput using PyTorch FSDP. Working together, teams from IBM and Meta achieved 57% MFU (Model FLOPs Utilization) while training large models in parallel on big A100 and H100 clusters.
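For readers unfamiliar with the metric, MFU is usually estimated with the common 6 x parameters x tokens-per-second approximation divided by the hardware's peak FLOP/s. The sketch below uses made-up throughput numbers purely to illustrate the arithmetic; it is not IBM's or Meta's actual measurement.

```python
# Back-of-the-envelope MFU estimate. Assumptions (not from the linked post):
# the common 6*N FLOPs-per-token approximation and an A100 BF16 peak of ~312 TFLOP/s.
params = 70e9              # model parameters N
tokens_per_sec = 420       # hypothetical per-GPU training throughput
peak_flops = 312e12        # per-GPU peak, A100 BF16 dense

achieved_flops = 6 * params * tokens_per_sec   # approx. FLOP/s actually used
mfu = achieved_flops / peak_flops
print(f"MFU ~ {mfu:.1%}")                      # ~56% with these toy numbers
```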
MoAI. MoAI is a new large language and vision model that integrates auxiliary visual data from specific computer vision tasks to improve upon existing models.
superopenai: logging and caching superpowers for the openai sdk. superopenai is a minimal convenience library for logging and caching LLM requests and responses for visibility and rapid iteration during development.
TripoSR. TripoSR, a state-of-the-art open-source model for fast feedforward 3D reconstruction from a single image, collaboratively developed by Tripo AI and Stability AI.
Exploring Alternative UX Patterns for GenAI Interfaces. In the rapidly evolving landscape of GenAI interfaces, it is crucial to venture beyond the established norms. The current dominance of Quick Actions and Multi-Turn engagement patterns in these interfaces, while effective in many cases, should not limit our imagination or hinder the potential for innovation.
rerankers. Rerankers are an important part of any retrieval architecture, but they're also often more obscure than other parts of the pipeline. rerankers seeks to address this problem by providing a simple API for all popular rerankers, no matter the architecture.
skyvern. Skyvern automates browser-based workflows using LLMs and computer vision. It provides a simple API endpoint to fully automate manual workflows, replacing brittle or unreliable automation solutions.
Licensing AI Means Licensing the Whole Economy. Because artificial intelligence is a process that is essential to many different economic uses, it is not possible to regulate it like a physical thing.
Enhancing RAG-based application accuracy by constructing and leveraging knowledge graphs. A practical guide to constructing and retrieving information from knowledge graphs in RAG applications with Neo4j and LangChain
Pricing sheet with all popular token-based pricing providers and the top-performing models. Pricing and comparison between different LLMs.

Perspectives

Link description
Winning Strategies for Applied AI Companies. Key Success Factors after reviewing over 70 companies that have raised at least $7M
AI startups require new strategies: This time it’s actually different. The typical dynamics between startups and incumbents do not apply in AI as they did in previous technology revolutions like mobile and the Internet. Ignore this at your peril.
The GPT-4 barrier has finally been broken. Four weeks ago, GPT-4 remained the undisputed champion: consistently at the top of every key benchmark, but more importantly the clear winner in terms of “vibes”. Today that barrier has finally been smashed. We have four new models, all released to the public in the last four weeks, that are benchmarking near or even above GPT-4.
Embrace AI to break down barriers in publishing for people who aren’t fluent in English. E. M. Wolkovich describes having a paper rejected because of an unfounded accusation that ChatGPT was used to write it. We think that both the rejection and the bias against the use of artificial intelligence (AI) in scientific writing are misguided.
Why scientists trust AI too much — and what to do about it. Some researchers see superhuman qualities in artificial intelligence. All scientists need to be alert to the risks this creates.
The Future of Poetry. Questions about whether poems were authored by humans or artificial intelligence (AI) were given to 38 AI experts and 39 English experts. First prize went to The Human, followed by Bard, ChatGPT-4, and Claude in that order, for both writing quality and the ability to deceive respondents into thinking that the poetry was written by a human. The fact that English specialists were far better at identifying which poems were composed by AI suggests that they should be involved more in the development of upcoming AI systems.
Barack Obama on AI, free speech, and the future of the internet. The former president joined me on Decoder to discuss AI regulation, the First Amendment, and of course, what apps he has on his home screen.
Top AIs still fail IQ tests - When asked to read image-based questions. According to recent testing, sophisticated AI models such as ChatGPT-4 and Google's "Gemini Advanced" do poorly on visual IQ tests, receiving lower-than-average scores. Although ChatGPT-4 exhibits mediocre pattern recognition abilities, it misidentifies objects visually and makes logical mistakes, indicating a considerable difference in comparison to human intellect. These results suggest that the development of universally intelligent AI systems may still be some way off.
The Top 100 Gen AI Consumer Apps. Over 40% of the top web products are new, having entered the top 50 in the last six months, according to Andreessen Horowitz's most recent consumer analysis on the top 100 Gen AI consumer apps.
This Nvidia Cofounder Could Have Been Worth $70 Billion. Instead, He Lives Off The Grid. If Curtis Priem, Nvidia’s first CTO, had held onto all his stock, he’d be the 16th richest person in America. Instead, he sold out years ago and gave most of his fortune to his alma mater Rensselaer Polytechnic Institute.
How to thrive in a crowded enterprise AI market. At a Lightspeed event, Arvind Jain, CEO of Glean, spoke on the difficulties and solutions facing corporate AI startups. He emphasized the need to provide genuine business value, being tenacious in hiring, and placing a higher priority on product quality than speed and cost. Jain also emphasized how privacy and security issues have slowed down the deployment of generative AI tools in businesses. Glean wants to become a widely used workplace AI platform that completely transforms how people work by becoming firmly integrated into organizational operations.
As AI tools get smarter, they’re growing more covertly racist, experts find. ChatGPT and Gemini discriminate against those who speak African American Vernacular English, report shows

meme-of-the-week

Back to index

ML news: Week 4 - 10 March

Research

Link description
HyperAttention: Long-context Attention in Near-Linear Time. It's well accepted, though only informally verified, that HyperAttention is key to the success of Gemini's incredible 1 million+ token context window.
Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning. This study attempts a theoretical explanation of why muP hyperparameter transfer works. According to its authors, the largest eigenvalue of the training loss Hessian is independent of the network's width and depth.
WebArena: A Realistic Web Environment for Building Autonomous Agents. The community is excited by the possibility of agents handling a range of digital tasks. Yet even the most advanced general-purpose models struggle to complete jobs that humans succeed at more than 70% of the time. It is becoming evident that these tasks may require carefully trained, specialized models.
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models. Latent space smoothness in text-to-image diffusion models is a problem that is addressed by a novel method called Smooth Diffusion. With this technique, even little changes in input will result in a steady and progressive alteration of the visuals.
Rethinking Inductive Biases for Surface Normal Estimation. A technique called DSNIE significantly enhances monocular surface normal estimation, which finds use in various computer graphics fields.
CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition. CricaVPR presents a revolutionary method that focuses on the relationships between many photos, even when they are taken in various situations, in order to improve visual place identification.
Empowering Large Language Model Agents through Action Learning. investigates open-action learning for language agents using an iterative learning strategy that uses Python functions to create and improve actions; on each iteration, the proposed framework (LearnAct) modifies and updates available actions based on execution feedback, expanding the action space and improving action effectiveness; the LearnAct framework was tested on Robotic planning and AlfWorld environments, showing 32% improvement in agent performance in AlfWorld when compared to ReAct+Reflexion.
PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval. demonstrates how to use LLMs to integrate several approaches, such as retrieval augmentation, fine-tuning, tool utilization, and more; while the suggested framework is used in the context of urban and spatial planning, many of the insights and useful advice are applicable to other fields as well.
Evo: Long-context modeling from molecular to genome scale. Introducing Evo, a long-context biological foundation model based on the StripedHyena architecture that generalizes across the fundamental languages of biology: DNA, RNA, and proteins. Evo is capable of both prediction tasks and generative design, from molecular to whole genome scale (over 650k tokens in length). Evo is trained at a nucleotide (byte) resolution, on a large corpus of prokaryotic genomic sequences covering 2.7 million whole genomes.
Resonance RoPE: Improving Context Length Generalization of Large Language Models. To assist LLMs in comprehending and producing text in longer sequences than they were first trained on, researchers have created a new method dubbed Resonance RoPE. By using less processing power, our approach outperforms the current Rotary Position Embedding (RoPE) technique and improves model performance on lengthy texts.
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World. The All-Seeing Project V2 introduces the ASMv2 model, which blends text generation, object localization, and understanding the connections between objects in images.
GPQA: A Graduate-Level Google-Proof Q&A Benchmark. A formidable challenge is offered by a new dataset named GPQA, which has 448 difficult multiple-choice questions covering physics, chemistry, and biology. Even domain specialists have difficulty, scoring only about 65% accuracy, while non-experts reach just 34%. Advanced AI systems such as GPT-4 reach only about 39% accuracy. The goal of this dataset is to support techniques for monitoring AI performance on challenging scientific problems.
SURE: SUrvey REcipes for building reliable and robust deep networks. SURE is a revolutionary strategy that integrates multiple approaches to increase the accuracy of deep neural network uncertainty predictions, particularly for image classification applications.
Stable Diffusion 3: Research Paper. Stable Diffusion 3 outperforms state-of-the-art text-to-image generation systems such as DALL·E 3, Midjourney v6, and Ideogram v1 in typography and prompt adherence, based on human preference evaluations. Our new Multimodal Diffusion Transformer (MMDiT) architecture uses separate sets of weights for image and language representations, which improves text understanding and spelling capabilities compared to previous versions of SD3.
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents. These days, language models are quite good at responding to queries. As a result, the majority of benchmarks in use today are saturated. 'Researchy' questions are a new breed of open-ended questions that call for several steps to complete. The source of this specific dataset is search engine queries. It includes instances where GPT-4 had trouble responding to questions.
UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control. A novel method for improving motion quality and semantic coherence in films produced by text-to-video models is presented by UniCtrl. Employing motion injection and cross-frame self-attention approaches enhances video coherence and realism without requiring further training.
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT. With natural language queries, VTG-GPT provides a revolutionary GPT-based technique that can precisely identify particular video segments without the need for fine-tuning or training.
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training. With the same performance as OpenAI's original CLIP model, MobileClip operates seven times quicker. It may be utilized for a variety of language and visual activities on-device.
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures. Vision-RWKV provides an effective solution for high-resolution image processing by modifying the RWKV architecture from NLP for use in vision challenges.
Design2Code: How Far Are We From Automating Front-End Engineering? Turning screenshots of a design into working code is hard. This study proposes an 18B model as a baseline, and its evaluations suggest we are close to being able to do this for simple designs. GPT-4V-generated code is sometimes even preferred over human-written code.
MathScale: Scaling Instruction Tuning for Mathematical Reasoning. Researchers created two million math problems using synthetic data. After training a 7B model on them, they found it performed well compared to much larger state-of-the-art language models.
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos. The KEPP system offers a fresh method for organizing and carrying out difficult jobs. The approach, which makes use of a probabilistic knowledge network, enables the model to arrange activities logically to accomplish a goal.
KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents. KnowAgent presents an innovative method for enhancing the planning abilities of big language models through the incorporation of explicit action information. The method leads LLMs through more rational planning trajectories, which improves their performance on challenging tasks.
tinyBenchmarks: evaluating LLMs with fewer examples. This paper investigates strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. The work shows that you can reliably estimate language model performance with as few as 100 examples from popular benchmarks.
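As a rough illustration of the idea (not the paper's actual method, which selects examples far more carefully), the sketch below estimates a model's benchmark accuracy from a random 100-example subsample and reports a binomial error bar; all data here is synthetic.

```python
import math
import random

def estimate_accuracy(results, k=100, seed=0):
    """Estimate accuracy from k randomly sampled per-example outcomes (True/False)."""
    rng = random.Random(seed)
    sample = rng.sample(results, k)
    acc = sum(sample) / k
    margin = 1.96 * math.sqrt(acc * (1 - acc) / k)  # rough 95% binomial interval
    return acc, margin

# Fake per-example correctness for a 10,000-item benchmark (true accuracy ~0.72).
full_results = [random.random() < 0.72 for _ in range(10_000)]
acc, margin = estimate_accuracy(full_results, k=100)
print(f"estimated accuracy: {acc:.2f} +/- {margin:.2f}")
```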
3D Diffusion Policy. DP3 presents a novel method for imitation learning that effectively teaches robots difficult abilities by fusing diffusion strategies with 3D visual data.
Co-LLM: Learning to Decode Collaboratively with Multiple Language Models. Using an innovative approach, multiple huge language models can collaborate by alternately producing text token by token. With the use of this tactic, models are better able to apply their distinct advantages and areas of competence to a variety of activities, including following instructions, answering questions related to a given domain, and solving reasoning-based problems.

News

Link description
AI-generated images of Trump with Black voters being spread by supporters. No evidence to tie fake images, including one created by Florida radio host, to Trump campaign, BBC Panorama investigation finds
Elon Musk sues OpenAI over AI threat. OpenAI is not so open now, Musk claims, following the closed-source release of the company's artificial general intelligence technology under Microsoft.
OpenAI wants to make a walking, talking humanoid robot smarter. Figure’s founder Brett Adcock says a new partnership with OpenAI could help its robots hold conversation and learn from its mistakes over time.
MagicLab’s humanoid can toast marshmallows, fold clothes, and dance. Miniature high-torque servo actuators combined with sensitive multi-dimensional pressure sensors enabled the team to create an exceptionally dexterous hand–MagicBot.
Amazon to spend $1 billion on startups that combine AI with robots. Amazon’s $1 billion industrial innovation fund is to step up investments in companies that combine artificial intelligence and robotics, as the e-commerce giant seeks to drive efficiencies across its logistics network.
Claude 3 released. Anthropic has trained three new Claude 3 family models, the best of which surpasses the benchmark scores GPT-4 has publicly disclosed. It is a multimodal model and excels at visual tasks. Notably, Claude's coding skills have significantly improved with this release.
ChatGPT can read its answers out loud. OpenAI’s new Read Aloud feature for ChatGPT could come in handy when users are on the go by reading its responses in one of five voice options out loud to users. It is now available on both the web version of ChatGPT and the iOS and Android ChatGPT apps.
Adobe reveals a GenAI tool for music. Adobe unveiled Project Music GenAI Control, a platform that can generate audio from text descriptions (e.g. “happy dance,” “sad jazz”) or a reference melody and let users customize the results within the same workflow.
OpenAI fires back at Elon Musk in legal fight over breach of contract claims. ChatGPT maker releases emails in support of claim businessman backed plan to create for-profit unit
OpenAI and Elon Musk. In response to Elon Musk's complaint, OpenAI provided screenshots of emails between Elon Musk, Greg Brockman, Sam Altman, and Ilya Sutskever, as well as their version of events. According to the receipts, Musk thought there was little hope for OpenAI to succeed and agreed that some models should be closed-source.
Perplexity AI Reportedly Raising Additional Money At Significantly Higher Valuation Cap Than $520M. Perplexity AI, a rising star in the field of artificial intelligence, is reportedly in discussions to secure additional funding at a valuation significantly higher than its previous round.
Le Chat. Using its Mistral models, Mistral AI has introduced 'le Chat Mistral,' a new multilingual conversational assistant with an enterprise edition for companies.
Neuralink brain chip: advance sparks safety and secrecy concerns. Elon Musk announced this week that his company’s brain implant has allowed a person to move a computer mouse with their mind.
Ex-Google engineer arrested for alleged theft of AI secrets for Chinese firms. Linwei Ding, facing four counts of theft of trade secrets, accused of transferring confidential information to his personal account
Mistral x Snowflake. Snowflake, the Data Cloud company, and Mistral AI, one of Europe’s leading providers of AI solutions, today announced a global partnership to bring Mistral AI’s most powerful language models directly to Snowflake customers in the Data Cloud.
Moondream 2 small vision language model. Moondream is a tiny language model built on SigLIP and Phi-2. The benchmark performance has been much enhanced in this second edition, which is licensed for commercial use. It is perfect for describing visuals and operating on low-end computing hardware.
Driverless startup Waymo to test self-driving vehicles with no human driver in Austin. Autonomous vehicle company Waymo will begin testing driverless cars, with no human behind the wheel, in Austin, starting Wednesday.
Google brings Stack Overflow’s knowledge base to Gemini for Google Cloud. Developer Q&A site Stack Overflow is launching a new program today that will give AI companies access to its knowledge base through a new API, aptly named OverflowAPI.
Brave’s Leo AI assistant is now available to Android users. Brave is launching its AI-powered assistant, Leo, to all Android users. The assistant allows users to ask questions, translate pages, summarize pages, create content, and more. The Android launch comes a few months after Brave first launched Leo on desktop. Brave says Leo will be available on iOS devices in the coming weeks.
Inflection-2.5. Inflection has introduced a new model to power Pi, its personal assistant. The model achieves remarkable reasoning scores on benchmarks and reaches more than 94% of GPT-4's average performance. Inflection claims training required only 40% of the compute used for GPT-4. This post offers an intriguing statistic: a typical conversation with Pi lasts 33 minutes.
Cohere and Accenture Collaborate to Accelerate Enterprise AI Adoption. Cohere and Accenture are working together to provide over 9,000 enterprise clients with Cohere's embedding technology.
Microsoft’s Mistral deal beefs up Azure without spurning OpenAI. Microsoft investing in Mistral puts the focus on its Azure model offerings.

Resources

Link description
2.4x faster Gemma + 58% less VRAM. You can now finetune Gemma 7b 2.43x faster than HF + Flash Attention 2 with 57.5% less VRAM use. When compared to vanilla HF, Unsloth is 2.53x faster and uses 70% less VRAM.
DUSt3R. With the help of this project, you may create 3D representations in GLB form by taking a few photos of a site and reconstructing it for usage in 3D applications.
Datasets for Large Language Models: A Comprehensive Survey. an extensive (more than 180 pages) review and analysis of LLM datasets.
Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey. an overview of LLMs for tabular data jobs that includes important methods, measurements, datasets, models, and optimization strategies; it also discusses unmet issues and offers suggestions for future lines of inquiry.
Using Claude 3 Opus for video summarization. Andrej Karpathy posed a challenge: write a blog post based on one of his recent long videos. Claude 3, using its long context window and some pre-processed data, completed the job; the end product is an excellent and engaging blog post.
Dual-domain strip attention for image restoration. A new technique that greatly enhances image restoration tasks is the dual-domain strip attention mechanism.
Open-Sora-Plan. This project aims to reproduce Sora (OpenAI's text-to-video model) with limited resources, and its authors hope the whole open-source community will contribute.
ML system design: 300 case studies to learn from. We put together a database of 300 case studies from 80+ companies that share practical ML use cases and learnings from designing ML systems.
orca-math-word-problems-200k . This dataset contains ~200K grade school math word problems. All the answers in this dataset are generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the Potential of SLMs in Grade School Math for details about the dataset construction.
mlx-swift-examples. Apple created the MLX framework for training AI models on Macs. This repository demonstrates how to use Swift for model training on mobile devices; an MNIST classifier can even be trained directly on an iPhone.
Text Clustering. A free and open-source text clustering tool that makes it simple and rapid to embed, cluster, and semantically label clusters. On 100k samples, the full pipeline runs in 10 minutes.
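A generic embed-then-cluster sketch in the same spirit (this is not the linked tool itself): it assumes sentence-transformers and scikit-learn are installed and uses the public all-MiniLM-L6-v2 checkpoint; the semantic labeling step with an LLM is omitted.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = [
    "How do I reset my password?",
    "Password reset link is not working",
    "What is your refund policy?",
    "Can I get my money back?",
]

# 1) Embed the texts, 2) cluster the embeddings; labeling each cluster with an LLM
#    would be the third step in a pipeline like the one described above.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, normalize_embeddings=True)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for text, label in zip(texts, labels):
    print(label, text)
```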
EasyLM. Large language models (LLMs) made easy, EasyLM is a one-stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax. EasyLM can scale up LLM training to hundreds of TPU/GPU accelerators by leveraging JAX's pjit functionality.
You can now train a 70b language model at home. Today, we’re releasing Answer.AI’s first project: a fully open-source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs (RTX 3090 or 4090). This system, which combines FSDP and QLoRA, is the result of a collaboration between Answer.AI, Tim Dettmers (U Washington), and Hugging Face’s Titus von Koeller and Sourab Mangrulkar.
Training Models at Scale. The goal of this tutorial is to provide a comprehensive overview of techniques and strategies used for scaling deep learning models and to provide a hands-on guide to implement these strategies from scratch in JAX with Flax using shard_map.
Genstruct 7B. Genstruct 7B is an instruction-generation model, designed to create valid instructions given a raw text corpus. This enables the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus.
Fructose. Fructose is a Python package to create a dependable, strongly typed interface around an LLM call.
Efficient Multi-Head Attention Implementations. Different implementations of the widely used multi-headed attention module in contemporary LLMs varied in speed by over ten times. This notebook lists a handful and compares how well they perform.
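To show the kind of comparison the notebook makes, here is a minimal sketch contrasting a naive matmul-softmax-matmul attention with PyTorch's fused F.scaled_dot_product_attention (available in PyTorch 2.0+); shapes are arbitrary and timing code is omitted.

```python
import torch
import torch.nn.functional as F

B, H, L, D = 2, 8, 1024, 64  # batch, heads, sequence length, head dim
q, k, v = (torch.randn(B, H, L, D) for _ in range(3))

# Naive implementation: explicit scores matrix, softmax, weighted sum.
scores = q @ k.transpose(-2, -1) / D**0.5
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused implementation: dispatches to FlashAttention-style kernels when available.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-4))  # same math, different speed
```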
US regulators investigate whether OpenAI investors were misled, say reports. Internal communications from CEO Sam Altman reportedly under scrutiny in SEC inquiry
Microsoft introduces Copilot AI chatbot for finance workers in Excel and Outlook. Microsoft is launching a Copilot for Finance, which it said will be able to perform a handful of common role-specific actions in Excel and Outlook.

Perspectives

Link description
On the Societal Impact of Open Foundation Models. a position paper that centers on open foundation models and discusses their advantages, disadvantages, and effects; it also suggests a framework for risk analysis and clarifies why, in certain situations, the marginal risk of these models is low. Finally, it provides a more sober evaluation of the open foundation models' effects on society.
Towards Long Context RAG. The amazing one-million-word context window that Google's Gemini 1.5 Pro has brought to the AI community has sparked a debate regarding the future viability of retrieval-augmented generation (RAG).
Aggregator’s AI Risk. The impact of the Internet, especially through Aggregators like Google and Meta, is comparable to that of the printing press on the spread of knowledge and the establishment of nation-states. However, the rise of generative AI puts the Aggregator model to the test by offering unique answers that embody particular worldviews. This could undermine the universal appeal of Aggregator economics and points to the need for a move toward personalized AI in order to preserve its dominance.
Is Synthetic Data the Key to AGI?. The caliber of training data has a major impact on how effective large language models are. By 2027, projections indicate that there will be a shortage of high-quality data. A possible answer to this problem is synthetic data generation, which could change internet business models and emphasize the significance of fair data access and antitrust laws.
AI Research Internship Search as a CS PhD Student. Tips and thoughts from my relatively successful summer research internship hunt during third-year Computer Science PhD study.
How AI Could Disrupt Hollywood. New platforms and tools may allow a person to create a feature-length film from their living room. But can they really compete with the studios?
Training great LLMs entirely from ground zero in the wilderness as a startup. Reka's creator and well-known GPU critic Yi Tay detailed their experience building very powerful language models outside of Google in a blog post. The primary obstacles stem from hardware instability and cluster issues. They also had difficulties with software maturity.
Claude 3 Is The Most Human AI Yet. Anthropic's Claude 3, a large language model similar to GPT-4, is notable not so much for its cost-effectiveness or benchmark test results as for its distinctly human-like, creative, and naturalistic interaction quality. This represents a major breakthrough in AI's capacity to collaborate imaginatively with writers.
Licensing AI Means Licensing the Whole Economy. AI is a vast process employing statistical approaches, and it would be impractical to control its use across all organizations. Therefore, regulating AI like a tangible commodity is incorrect. Given AI's imminent economic ubiquity, targeted regulation for particular misuses—akin to current strategies for programming or email abuses—is more successful.
Is ChatGPT making scientists hyper-productive? The highs and lows of using AI. Large language models are transforming scientific writing and publishing. However, the productivity boost that these tools bring could have a downside.
Artificial intelligence and illusions of understanding in scientific research. Why are AI tools so attractive and what are the risks of implementing them across the research pipeline? Here we develop a taxonomy of scientists’ visions for AI, observing that their appeal comes from promises to improve productivity and objectivity by overcoming human shortcomings.
AI will likely increase energy use and accelerate climate misinformation – report. Claims that artificial intelligence will help solve the climate crisis are misguided, warns a coalition of environmental groups
We Need Self-Driving Cars. Anyone rooting against self-driving cars is cheering for tens of thousands of deaths, year after year. We shouldn’t be burning self-driving cars in the streets. We should be celebrating…
Subprime Intelligence. Significant problems in OpenAI's Sora demonstrate the limitations of generative AI's comprehension. The technology presents both practical obstacles and revolutionary possibilities, as seen by its high computing needs and potential impact on the creative industry.
Sora, Groq, and Virtual Reality. A few years ago, Facebook's drive into the metaverse looked misguided, and the idea of the metaverse appeared like fiction from Ernest Cline's novel. Things feel different now. Groq's deterministic circuits streamline machine-learning algorithms for quicker processing, while Sora creates intricate video situations. The combination of these developments brings us one step closer to real-time video simulation and full-fledged virtual reality.
AI Is Like Water. For GenAI companies to have a competitive advantage, technology alone is no longer sufficient. This means that since the basic product is virtually the same, GenAI and bottled water are comparable. The primary differentiators need to originate from elements like distribution, user experience, perceived customer value, branding, and marketing.

meme-of-the-week

Back to index

ML news: Week 26 February - 3 March

Research

Link description
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. The RL algorithm REINFORCE is straightforward, well known, and easy to understand, but training it stably in simulators is a challenge; PPO is generally far more reliable and performant. REINFORCE is used for Gemini, and PPO is presumably used for GPT-4.
AlphaFold Meets Flow Matching for Generating Protein Ensembles. The protein's post-folding state can be predicted using AlphaFold. Adding invertible flow matching allows you to significantly increase modeling capability throughout the whole protein landscape.
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models. Researchers have created a new technique that focuses on "expert-level sparsification," which minimizes model size without sacrificing performance, to make LLMs more effective and user-friendly. For Mixture-of-Experts LLMs, which are strong but typically too large to manage simply, this is very helpful.
Towards Generalizable Hand-Object Interaction Denoising via Denoising Diffusion. A novel method called GeneOH Diffusion enhances models' comprehension of and ability to manipulate objects with their hands. The goal of this technique is to improve the naturalness of these interactions by fixing mistakes in hand gestures and object relationships.
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis. Sora aside, Snap Research has developed a video generation model that runs 3 times faster than the prior state of the art.
OpenCodeInterpreter. By training on a synthetic multi-turn dataset and utilizing human feedback, a model built on CodeLlama and DeepSeek Coder was able to achieve 85%+ on the HumanEval programming benchmark.
INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models. A new benchmark called INSTRUCTIR aims to improve search engines' ability to infer users' intentions. INSTRUCTIR assesses how well search engines can obey user instructions and adjust to different and evolving search needs, in contrast to existing approaches that primarily concentrate on the query itself.
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. On accuracy for API function-calling tasks, Meta's 350M-parameter language model shows strong reasoning performance, even approaching Llama 7B. Although the model is not yet available, the work on small, fixed-parameter models is worth investigating.
ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models. A new multilingual benchmark called ConceptMath is used to assess LLMs' arithmetic proficiency in both Chinese and English. It's special because it deconstructs arithmetic problems into discrete ideas, enabling a more thorough evaluation of an AI's mathematical prowess and shortcomings.
Generate What You Prefer: Reshaping Sequential Recommendation via Guided Diffusion. DreamRec proposes a novel 'learning-to-generate' approach to sequential recommendation: it generates an 'oracle' item representing the optimal next choice for the user, as opposed to the conventional approach of identifying user preferences from a mixture of positive and negative items.
FlowMDM: Seamless Human Motion Composition with Blended Positional Encodings. A novel model called FlowMDM uses text descriptions to create lengthy, continuous sequences of human movements. This groundbreaking diffusion-based model excels in accuracy and realism on important datasets by using Blended Positional Encodings to create realistic motion without the need for additional denoising stages.
VSP-LLM (Visual Speech Processing incorporated with LLMs). We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by bringing the overwhelming power of LLMs. Specifically, VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation, where the given instructions control the type of task.
Repetition Improves Language Model Embeddings. We present echo embeddings, an embedding strategy designed to address an architectural limitation of autoregressive models: that token embeddings cannot contain information from tokens that appear later in the input. Echo embeddings resolve this issue by repeating the input twice in the input to the embedding model. Our method has strong performance on MTEB and is compatible with many other methods for improving embedding models.
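A rough sketch of the echo-embedding idea follows; it uses GPT-2 purely as a stand-in autoregressive model, and the paper's exact prompt template and pooling choices differ from this simplified version.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"  # stand-in autoregressive model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

text = "Echo embeddings repeat the input so early tokens can see the whole sentence."
first = tok(text, return_tensors="pt")["input_ids"]
echoed = torch.cat([first, first], dim=1)  # the input, repeated twice

with torch.no_grad():
    hidden = model(echoed).last_hidden_state  # (1, 2*L, d)

# Pool only over the second occurrence: its tokens attended to the full first copy.
embedding = hidden[:, first.shape[1]:, :].mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```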
Range-Agnostic Multi-View Depth Estimation With Keyframe Selection. Multi-View 3D reconstruction techniques process a set of source views and a reference view to yield an estimated depth map for the latter.
ChatMusician: Understanding and Generating Music Intrinsically with LLM. Adding a modality-specific encoder to a language model is usually necessary for comprehending music. This is unstable and costly. This study demonstrated that tokenizing music into ABC notation significantly boosted music knowledge without affecting basic language proficiency.
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs. Bytedance has produced a system called MegaScale that can be used to train massively parallel large language models. It succeeded in training a 175B LLM on 12,288 GPUs with 55.2% Model FLOP utilization (MFU), which is extremely impressive. Bytedance plans to open source some aspects of the codebase.
ListT5: Listwise Reranking with Fusion-in-Decoder Improves Zero-shot Retrieval. ListT5 presents a novel reranking technique that not only increases information retrieval precision but also provides a workable solution to the issues that earlier listwise rerankers encountered.
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT. Our primary contribution is the introduction of an accurate and fully transparent open-source 0.5 billion (0.5B) parameter SLM, named MobiLlama, catering to the specific needs of resource-constrained computing with an emphasis on enhanced performance with reduced resource demands.
Accurate LoRA-Finetuning Quantization of LLMs via Information Retention. A novel method called IR-QLoRA improves quantized big language model accuracy, which makes them more appropriate for usage on low-resource devices.
Video as the New Language for Real-World Decision Making. Impressive research that presents video as a possible improvement over current methods for AI to communicate with humans. It demonstrates the use of video models as environment simulators, planners, agents, and computation engines.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. A parameter in the majority of language models is represented by 16 bits or more. This produces strong models that may be costly to operate. This study suggests a technique where each parameter is in {-1, 0, 1} and requires 1.58 bits. Performance is precisely matched by this approach up to 3B parameters. Models and codes are not yet available.
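The quantization step itself is simple to sketch: scale each weight matrix by its mean absolute value, then round to the nearest value in {-1, 0, 1}. The snippet below shows only this mapping; the training recipe (straight-through gradients, activation quantization, and so on) is not reproduced here.

```python
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Map weights to {-1, 0, +1} using an absmean scale; dequantize as w_q * scale."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

w = torch.randn(4, 4)
w_q, scale = quantize_ternary(w)
print(w_q)      # entries are only -1, 0, or 1
print(scale)    # single scaling factor kept in higher precision
```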
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models. Enhancing multi-modality foundation models such as GPT-4V in low-level visual perception tasks is the main goal of this research. The extensive study collected comments on 18,973 photos from 58,000 people and produced the Q-Pathway dataset for brightness, color, and clarity analysis.
Graph Diffusion Policy Optimization. GDPO adapts policy-gradient reinforcement learning to graph diffusion models, allowing graph generation to be optimized directly for arbitrary, even non-differentiable, reward signals.
HiGPT: Heterogeneous Graph Language Model. A method for learning across many heterogeneous graphs without requiring fine-tuning is called HiGPT. It excels at adapting to different data distributions thanks to its integration with a unique graph tokenizer and a large corpus of graph commands.
PromptMM: Multi-Modal Knowledge Distillation for Recommendation with Prompt-Tuning. PromptMM uses Multi-modal Knowledge Distillation to enhance recommendation systems on sites like Amazon and TikTok. In order to avoid overfitting, it eliminates errors in user preferences and streamlines systems by extracting key characteristics from different kinds of content (textual, audio, or visual).
Genie: Generative Interactive Environments. We introduce Genie, a foundation world model trained from Internet videos that can generate an endless variety of playable (action-controllable) worlds from synthetic images, photographs, and even sketches.
UniVS: Unified and Universal Video Segmentation with Prompts as Queries. With a unique prompt-based methodology, UniVS is a unified architecture for video segmentation that addresses the difficulties of diverse segmentation jobs. UniVS removes the requirement for heuristic inter-frame matching by utilizing prompt characteristics as queries and providing a target-wise prompt cross-attention layer. This allows UniVS to adapt to various video segmentation settings with ease.
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis. With a deep semantic knowledge of pictures, the Coarse-to-Fine Latent Diffusion (CFLD) method avoids overfitting and offers a novel Pose-Guided Person Image Synthesis technique that overcomes the drawbacks of previous models.
Evaluating Quantized Large Language Models. Large language models like OPT and LLaMA2 can be rendered more compute- and memory-efficient through the use of post-training quantization.
Representing 3D sparse map points and lines for camera relocalization. With minimal memory and processing power, this study presents a novel method for 3D mapping and localization that processes both point and line information using a lightweight neural network, greatly improving pose accuracy.
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving. Drive-WM can produce high-quality multiview films to forecast future events, allowing self-driving cars to make more intelligent and safe driving choices.
Do Large Language Models Latently Perform Multi-Hop Reasoning?. This study delves into the fascinating world of Large Language Models (LLMs) and their ability to engage in multi-hop reasoning, akin to human thought processes. By crafting intricate prompts like "The mother of the singer of 'Superstition' is", researchers probe how LLMs navigate complex queries. They uncover compelling evidence suggesting that these models can indeed perform multi-hop reasoning, often relying on a bridge entity like Stevie Wonder to connect disparate pieces of information. The findings highlight both the strengths and limitations of LLMs in this regard, offering valuable insights for their future development and application.

News

Link description
Microsoft reportedly makes AI server gear to cut Nvidia dependence. Microsoft is creating its own AI server hardware to intensify actions to decrease its dependency on Nvidia, according to a source familiar with the matter speaking to The Information.
‘Embarrassing and wrong’: Google admits it lost control of image-generating AI. Google has apologized (or come very close to apologizing) for another embarrassing AI blunder this week, an image-generating model that injected diversity into pictures with a farcical disregard for historical context. While the underlying issue is perfectly understandable, Google blames the model for “becoming” oversensitive.
Is OpenAI the next challenger trying to take on Google Search? The Information says OpenAI is working on web search (partially powered by Bing) that would more directly compete with Google. It’s unclear if it would be standalone, or a part of ChatGPT.
Transformer Circuits Thread - Updates - February 2024. Anthropic's researchers have been developing a circuits-based approach to understanding deep neural networks. These circuits seek to pinpoint the model components used for particular capabilities. Every month, the research team publishes an update on the experiments they conducted and what they learned.
A new tool targets voter fraud in Georgia – but is it skirting the law?. A tech company supported by Trump’s former lawyer is injecting chaos into the state’s vote-counting process
Democratic political operative admits he commissioned robocall of AI Biden. Steve Kramer said ‘easy-to-use technology’ enabled him to send automated call while New Orleans magician says he was paid $150 to make it
Mistral Large. Mistral Large is our new cutting-edge text generation model. It reaches top-tier reasoning capabilities. It can be used for complex multilingual reasoning tasks, including text understanding, transformation, and code generation. Mistral Large achieves strong results on commonly used benchmarks, making it the world's second-ranked model generally available through an API (next to GPT-4)
Scale AI to set the Pentagon’s path for testing and evaluating large language models. The company will create a comprehensive T&E framework for generative AI within the Defense Department.
DatologyAI is building tech to automatically curate AI training datasets. Morcos’ company, DatologyAI, builds tooling to automatically curate datasets like those used to train OpenAI’s ChatGPT, Google’s Gemini, and other similar GenAI models. The platform can identify which data is most important depending on a model’s application (e.g. writing emails), Morcos claims, in addition to ways the dataset can be augmented with additional data and how it should be batched, or divided into more manageable chunks, during model training.
Bay Bridge: A supercomputer built for startups. With flexible short-term renting options, San Francisco Compute Company is now providing the lowest-cost H100 training clusters in the world to customers who require intensive computing for AI model training but do not want to commit to long-term agreements. Its first cluster, Angel Island, is operational at the moment, and Bay Bridge will follow shortly. The unique business strategy of SF Compute places a premium on cost and accessibility for AI entrepreneurs without requiring long-term commitments.
mlabonne/AlphaMonarch-7B. AlphaMonarch-7B is a new DPO merge that retains all the reasoning abilities of the very best merges and significantly improves its conversational abilities. Kind of the best of both worlds in a 7B model.
LazyAxolotl. This notebook allows you to fine-tune your LLMs using Axolotl and Runpod
Apple’s electric car project is dead. After a decade of work, the company is reportedly giving up on its ambitious effort to create an autonomous electric car.
Expressive Whole-Body Control for Humanoid Robots. UCSD researchers trained robust, socially inclined, expressive policies for humanoid robots. Their unchoreographed dancing on grass videos is quite amazing.
Meta plans launch of new AI language model Llama 3 in July, The Information reports. Meta Platforms (META.O) is planning to release the newest version of its artificial-intelligence large language model, Llama 3, in July, which would give better responses to contentious questions posed by users, The Information reported on Wednesday.
Tim Cook Says Apple Will 'Break New Ground' in Generative AI. Cook said that the company will "break new ground" in generative AI in 2024. "We believe it will unlock transformative opportunities for our users," said Cook.
Elon Musk sues OpenAI accusing it of putting profit before humanity. Lawsuit says chief executive Sam Altman’s deal with Microsoft has broken organization’s mission
Figure raises $675M at $2.6B valuation. In order to continue developing humanoid robots, Figure, a robotics startup, has secured $675 million from a number of significant investors, including OpenAI.

Resources

Link description
Pearl - A Production-ready Reinforcement Learning AI Agent Library. Pearl is a new production-ready Reinforcement Learning AI agent library open-sourced by the Applied Reinforcement Learning team at Meta. Pearl enables the development of Reinforcement Learning AI agents.
Large Language Models for Data Annotation: A Survey. This is a curated list of papers about LLM for Annotation
Automotive Object Detection with Spiking Neural Networks (SNNs). Spiking Neural Networks are a novel and effective model class for autonomous cars, attaining high performance with up to 85% less energy.
Berkeley function calling leaderboard. When a language model can access resources through synthesized functions to carry out commands, this is known as function calling. To pass to such functions, the parameters must be properly synthesized. The purpose of this leaderboard is to evaluate the model's performance on function-calling tasks.
FuseChat. FuseChat is a novel approach to combine the advantages of many huge language models into a single, more potent model without having to pay expensive training fees again.
ShieldLM . ShieldLM is a bilingual (Chinese and English) safety detector that mainly aims to help detect safety issues in LLMs' generations. It aligns with general human safety standards, supports fine-grained customizable detection rules, and provides explanations for its decisions.
Enable decision-making based on LLM-based simulations. An open-source project called Simulatrex is dedicated to GABM or generative agent-based modeling. Large language models are used to provide more accurate simulations.
Training-Free Long-Context Scaling of Large Language Models. Dual chunk attention is a training-free and effective method for extending the context window of large language models (LLMs) to more than 8x times their original pre-training length. We refer to the Llama-based model with dual chunk attention as ChunkLlama.
DPO to encourage descriptiveness. A minimal code set up with TRL to tune a model to be more descriptive.
Shape suffixes for ML coding. The readable nature of shapes in tensors is significantly enhanced by a coding style at Character AI.
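The convention is easy to show in a few lines: encode each tensor's dimensions in its variable name so mismatched shapes stand out when reading the code. The example below is a minimal illustration, not code from the post.

```python
import torch

B, L, D, V = 4, 128, 256, 1000  # batch, length, model dim, vocab size

tokens_BL = torch.randint(0, V, (B, L))
embed_VD = torch.randn(V, D)

hidden_BLD = embed_VD[tokens_BL]       # lookup: (B, L) indices into (V, D) -> (B, L, D)
logits_BLV = hidden_BLD @ embed_VD.T   # weight-tied readout: (B, L, D) @ (D, V) -> (B, L, V)
print(logits_BLV.shape)
```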
Getting started with MAX Developer Edition. To drastically reduce complexity and accelerate AI implementations, Modular developed the MAX toolset. It is currently accessible.
Bonito. Bonito is an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. This repo is a lightweight library for Bonito to easily create synthetic datasets built on top of the Hugging Face transformers and vllm libraries.
Awesome-LLMs-for-Video-Understanding. This repository collects helpful resources on video understanding with large language models.
Mist text to speech. A new text-to-speech technology called Rime has strong conversational capabilities. This model may incorporate "ums" and realistic pauses, in contrast to earlier ones.
Add your own Ollama models. Guidelines for contributing your own models to the Ollama repository for public usage.
2x speed up HF inference with static KV Cache. Faster inference unlocks new use cases. This code shows how to accelerate Hugging Face inference for Llama models with a static KV cache (see the static-cache sketch after this list).
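A minimal illustration of the shape-suffix style mentioned in the entry above; the dimension letters (B for batch, T for tokens, D for model width, V for vocabulary) and the tiny sizes are assumptions for this sketch, not code taken from Character AI.

```python
import torch

# Assumed dimension names for this sketch: B = batch, T = tokens, D = model dim, V = vocab.
B, T, D, V = 8, 128, 512, 32000

tokens_BT = torch.randint(0, V, (B, T))   # integer token ids
embed_VD = torch.randn(V, D)              # embedding table
x_BTD = embed_VD[tokens_BT]               # lookup: the result's shape is visible in the name
logits_BTV = x_BTD @ embed_VD.T           # tied output projection: (B, T, D) @ (D, V)
# A mismatch such as `x_BTD @ embed_VD` fails loudly, and the suffixes make the bug obvious.
```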
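A hedged sketch of the static-KV-cache speedup referenced above. The `cache_implementation = "static"` generation setting and its pairing with `torch.compile` require a recent transformers release, and the model id is only an example; treat this as an illustration rather than the linked post's exact code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; any Llama-style model you can access works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# A static cache pre-allocates the KV tensors with fixed shapes, so torch.compile can build
# one optimized graph instead of recompiling as the sequence grows.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tok("The key to fast inference is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```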

Perspectives

Link description
Sam Altman Wants $7 Trillion. In order to meet the fast-rising costs of developing generative AI models such as GPT, Sam Altman has proposed a $7 trillion budget, indicating an exponential increase in resources required for further iterations. This goal highlights a critical juncture in the development of AI, striking a balance between the quickening pace of scientific improvement and its wider effects on safety and societal preparedness.
Ten AI Insights from Databricks, Anyscale, and Microsoft. This article features interviews with founders of AI-forward firms, including their perspectives on artificial general intelligence (AGI), how to approach LLMs, and basic strategies for entrepreneurs integrating AI into their products.
What the EU’s tough AI law means for research and ChatGPT. The EU AI Act is the world’s first major legislation on artificial intelligence and strictly regulates general-purpose models.
Online images amplify gender bias. We find that gender bias is consistently more prevalent in images than text for both female- and male-typed categories. We also show that the documented underrepresentation of women online is substantially worse in images than in text, public opinion, and US census data.
ChunkLlama. Dual chunk attention applied to Llama models: a training-free method for extending the context window of LLMs to more than 8x their original pre-training length.
distilabel. AI Feedback (AIF) framework for building datasets with and for LLMs.
StarCoder2. StarCoder2-15B is a 15B-parameter model trained on 600+ programming languages from The Stack v2, with opt-out requests excluded.
The paradox of diffusion distillation. Diffusion models decompose a complex problem, such as image generation, into many smaller problems, such as removing a small amount of noise from an image. Single-step diffusion generation has received a lot of attention, but it appears to miss the mark. This article examines the diffusion distillation paradox and lists the various avenues of inquiry that might be pursued.

meme-of-the-week

Back to index

ML news: Week 19 - 25 February

Research

Link description
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning. Deciding which examples to employ when aligning language models with preference data is frequently difficult. This paper proposes an unexpectedly strong baseline: pick the 1,000 longest cases.
Extreme Video Compression with Pre-trained Diffusion Models. As diffusion models get better at synthesizing images and videos, their extensive "knowledge" of the world can be put to other uses. This study achieved an astounding compression rate of 0.02 bits per pixel. The key was to track perceptual similarity along the way and send a new frame from the original video only when needed.
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset. To train open-source Large Language Models in math that equal the performance of closed-source models, researchers have developed a new dataset called OpenMathInstruct-1. With 1.8 million problem-solution pairings, this innovation paves the way for more competitive and approachable AI systems for math teaching.
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. Quantizing the KV cache lets Transformer models consume far less memory at inference time. Quantization is the process of reducing floating-point precision with as little quality loss as possible (see the sketch after this list).
Pushing the Limits of Zero-shot End-to-End Speech Translation. ZeroSwot is a novel approach to speech translation (ST) that addresses data scarcity and the gap between text and speech. It trains a speech encoder using only speech recognition data and can then operate with a multilingual translation model.
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE). A novel technique called SpLiCE simplifies the complicated visual data in CLIP.
TDViT: Temporal Dilated Video Transformer for Dense Video Tasks. A novel Temporal Dilated Video Transformer (TDViT) has been created to enhance the analysis of tasks involving dense videos, like object detection in videos frame by frame.
Generative Representational Instruction Tuning. A model that creates embeddings and text has been trained and released by the Contextual team. It performs noticeably better than a single specialized model. With embedding as the output modality, the model offers an intriguing interpretation of the multi-modal trend.
LoRA+: Efficient Low-Rank Adaptation of Large Models. To improve on the current Low-Rank Adaptation (LoRA) technique for fine-tuning large models, this work introduces LoRA+. By applying different learning rates to the two low-rank adapter matrices, LoRA+ achieves better performance and faster fine-tuning without increasing the computational load.
GaussianObject: Just Taking Four Images to Get A High-Quality 3D Object with Gaussian Splatting. We propose GaussianObject, a framework to represent and render the 3D object with Gaussian splatting, that achieves high rendering quality with only 4 input images.
MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single to Sparse-view 3D Object Reconstruction. This paper presents a neural architecture MVDiffusion++ for 3D object reconstruction that synthesizes dense and high-resolution views of an object given one or a few images without camera poses.
ChatterBox: Multi-round Multimodal Referring and Grounding. A vision-language model called ChatterBox performs exceptionally well in multimodal dialogues, particularly in the recently defined job of multimodal multi-round referring and grounding.
Large language models streamline automated machine learning for clinical studies. A knowledge gap persists between machine learning developers and clinicians. Here, the authors show that the Advanced Data Analysis extension of ChatGPT could bridge this gap and simplify complex data analyses, making them more accessible to clinicians.
Extracting accurate materials data from research papers with conversational language models and prompt engineering. Efficient data extraction from research papers accelerates science and engineering. Here, the authors develop an automated approach that uses conversational large language models to achieve high precision and recall in extracting materials data.
GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis. GradSafe is a novel technique that can identify dangerous prompts in big language models without requiring a lot of training. Compared to existing approaches, it can identify dangerous prompts more accurately by examining the gradients of certain parameters.
Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition. A novel technique called Class-Aware Mask-guided (CAM) feature refinement improves text recognition in challenging environments.
Object Recognition as Next Token Prediction. An innovative approach to object recognition that uses a language decoder. Text tokens are predicted from image embeddings with a customized non-causal attention mask, which makes it possible to sample many labels in parallel efficiently.
TIER: Text and Image Encoder-based Regression for AIGC Image Quality Assessment. To evaluate the quality of the generated images, TIER makes use of both written prompts and the images that result from them.
Large Language Models for Data Annotation: A Survey. A taxonomy of techniques that use LLMs for data annotation, covering three aspects: LLM-based data annotation, evaluating LLM-generated annotations, and learning from LLM-generated annotations. It also provides an overview and a solid list of references on using LLMs for data annotation.
Generative Representational Instruction Tuning. Sets a new state of the art on MTEB, and the unification is reported to speed up RAG by 60% for long documents. It does this through generative representational instruction tuning, in which an LLM is trained to perform both generative and embedding tasks and distinguishes between them via the instructions.
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. Demonstrates that a simpler variant of REINFORCE outperforms both PPO and recently proposed alternatives such as DPO and RAFT, showing that online RL optimization can be both advantageous and inexpensive, and that many components of PPO are unnecessary in an RLHF setting.
In Search of Needles in an 11M Haystack: Recurrent Memory Finds What LLMs Miss. This study investigates how transformer-based models handle extremely long contexts. It finds that both GPT-4 and RAG performance depend heavily on the first 25% of the input, suggesting room for improved context-processing mechanisms, and reports that recurrent memory augmentation of transformer models achieves superior performance on documents of up to 10 million tokens.
When is Tree Search Useful for LLM Planning? It Depends on the Discriminator. Examines how LLMs solve multi-step problems using a framework with a generator, a discriminator, and a planning technique (such as tree search or iterative correction). It finds that planning approaches require discriminators with at least 90% accuracy, which existing LLMs do not yet exhibit, and that although tree search performs well, it is at least 10–20 times slower and therefore unsuitable for many real-world applications.
Chain-of-Thought Reasoning Without Prompting. Proposes a chain-of-thought (CoT) decoding method that elicits reasoning from pre-trained LLMs without explicit prompting and claims significant improvements over greedy decoding across reasoning benchmarks. It also finds that the presence of CoT in the decoding path increases the model's confidence in its final answer.
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. A set of free, open-source tools for writing, running, and iteratively improving code. It includes a 68K multi-turn interaction dataset, combines human feedback with code execution for dynamic code refinement, and achieves excellent results on benchmarks such as HumanEval and EvalPlus.
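To make the quantization idea behind entries such as KVQuant concrete, here is a toy per-tensor int8 round-trip in PyTorch. It only illustrates the general principle of trading floating-point precision for memory; KVQuant itself uses far more sophisticated per-channel and non-uniform schemes.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor quantization: map floats onto 255 integer levels."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

# Pretend this is one attention head's key cache: (seq_len, head_dim) in fp32.
k_cache = torch.randn(1024, 128)
q, scale = quantize_int8(k_cache)       # int8 storage is 4x smaller than fp32
k_restored = dequantize(q, scale)
print("mean abs error:", (k_cache - k_restored).abs().mean().item())
```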

News

Link description
Anthropic takes steps to prevent election misinformation. Called Prompt Shield, the technology, which relies on a combination of AI detection models and rules, shows a pop-up if a U.S.-based user of Claude, Anthropic’s chatbot, asks for voting information. The pop-up offers to redirect the user to TurboVote, a resource from the nonpartisan organization Democracy Works, where they can find up-to-date, accurate voting information.
OpenAI's next AI product could be after your job (again). OpenAI is said to be developing AI agents that automate even more complex tasks, though their launch timeline remains unknown. One AI agent is said to take over the customer’s device to perform tasks like transferring data from a document to a spreadsheet, filling out expense reports, and entering them into accounting software. The other AI agent is said to perform more research-oriented, web-based tasks, such as creating itineraries and booking flight tickets.
Our next-generation model: Gemini 1.5. In fact, we’re ready to introduce the next generation: Gemini 1.5. It shows dramatic improvements across several dimensions and 1.5 Pro achieves comparable quality to 1.0 Ultra while using less computing.
OpenAI on track to hit $2bn revenue milestone as growth rockets. Thanks in large part to ChatGPT's enormous success, OpenAI has reached an annual revenue run rate of over $2 billion, making it one of the fastest-growing tech companies.
Sam Altman wants Washington's backing for his $7 trillion AI chip venture. The OpenAI CEO is working to secure US government approval for the project as it risks raising national security and antitrust concerns, Bloomberg reported.
‘Gemini Business’ and ‘Gemini Enterprise’ plans for Google Workspace are coming. The upcoming changelog, as spotted by Testing Catalog and Dylan Roussel on X/Twitter, reveals the existence of “Gemini Business” and “Gemini Enterprise” plans. These will give “Google Workspace customers access to one of Google’s most capable AI models, 1.0 Ultra in Gemini, and enterprise-grade data protections.”
OpenAI Reaches $80 Billion Valuation In Venture Firm Deal, Report Says. OpenAI inked a deal with venture capital firm Thrive Capital that boosted its valuation to $80 billion or more, the New York Times reported, a nearly threefold increase in value from just nine months ago.
Magic raises $117m to continue code generation models. We've raised $117M to build an AI software engineer.
SoftBank Founder Masayoshi Son Aims to Raise $100 Billion for New Chip Venture, "Izanagi". Masayoshi Son, the visionary founder of SoftBank Group Corp., has set his sights on revolutionizing the semiconductor industry with the launch of Izanagi, a groundbreaking chip venture backed by a staggering $100 billion investment.
Scribe $25M Series B. To further its AI-driven platform, Scribe has secured a Series B fundraising round headed by Redpoint Ventures. This round aims to speed up the generation of visual step-by-step tutorials and enable knowledge exchange between enterprises.
Amazon AGI Team Say Their AI Is Showing “Emergent Abilities”. "Big Adaptive Streamable TTS with Emergent Abilities" (BASE TTS), a language model created by Amazon AGI researchers, exhibits "state-of-the-art naturalness" in conversational text and demonstrates language skills that it wasn't particularly trained on.
Gemma: Introducing new state-of-the-art open models. We’re releasing model weights in two sizes: Gemma 2B and Gemma 7B. Each size is released with pre-trained and instruction-tuned variants. Ready-to-use Colab and Kaggle notebooks, alongside integration with popular tools such as Hugging Face, MaxText, NVIDIA NeMo, and TensorRT-LLM, make it easy to get started with Gemma.
Reddit has a new AI training deal to sell user content. Over a decade of valuable user content is now for sale as Reddit preps to go public.
Apple Developing AI Tool to Help Developers Write Code for Apps. Apple is working on an updated version of Xcode that will include an AI tool for generating code, reports Bloomberg. The AI tool will be similar to GitHub Copilot from Microsoft, which can generate code based on natural language requests and convert code from one programming language to another.
Stable Diffusion 3. Announcing Stable Diffusion 3 in early preview, our most capable text-to-image model with greatly improved performance in multi-subject prompts, image quality, and spelling abilities.
How Bret Taylor’s new company is rethinking customer experience in the age of AI. The two founders fundamentally see AI agents as a new technology category, providing an entirely new way for customers to interact with brands to improve their overall experience.
Introducing Phind-70B – closing the code quality gap with GPT-4 Turbo while running 4x faster. We're excited to announce Phind-70B, our largest and most performant model to date. Running at up to 80 tokens per second, Phind-70B gives high-quality answers for technical topics without making users wait long enough to brew a cup of coffee. Phind-70B scores 82.3% on HumanEval, beating the latest GPT-4 Turbo (gpt-4-0125-preview) score of 81.1% in our evaluation.
Marqo Raises $12.5 Million to Help Businesses Build Generative AI Applications. Marqo has raised $12.5 million in a Series A funding round to advance the adoption of its search platform that helps businesses build generative artificial intelligence (AI) applications that are more relevant and up to date.

Resources

Link description
minbpe. Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings (see the BPE sketch after this list).
GPTScript. GPTScript is a new scripting language to automate your interaction with a Large Language Model (LLM), namely OpenAI. The ultimate goal is to create a fully natural language-based programming experience. The syntax of GPTScript is largely natural language, making it very easy to learn and use.
QWEN. We opensource our Qwen series, now including Qwen, the base language models, namely Qwen-1.8B, Qwen-7B, Qwen-14B, and Qwen-72B, as well as Qwen-Chat, the chat models, namely Qwen-1.8B-Chat, Qwen-7B-Chat, Qwen-14B-Chat, and Qwen-72B-Chat.
Sora Reference Papers. A collection of all papers referenced in OpenAI's "Video generation models as world simulators"
repeng. Control vectors are a low-cost way to steer the output of language models. Compared to LoRA, they are cheaper to train yet can still be fairly powerful. This library makes working with them simple.
OpenRLHF. This is a Ray-based implementation of RLHF for Mistral and other Llama-style models. Several PPO stabilizing techniques are included to enhance performance.
3D Diffuser Actor: Policy Diffusion with 3D Scene Representations. To enhance robot manipulation, the 3D Diffuser Actor blends 3D scene representations with diffusion strategies. Robots are better able to comprehend and engage with their surroundings thanks to this AI-driven method.
How to jointly tune learning rate and weight decay for AdamW. AdamW is often considered a method that decouples weight decay and learning rate. In this blog post, we show that this is not true for the specific way AdamW is implemented in PyTorch. We also show how to adapt the tuning strategy to fix this: when doubling the learning rate, the weight decay should be halved (see the sketch after this list).
OpenLLMetry-JS. OpenLLMetry-JS is a set of extensions built on top of OpenTelemetry that gives you complete observability over your LLM application. Because it uses OpenTelemetry under the hood, it can be connected to your existing observability solutions - Datadog, Honeycomb, and others.
List of GPU clusters for rent. A list of entire GPU clusters that can be rented on an hourly basis.
Mamba: The Hard Way. A detailed description of how Mamba works.
New benchmark for large language models. A collection of nearly 100 tests the author extracted from their actual conversation history with various LLMs.
BoCoEL. Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) is 10 times faster with just a few lines of modular code.
FiT: Flexible Vision Transformer for Diffusion Model. This repo contains PyTorch model definitions, pre-trained weights, and sampling code for our flexible vision transformer (FiT). FiT is a diffusion transformer-based model that can generate images at unrestricted resolutions and aspect ratios.
RobustVLM. This study presents a novel technique to defend multi-modal models like OpenFlamingo and LLaVA against visual adversarial attacks. By fine-tuning the CLIP visual encoder in an unsupervised way, the authors defend these models against manipulative image attacks, increasing their reliability and security in practical applications without requiring full model retraining.
HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings. A popular benchmark called Holistic Evaluation of Language Models (HELM) was issued by the Stanford language modeling group. Additionally, they created HELM-Instruct, a version for instruction following. It is absolute, open-ended, and multifaceted.
LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4. We’re excited to release LoRA Land, a collection of 25 fine-tuned Mistral-7b models that consistently outperform base models by 70% and GPT-4 by 4-15%, depending on the task. This collection of specialized fine-tuned models–all trained with the same base model–offers a blueprint for teams seeking to efficiently and cost-effectively deploy highly performant AI systems.
Multimodal LLM’s Ability to Understand Visual Data. A new tool called ChartX is designed to assess how well multi-modal large language models (MLLMs) can understand and make sense of visual charts.
A Critical Evaluation of AI Feedback for Aligning Language Models. This repository questions the efficacy of combining reinforcement learning with supervised fine-tuning during training. The more involved two-step technique can be outperformed by simply fine-tuning on data from a more capable model, such as GPT-4.
MMCSG Dataset. The MMCSG (Multi-Modal Conversations in Smart Glasses) dataset comprises two-sided conversations recorded using Aria glasses, featuring multi-modal data such as multi-channel audio, video, accelerometer, and gyroscope measurements. This dataset is suitable for research in areas like automatic speech recognition, activity detection, and speaker diarization.
MultiLora inference server. One base model can have many LoRAs hot-swapped onto it using the Lorax inference server. This allows a large variety of model tunes to be supported with a significant reduction in RAM use.
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations. GTBench is a language-driven environment, evaluating the strategic reasoning limitations of LLMs through game-theoretic tasks. GTBench is built on top of OpenSpiel, supporting 10 widely-recognized games
CrewAI. A library called CrewAI is available for creating and managing AI agents that make use of Replit and LangChain. It offers an easy-to-integrate modular setup comprising tasks, agents, crews, and tools for a variety of applications. LangSmith improves performance insights into non-deterministic LLM calls while streamlining the debugging process.
gemma.cpp. gemma.cpp is a lightweight, standalone C++ inference engine for the Gemma foundation models from Google.
MMedLM. The official codes for "Towards Building Multilingual Language Model for Medicine".
LLM Evaluation Metrics for Labeled Data. How to measure the performance of LLM applications with ground truth data.
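A compressed sketch of the byte-level BPE training loop that minbpe implements; the helper names and the tiny merge budget are illustrative rather than the library's API.

```python
from collections import Counter

def get_pair_counts(ids):
    """Count adjacent pairs of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "byte pair encoding encodes bytes"
ids = list(text.encode("utf-8"))       # byte-level: start from raw UTF-8 bytes (0..255)
merges = {}
for new_id in range(256, 276):         # learn 20 merges for this toy example
    counts = get_pair_counts(ids)
    if not counts:
        break
    pair = counts.most_common(1)[0][0]  # greedily merge the most frequent adjacent pair
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
```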
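A small sketch of the AdamW tuning heuristic described above: because PyTorch multiplies weight decay by the learning rate at every step, keeping their product constant preserves the effective decay when you change the learning rate. The base values and the helper function are assumptions for illustration.

```python
import torch

def make_adamw(params, lr, base_lr=1e-3, base_wd=1e-2):
    # In PyTorch's AdamW the per-step decay is lr * weight_decay, so when the learning
    # rate is doubled we halve weight_decay to keep the effective decay unchanged.
    wd = base_wd * (base_lr / lr)
    return torch.optim.AdamW(params, lr=lr, weight_decay=wd)

model = torch.nn.Linear(16, 4)
opt = make_adamw(model.parameters(), lr=2e-3)   # weight decay becomes 5e-3 instead of 1e-2
print(opt.defaults["lr"], opt.defaults["weight_decay"])
```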

Perspectives

Link description
The data revolution in venture capital. Investors, data scientists, and tool builders leading the data-driven future of venture capital.
The Three C's: Creativity, Collaboration, and Communication. The way we communicate, work together, and complete creative projects has changed significantly since the invention of computing. With AI, we're beginning to witness the commencement of another significant change. We undervalue how significant this change will be. Businesses that integrate artificial intelligence (AI) into their products from the start will have a significant edge over those who add it later to already-existing goods.
Inside OpenAI Logan Kilpatrick (head of developer relations). Have you ever wondered how OpenAI develops and innovates so quickly? The head of developer relations at OpenAI, Logan Kilpatrick, talks about the company's decision-making structure for product launches, high agency and urgency, and OpenAI's distinct culture in this podcast.
Mind-reading devices are revealing the brain’s secrets. Implants and other technologies that decode neural activity can restore people’s abilities to move and speak — and help researchers understand how the brain works.
Generative AI’s environmental costs are soaring — and mostly secret. First-of-its-kind US bill would address the environmental costs of the technology, but there’s a long way to go.
Strategies for an Accelerating Future. With Google's Gemini providing a context window of over a million tokens and Groq's hardware enabling almost instantaneous responses from GPT-3.5 models, among other recent advancements in AI, these represent a significant advancement in practical AI applications and highlight the pressing need for leaders to comprehend and adjust to the rapidly changing AI landscape.
How to lose at Generative AI!. Despite the excitement, generative AI is likely to let most startups down, since it favors established players with data advantages, established workflows, and the ability to integrate AI without major system changes. A difficult road lies ahead for startups hoping to make a significant impact in the generative AI space, even with venture capital flooding in. By concentrating on prompt engineering and UX improvements at the workflow layer, these startups are essentially preparing the market for incumbents who can readily adopt and integrate AI innovations into their dominant platforms.
Stockholm declaration on AI ethics: why others should sign. The use of artificial intelligence (AI) in science has the potential to do both harm and good. As a step towards preventing the harm, we have prepared the Stockholm Declaration on AI for Science.
This is why the idea that AI will just augment jobs, never replace them, is a lie! AI will automate labor in certain areas. The response thus far has been divided: will increased efficiency allow the same number of human workers to accomplish more, or will fewer workers be needed? This article compares the effects of technology on manufacturing, agriculture, and the contemporary knowledge worker.
LLM evaluation at scale with the NeurIPS Large Language Model Efficiency Challenge. After a year of breakneck innovation and hype in the AI space, we have now moved sufficiently beyond the peak of the hype cycle to start asking a critical question: are LLMs good enough yet to solve all of the business and societal challenges we are setting them up for?

meme-of-the-week

Back to index

ML news: Week 12 - 18 February

Research

Link description
Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills. Transferring expertise between RL agents has so far proven difficult. This work optimizes an environment-agnostic skill set, and its generalization performance is encouraging.
Self-Play Fine-Tuning (SPIN). We propose a new fine-tuning method called Self-Play fine-tuning (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data.
Real-World Fluid Directed Rigid Body Control via Deep Reinforcement Learning. "Box o Flows" addresses the difficulty of replicating complicated fluid dynamics for reinforcement learning (RL) applications by introducing a unique experimental system for testing RL algorithms in dynamic real-world environments. It demonstrates how model-free reinforcement learning algorithms may produce complex behaviors from simple rewards, improve data efficiency through offline reinforcement learning, and open the door to more widespread RL use in complex systems.
WebLINX. WebLINX is a collection of 100,000 web-based conversations in a conversational format, released to advance research on web navigation guided by language models.
ImplicitDeepfake: Plausible Face-Swapping through Implicit Deepfake Generation using NeRF and Gaussian Splatting. To produce incredibly lifelike 3D avatars, this work presents ImplicitDeepfake1, a novel method that blends deepfake technology with Gaussian Splatting (GS) and Neural Radiance Fields (NeRFs).
AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts. Researchers have created a novel method to improve language models' mathematical proficiency by letting base models choose excellent mathematical information on their own.
Complete Instances Mining for Weakly Supervised Instance Segmentation. A novel method for image segmentation has been presented by researchers that uses just simple image labels to identify particular portions of a picture, such as a dog. They overcame the difficulty of a network identifying many occurrences of the same object by presenting an innovative technique that improves efficiency and lowers mistake rates.
Whispers in the Machine: Confidentiality in LLM-integrated Systems. The increasing pairing of large language models with external tools has given rise to new vulnerabilities associated with data breaches. This research offers a systematic way to assess the privacy-protection efficacy of various AI systems.
This AI learned language by seeing the world through a baby’s eyes. An artificial intelligence (AI) model has learned to recognize words such as ‘crib’ and ‘ball’ by studying headcam recordings of a tiny fraction of a single baby’s life. Original article.
World Model on Million-Length Video and Language with RingAttention. Using ring attention and an optimized 7B-parameter model, this model can correctly answer queries over videos up to a million tokens long. It performs exceptionally well on retrieval benchmarks and beats commercial VLMs.
LUMIERE - A Space-Time Diffusion Model for Video Generation. A new text-to-video model from Google that can also accept images and style references as input. It diffuses everything simultaneously via a brand-new "space-time UNet."
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction. With the help of textual descriptions, SEINE is a novel video diffusion model that can expand short AI-generated video clips into larger, narrative-level segments with smooth and creative scene transitions.
Text-Driven Image Editing via Learnable Regions. Given an input image and a language description for editing, our method can generate realistic and relevant images without the need for user-specified regions for editing. It performs local image editing while preserving the image context. Our method can also handle multiple-object and long-paragraph scenarios.
Video annotator. The annotation process directly incorporates subject experts thanks to the Video Annotator framework. This novel method increases the accuracy and efficiency of the model by combining human expertise with zero-shot and active learning techniques.
Automated Unit Test Improvement using Large Language Models at Meta. Meta created tests for its code base using massive language models. It discovered significant gains in overall code quality and test coverage.
Meta’s V-JEPA model. According to Yann LeCun, VP and Chief AI Scientist at Meta, more data-efficient self-supervised models are required for general intelligence. This approach, which uses models trained on video to comprehend parts of the world, is a first step in that direction. The models can be accessed by the general public.
Extreme Video Compression with Pre-trained Diffusion Models. Diffusion models have been used by researchers to create a novel video compression technique that produces high-quality video frames at low data rates.

News

Link description
Laion releases assistant BUD-E. An open assistant that runs on a gaming laptop and utilizes highly optimized language models and natural voice has been made available by the Laion research group. The project's goal is to offer a capable, low-resource personal assistant that is simple to deploy.
OpenAI Hits $2 Billion Revenue Milestone. Microsoft-backed OpenAI hit the $2 billion revenue milestone in December. The company's annualized revenue topped $1.6 billion in December based on strong growth from its ChatGPT product, up from $1.3 billion as of mid-October, the Information had reported previously.
AI PCs will make up nearly 60% of total PC shipments by 2027. Demand for AI PCs to start ramping up this year
The first human received an implant from Neuralink yesterday and is recovering well. Initial results show promising neuron spike detection.
Reka Flash: An Efficient and Capable Multimodal Language Model. Reka Flash is a state-of-the-art 21B model trained entirely from scratch and pushed to its absolute limits. It serves as the “turbo-class” offering in our lineup of models. Reka Flash rivals the performance of many significantly larger models, making it an excellent choice for fast workloads that require high quality. On a myriad of language and vision benchmarks, it is competitive with Gemini Pro and GPT-3.5.
Apple releases ‘MGIE’, a revolutionary AI model for instruction-based image editing. Apple has released a new open-source AI model, called “MGIE,” that can edit images based on natural language instructions. MGIE, which stands for MLLM-Guided Image Editing, leverages multimodal large language models (MLLMs) to interpret user commands and perform pixel-level manipulations. The model can handle various editing aspects, such as Photoshop-style modification, global photo optimization, and local editing.
DeepMind framework offers a breakthrough in LLMs’ reasoning. A breakthrough approach in enhancing the reasoning abilities of large language models (LLMs) has been unveiled by researchers from Google DeepMind and the University of Southern California. Their new ‘SELF-DISCOVER’ prompting framework – published this week on arXiv and Hugging Face – represents a significant leap beyond existing techniques, potentially revolutionizing the performance of leading models such as OpenAI’s GPT-4 and Google’s PaLM 2.
Meta will start detecting and labeling AI-generated images from other companies. The feature will arrive on Facebook, Instagram, and Threads in the coming months
Stability and Wurstchen release new text-to-image model. Stable Cascade is a new text-to-image model building upon the Würstchen architecture. It is exceptionally easy to train and finetune on consumer hardware thanks to its three-stage approach. In addition to checkpoints and inference scripts, scripts for finetuning, ControlNet, and LoRA training are being released so users can experiment further with the new architecture; everything can be found on the Stability GitHub page.
Memory and new controls for ChatGPT. OpenAI is testing a new feature that allows ChatGPT to remember facts across conversations. This can be switched off if desired. It will allow for a higher measure of personalization when interacting with the chat system.
Report: Sam Altman seeking trillions for AI chip fabrication from UAE, others. On Thursday, The Wall Street Journal reported that OpenAI CEO Sam Altman is in talks with investors to raise as much as $5 trillion to $7 trillion for AI chip manufacturing, according to people familiar with the matter. The funding seeks to address the scarcity of graphics processing units (GPUs) crucial for training and running large language models like those that power ChatGPT, Microsoft Copilot, and Google Gemini.
Meta to deploy in-house custom chips this year to power AI drive. Facebook owner Meta Platforms plans to deploy into its data centers this year a new version of a custom chip aimed at supporting its artificial intelligence (AI) push, according to an internal company document seen by Reuters on Thursday.
Google Launches €25 Million AI Opportunity Initiative for Skills Training Across Europe. By investing in AI literacy, infrastructure, and partnerships across sectors, the company hopes to empower broad segments of the workforce with valuable future-proof skills.
The brain area that lights up in prickly people. Those who are quick to take offense show similar levels of activity in a region of the brain that’s crucial for decision-making.
Disrupting malicious uses of AI by state-affiliated threat actors. OpenAI discovered and terminated accounts affiliated with nation-states using GPT models for malicious purposes.
Andrej Karpathy is leaving OpenAI again — but he says there was no drama. Andrej Karpathy, a widely respected research scientist, announced today that he has left OpenAI. This is the second time Karpathy has left the top AI firm and his departure is not because of any event, issue, or drama, he said.
NVIDIA’s new AI chatbot runs locally on your PC. NVIDIA just released a free demo version of a chatbot that runs locally on your PC. This is pretty neat, as it gives the chatbot access to your files and documents. You can feed Chat with RTX a selection of personal data and have it create summaries based on that information. You can also ask it questions, just like any chatbot, and dive into your data for answers.
MAGNeT: Masked Audio Generation using a Single Non-Autoregressive Transformer. Facebook unveiled an advanced open-source audio model that is 7 times faster than competing models without compromising quality. It can produce sound effects and music, and the paper is now available.
MIMIR. Python package for measuring memorization in LLMs.
Nvidia is now worth as much as the whole Chinese stock market. Nvidia is now worth the same as the whole Chinese stock market as defined by Hong Kong-listed H-shares, Bank of America chief investment strategist Michael Hartnett pointed out in a new note. The company's market cap has hit $1.7 trillion, the same as all Chinese companies listed on the Hong Kong Stock Exchange. Nvidia's stock soared 239% in 2023 and is up 41% in 2024, through Thursday.
OpenAI Sora. A new video-generating model with amazing quality was revealed by OpenAI. Red teamers are allowed to test it right now.
Lambda Raises $320M To Build A GPU Cloud For AI. Lambda’s mission is to build the #1 AI compute platform in the world. To accomplish this, we’ll need lots of NVIDIA GPUs, ultra-fast networking, lots of data center space, and lots of great new software to delight you and your AI engineering team.
USPTO says AI models can’t hold patents. The United States Patent and Trademark Office (USPTO) published guidance on inventorship for AI-assisted inventions, clarifying that while AI systems can play a role in the creative process, only natural persons (human beings) who make significant contributions to the conception of an invention can be named as inventors. It also rules out using AI models to churn out patent ideas without significant human input.

Resources

Link description
RLX: Reinforcement Learning with MLX. RLX is a collection of Reinforcement Learning algorithms implemented based on the implementations from CleanRL in MLX, Apple's new Machine Learning framework.
llmware. llmware is a unified framework for developing LLM-based application patterns including Retrieval Augmented Generation (RAG). This project provides an integrated set of tools that anyone can use - from a beginner to the most sophisticated AI developer - to rapidly build industrial-grade, knowledge-based enterprise LLM applications with a specific focus on making it easy to integrate open-source small specialized models and connecting enterprise knowledge safely and securely.
Point Transformer V3. For processing 3D point clouds, the Point Transformer V3 (PTv3) model is an effective and straightforward paradigm. By putting more of an emphasis on efficiency and scaling up than on fine-grained design details, it can attain quicker processing speeds and improved memory economy.
phidata. Phidata is a toolkit for building AI Assistants using function calls. Function calling enables LLMs to achieve tasks by calling functions and intelligently choosing their next step based on the response, just like how humans solve problems.
ml-mgie. Apple released code that uses multimodal language models to improve human-provided natural language edits to images.
Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting. Lag-Llama is the first open-source foundation model for time series forecasting!
Learning to Fly in Seconds. This repository contains the code for the paper Learning to Fly in Seconds. It allows training end-to-end control policies using deep reinforcement learning. The training is done in simulation and finishes within seconds on a consumer-grade laptop. The trained policies generalize and can be deployed on real quadrotors.
Packing Inputs Without Cross-Contamination Attention. Packing training examples by concatenating them improves training efficiency, but if attention is handled carelessly one example can contaminate the next, because attention does not know where an example ends. The community has found that EOS tokens are often sufficient, yet issues can still arise. This repository offers a Hugging Face implementation that packs input data correctly for popular models (see the mask sketch after this list).
ZLUDA. ZLUDA lets you run unmodified CUDA applications with near-native performance on AMD GPUs.
GenTranslate. GenTranslate is a novel method that leverages large language models to improve translation quality, focusing on the best translations produced by foundation models. Tests show the approach outperforms state-of-the-art translation models.
Design2Code. Design2Code is an open-source project that converts various web design formats, including sketches, wireframes, Figma, XD, etc., into clean and responsive HTML/CSS/JS code. Just upload your design image, and Design2Code will automatically generate the code for you. It's that simple!
SGLang. SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.
DALI. This study presents cutting-edge techniques to guarantee that autonomous intelligent agents, which are essential in life-critical applications, remain morally and ethically sound even as they evolve.
Reor Project. Reor is an AI-powered desktop note-taking app: it automatically links related ideas, answers questions on your notes and provides semantic search. Everything is stored locally and you can edit your notes with an Obsidian-like markdown editor.
Dinosaur: differentiable dynamics for global atmospheric modeling. The Google group has made code available to support atmospheric modeling. DeepMind's latest weather modeling tools are built around this code.
Neural Flow. This is a Python script for plotting the intermediate layer outputs of Mistral 7B. When you run the script, it produces a 512x256 image representing the output at every layer of the model. The concept is straightforward: collect the output tensors from each layer, normalize them between zero and one, and plot these values as a heatmap. The resulting image reveals a surprising amount of structure, and the author has found it enormously helpful for visually inspecting outputs when fine-tuning models (see the sketch after this list).
Tabula Rasa: not enough data? Generate them! How you can apply generative AI to tabular data.
A practical guide to neighborhood image processing. Love thy neighbors: how neighboring pixels influence a pixel.
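A minimal sketch of the cross-contamination issue from the packing entry above: when several examples are packed into one sequence, the attention mask should be block-diagonal and causal within each example so tokens cannot attend across example boundaries. The mask builder below is a generic illustration, not the repository's implementation.

```python
import torch

def packed_attention_mask(lengths):
    """Build a (total_len, total_len) boolean mask that is causal within each
    packed example and blocks attention across example boundaries."""
    total = sum(lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in lengths:
        block = torch.ones(n, n).tril().bool()   # causal mask inside one example
        mask[start:start + n, start:start + n] = block
        start += n
    return mask

# Three examples of lengths 3, 2, and 4 packed into a single 9-token sequence.
print(packed_attention_mask([3, 2, 4]).int())
```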
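And a rough sketch of the Neural Flow recipe described above: use forward hooks to collect each layer's output, min-max normalize, and plot a heatmap. The tiny stand-in model is an assumption; the original script targets Mistral 7B.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Stand-in for a transformer stack: any model whose layers we can hook.
model = nn.Sequential(*[nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(12)])
activations = []
hooks = [layer.register_forward_hook(lambda m, i, o: activations.append(o.detach()))
         for layer in model]

with torch.no_grad():
    model(torch.randn(1, 64))
for h in hooks:
    h.remove()

# One row per layer: normalize each layer's output to [0, 1] and stack into a heatmap.
rows = []
for a in activations:
    flat = a.flatten()
    rows.append((flat - flat.min()) / (flat.max() - flat.min() + 1e-8))
plt.imshow(torch.stack(rows).numpy(), aspect="auto", cmap="viridis")
plt.xlabel("hidden unit")
plt.ylabel("layer")
plt.savefig("neural_flow_sketch.png")
```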

Perspectives

Link description
AI agents as a new distribution channel. By making judgments about what to buy on behalf of customers, AI agents are starting to emerge as a new route of distribution that might level the playing field between startups and established players. Businesses will need to adjust their goods to cater to AI tastes instead of human ones as this trend develops, which will alter the conventional dynamics of product appraisal, purchase, and discovery. The development of AI portends a time when agent-driven commerce may completely change the way goods are advertised and bought.
Thinking about High-Quality Human Data. The topic of this piece is how people generate data. It also covers labeling, annotating, and gathering preference data, among other topics.
AI Aesthetics. Artificial Intelligence will radically transform the way we create, appreciate, and produce art. This article delves deeper into this topic and identifies the businesses spearheading the shift.
NYC: Brain2Music. Research talk from Google about reading music from a person’s brain.
Massed Muddler Intelligence. A move away from conventional monolithic AI scaling and toward a paradigm based on distributed, agent-based systems that learn and adapt in real-time is represented by the idea of massed muddler intelligence, or MMI. MMI promotes AI development that stresses scalable, interactive agents with a degree of autonomy and mutual governance, moving away from the current focus on accumulating larger datasets and computational resources. This approach is based on the principles of embodiment, boundary intelligence, temporality, and personhood.
AI Could Actually Help Rebuild The Middle Class. AI doesn’t have to be a job destroyer. It offers us the opportunity to extend expertise to a larger set of workers.
Letter from the YouTube CEO: 4 Big bets for 2024. YouTube is investing in diverse revenue streams for creators. The platform witnessed a 50% increase in the use of channel memberships. It is creating creator support networks through programs like the Creator Collective. Efforts are undertaken to help politicians appreciate and respect the economic and entertainment worth of artists.
Meta’s AI Chief Yann LeCun on AGI, Open-Source, and AI Risk. Ahead of the award ceremony in Dubai, LeCun sat down with TIME to discuss the barriers to achieving “artificial general intelligence” (AGI), the merits of Meta’s open-source approach, and what he sees as the “preposterous” claim that AI could pose an existential risk to the human race.
Deepfakes, trolls and cybertroopers: how social media could sway elections in 2024. Faced with data restrictions and harassment, researchers are mapping out fresh approaches to studying social media’s political reach.
Why "Chat over Your Data" Is Harder Than You Think. Contrary to popular belief, developing chat-based, domain-specific LLM applications and copilots is challenging. Achieving strong performance, managing intricate queries and data, and providing robust data retrieval for LLM-based chat apps are a few of the difficulties.

meme-of-the-week

Back to index

ML news: Week 5 - 11 February

Research

Link description
Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection. When it comes to detecting fake news, a fine-tuned BERT model performs better than an off-the-shelf LLM like GPT-3.5-turbo.
PAP-REC: Personalized Automatic Prompt for Recommendation Language Model. To improve the efficacy and efficiency of Recommendation Language Models, PAP-REC has developed a technique that automatically generates tailored prompts.
PAM: Prompting Audio-Language Models for Audio Quality Assessment. PAM is a tool that evaluates audio quality without reference tracks or specific training by using Audio-Language Models.
AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning. AnimateLCM is a novel method that splits the learning process into two parts to rapidly produce high-quality videos and enhance existing video diffusion models.
Boximator: Generating Rich and Controllable Motions for Video Synthesis. Controlling video synthesis is a well-known challenge. This paper suggests guiding the generation using boxes and arrows over time, which enhances human preference judgment but still leaves the user with imperfect guidance.
KTO: Model Alignment as Prospect Theoretic Optimization. Kahneman-Tversky Optimization (KTO) is a novel method for conditioning AI models to more closely resemble human thought processes. Utilizing ideas from prospect theory developed by Kahneman & Tversky, KTO prioritizes utility above preference likelihood.
A simple method to reduce hallucination in Large Vision-Language Models. This study clarifies the reasons for multimodal hallucination, a condition in which large vision-language models (LVLMs) occasionally represent visuals erroneously. One important factor is semantic shift bias, especially at paragraph breaks.
CapHuman: Capture Your Moments in Parallel Universes. Given only one reference facial photograph, our CapHuman can generate photo-realistic specific individual portraits with content-rich representations and diverse head positions, poses, facial expressions, and illuminations in different contexts.
Nomic Embed: Training a Reproducible Long Context Text Embedder. Nomic-Embed-Text-V1 is an open-source, completely reproducible text embedding model that raises the bar. It does well on activities with both short and lengthy contexts. Nomic-Embed-Text-V1, which is transparent to the extreme, provides full access to its model weights, training code, and a large dataset consisting of 235 million text pairs.
SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training? Training large-scale picture models is difficult due to legitimate copyright concerns and the disappearance of large-scale datasets like LAION. This work demonstrates that 30 million artificially created pictures may be used to train a strong CLIP model.
Rethinking Optimization and Architecture for Tiny Language Models. This work investigates how to focus on small models with fewer parameters to develop strong language models better suited for mobile devices.
Unified Hallucination Detection for Multimodal Large Language Models. To address the important problem of hallucinations in Multimodal Large Language Models (MLLMs), researchers have created a new benchmark called MHaluBench, which is used to assess different hallucination detection techniques.
InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions. With InteractiveVideo, users may now create videos in a new style that allows for dynamic user interaction. This intuitive framework, in contrast to conventional techniques, enables real-time adjustments utilizing text, graphics, painting, and even drag-and-drop.
DeepSeekMath. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4.
Natural language guidance of high-fidelity text-to-speech models with synthetic annotations. These Stability AI-trained text-to-speech models can follow precise natural language instructions. Because no sizable dataset pairs audio with suitable textual descriptions, the developers synthetically annotated a large speech corpus for training, a further illustration of the broader trend of up-captioning and annotating data for generative modeling.
MusicRL: Aligning Music Generation to Human Preferences. The Google MusicLM team applied an RL approach to their music-generation models using 300k pieces of feedback and other reward signals. In human preference experiments it outperforms the base model, though it is not yet clear whether the RL technique produces the highest-quality output.
A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation. To increase CLIP's performance in picture classification tasks without needing more training or resources, this article revisits the traditional Gaussian Discriminant Analysis (GDA) approach.
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. The line of sophisticated vision-language models for mobile devices known as MobileVLM V2 offers appreciable performance gains thanks to creative architecture.
The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs. According to a recent study, multi-modal large language models (MLLMs) like GPT-4V have a flaw in that they make mistakes when dealing with particular kinds of image-text inputs. A benchmark called CorrelationQA was created to assess how well MLLMs performed in situations where text could be contradicted or misled by visuals.
Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction. The creation of a generalist AI agent that can comprehend and adhere to gaming instructions is examined in this research as a first step toward "read-to-play" capabilities. The researchers incorporate multimodal game instructions into a decision transformer to improve the agent's multitasking and generalization abilities.
MetaTree: Learning a Decision Tree Algorithm with Transformers. MetaTree is a transformer-based decision tree algorithm. It learns from classical decision tree algorithms for better generalization capabilities.

News

Link description
Sakana Awarded Japanese Government Supercomputing Grant. Sakana AI is one of seven institutions in Japan chosen by the Japanese government to receive a supercomputing grant, for encouraging the development of foundation AI models to strengthen the capabilities of Japan’s generative AI ecosystem.
Hugging Face launches open source AI assistant maker to rival OpenAI’s custom GPTs. Hugging Face, the New York City-based startup that offers a popular, developer-focused repository for open source AI code and frameworks (and hosted last year’s “Woodstock of AI”), today announced the launch of third-party, customizable Hugging Chat Assistants.
Arc is building an AI agent that browses on your behalf. The Browser Company, which makes the Arc Browser, is on a quest to change that by building an AI that surfs the web for you and gets you the results while bypassing search engines.
Introducing Qwen1.5. A collection of outstanding multilingual models spanning 0.5B to 72B parameters. Notably, the smallest model is the first significant sub-1B-parameter language model.
Inside OpenAI’s Plan to Make AI More ‘Democratic’. Colin Megill met with Wojciech Zaremba, co-founder of OpenAI, in May 2023 to talk about integrating Polis, an AI-powered public debating platform that promotes democratic involvement. The cooperation sought to use public feedback to match AI with human ideals. It started the "Democratic Inputs to AI" project at OpenAI, which aims to investigate AI governance through a $1 million award program.
Roblox releases real-time AI chat translator. Roblox built an AI model that it says translates text chats so quickly users may not even notice it’s translating the messages of other players at first. It works with 16 languages, including English, French, Japanese, Thai, Polish, and Vietnamese.
OpenAI is adding new watermarks to DALL-E 3. OpenAI says watermarks in image metadata are not perfect, but they help build trust of digital information.
Microsoft Copilot for Sales and Copilot for Service are now generally available. The AI-powered Copilot for Sales and Service from Microsoft is now widely accessible. It increases the efficiency of sales and support staff by integrating with CRM platforms like Salesforce. The solutions promise to improve customer interactions and expedite company operations by automating repetitive tasks and providing insights directly within Microsoft 365 apps. Early users of these AI capabilities, such as Avanade, report considerable time savings and improved client engagement.
First passages of rolled-up Herculaneum scroll revealed. Researchers used artificial intelligence to decipher the text of 2,000-year-old charred papyrus scripts, unveiling musings on music and capers.
IBM wants to build a 100,000-qubit quantum computer. The company wants to make large-scale quantum computers a reality within just 10 years.
Microsoft brings new AI image functionality to Copilot, adds new model Deucalion. In a startling move, Microsoft today announced a redesigned look for its Copilot AI search and chatbot experience on the web (formerly known as Bing Chat), new built-in AI image creation and editing functionality, and a new AI model, Deucalion, that is powering one version of Copilot.
Meet ‘Smaug-72B’: The new king of open-source AI. A new open-source language model has claimed the throne of the best in the world, according to the latest rankings from Hugging Face, one of the leading platforms for natural language processing (NLP) research and applications.
EU’s AI Act passes last big hurdle on the way to adoption. The European Union’s AI Act, a risk-based plan for regulating applications of artificial intelligence, has passed what looks to be the final big hurdle standing in the way of adoption after Member State representatives today voted to confirm the final text of the draft law.
OpenAI forms a new team to study child safety. Under scrutiny from activists — and parents — OpenAI has formed a new team to study ways to prevent its AI tools from being misused or abused by kids.
Human brain cells hooked up to a chip can do speech recognition. Clusters of brain cells grown in the lab have shown potential as a new type of hybrid bio-computer.
Bard becomes Gemini: Try Ultra 1.0 and a new mobile app today. Google has rebranded Bard as Gemini and launched a new service that finally gives access to Gemini Ultra 1.0, although the model requires a monthly subscription. A companion mobile app is also available.
1X robotics demonstration. The robotics startup 1X has made significant advances in video-to-control models. The company has demonstrated its robot, driven by neural networks that generate 10 Hz control signals from visual input, performing a variety of tasks.
AR glasses with multimodal AI nets funding from Pokémon GO creator. Today, Singapore-based Brilliant Labs announced its new product, Frame, a pair of lightweight AR glasses powered by a multimodal AI assistant called Noa. The glasses have captured the attention and investment of John Hanke, CEO of Niantic, the augmented reality platform behind games like Pokémon GO.

Resources

Link description
aphrodite-engine. For AI inference workloads, the Aphrodite engine can increase throughput while lowering VRAM needs.
chatllm-vscode. ChatLLM is a VSCode extension for interacting with LLM APIs in a flexible and long-form manner. It leverages the VSCode notebook support to do so, creating a new type of notebook (.chatllm) files where you can interact with an (API-based) LLM system over a long document.
diffusers v0.26.0. This new release comes with two new video pipelines, a more unified and consistent experience for single-file checkpoint loading, support for multiple IP-Adapters’ inference with multiple reference images, and more.
Ollama vision models. Ollama recently introduced support for vision models. LLaVA 1.6 is available through both the Python and JavaScript packages, which offer improved support and vision functionality.
Image to Music v2. A pipeline that chains image-to-text, text-to-prompt, and prompt-to-music models to turn images into music.
3DTopia. A two-stage text-to-3D generation model. The first stage uses a diffusion model to quickly generate candidates. The second stage refines the assets chosen from the first stage.
Open Source Alternative to Rabbit. An open-source version of the Rabbit hardware, complete with language modeling, is being developed by a team.
NaturalSQL by ChatDB. NaturalSQL by ChatDB is a series of models with state-of-the-art performance on Text to SQL instructions.
contextual_bandits_tutorial. Meta maintains an RL framework called Pearl. This tutorial uses the library to walk through a bandit-based learning problem.
BRIA Background Removal v1.4 Model Card. RMBG v1.4 is our state-of-the-art background removal model, designed to effectively separate foreground from background in a range of categories and image types. This model has been trained on a carefully selected dataset, which includes: general stock images, e-commerce, gaming, and advertising content, making it suitable for commercial use cases powering enterprise content creation at scale.
MetaVoice-1B. A small and powerful text-to-speech model that supports generation and voice cloning.
Latxa. Latxa is a collection of foundation models specifically tuned for Basque.
fabric. An open-source framework for augmenting humans using AI.
YOLO-World. Object detection is the task of locating objects and their bounding boxes in an image. Usually, detection is limited to a predetermined set of classes chosen at training time. This work presents a real-time approach to open-vocabulary object detection, i.e., detecting bounding boxes for any set of objects supplied at run time.
SELF-DISCOVER. The implementation of SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures, a prompting technique that lets language models combine a set of reasoning primitives into a larger, problem-specific reasoning framework.
AI Filter. AI Filter is a Chrome extension that uses a local language model to filter your social media feeds (currently, only Twitter / X) according to your instructions.
Fully Local RAG using Ollama & PgVector. Using Ollama, pgvector, and local data, you can build a capable RAG system that runs entirely on your own hardware; a minimal sketch of such a setup appears after this list.
LightEval. LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside the recently released LLM data processing library datatrove and LLM training library nanotron.
CogCoM. CogCoM is a general vision-language model (VLM) endowed with a Chain of Manipulations (CoM) mechanism, that enables VLMs to perform multi-turns evidential visual reasoning by actively manipulating the input image. We now release CogCoM-base-17b, a model with 10 billion visual parameters and 7 billion language parameters, trained on a data fusion of 4 types of capabilities (instruction-following, OCR, detailed-captioning, and CoM).
How we got fine-tuning Mistral-7B to not suck: Helix Project Report, Feb 2024. HelixML improved its Mistral-7B fine-tunes by generating question pairs that draw on the source material from multiple viewpoints and by producing a content-addressed hash for every document.
VatsaDev/animebench-alpha. A benchmark dataset of quotes and information about different anime characters for evaluating language model performance.
NextBrain: a next-generation, histological atlas of the human brain for high-resolution neuroimaging studies. We present a next-generation probabilistic atlas of the human brain using histological sections of five full human hemispheres with manual annotations for 333 regions of interest. This website enables the interactive inspection of these five cases using a 3D navigation interface and search functionality.
Efficient Linear Model Merging for LLMs. Model merging is a technique for combining multiple pre-trained or fine-tuned LLMs into a single, more capable model. It is particularly useful when individual models excel in different domains or tasks, since merging them can produce a model with a broader range of capabilities and improved overall performance; a minimal weight-averaging sketch follows this list.
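Following up on the Fully Local RAG entry above, here is a minimal sketch of the pattern, assuming an Ollama server running locally with an embedding model and a chat model already pulled (the model names, table schema, and 768-dimension embedding size are illustrative) and a Postgres instance with the pgvector extension installed:

```python
# Minimal local RAG loop: embed with Ollama, store/search with pgvector, answer with a local LLM.
import requests
import psycopg2

OLLAMA = "http://localhost:11434"

def embed(text):
    # Ollama's embeddings endpoint returns {"embedding": [...]}.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def to_vector_literal(vec):
    # pgvector accepts the textual form '[v1,v2,...]'.
    return "[" + ",".join(str(x) for x in vec) + "]"

conn = psycopg2.connect("dbname=rag user=postgres")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, body text, emb vector(768));")

# Index a few local documents.
for doc in ["pgvector adds vector similarity search to Postgres.",
            "Ollama serves local LLMs over a simple HTTP API."]:
    cur.execute("INSERT INTO docs (body, emb) VALUES (%s, %s::vector)",
                (doc, to_vector_literal(embed(doc))))
conn.commit()

# Retrieve the closest chunks and answer with the local model.
question = "How do I search vectors in Postgres?"
cur.execute("SELECT body FROM docs ORDER BY emb <-> %s::vector LIMIT 2",
            (to_vector_literal(embed(question)),))
context = "\n".join(row[0] for row in cur.fetchall())

answer = requests.post(f"{OLLAMA}/api/generate",
                       json={"model": "llama2", "stream": False,
                             "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}"})
print(answer.json()["response"])
```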
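The model-merging entry above describes linear merging at a high level; its simplest form is plain weight averaging of two fine-tunes that share an architecture. A minimal sketch follows (the checkpoint paths and the 50/50 mix are placeholders, and real merging toolkits add per-tensor weighting, interference resolution, and more):

```python
# Linear ("model soup" style) merge of two checkpoints with identical parameter names.
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Return alpha * sd_a + (1 - alpha) * sd_b, key by key."""
    merged = {}
    for name, t_a in sd_a.items():
        t_b = sd_b[name]
        if t_a.is_floating_point():
            merged[name] = alpha * t_a + (1.0 - alpha) * t_b
        else:
            merged[name] = t_a.clone()  # copy integer buffers (e.g., index tensors) from model A
    return merged

sd_a = torch.load("finetune_math.pt", map_location="cpu")   # placeholder paths
sd_b = torch.load("finetune_code.pt", map_location="cpu")
torch.save(merge_state_dicts(sd_a, sd_b, alpha=0.5), "merged.pt")
```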

Perspectives

Link description
MIT Paper: AI’s Labor Market Impacts Are Slower Than Expected. The working paper "Beyond AI Exposure: Which Tasks are Cost-Effective to Automate with Computer Vision?", by researchers from IBM and MIT, examines the economic feasibility of automating vision-based tasks. It finds that only about 23% of them are currently profitable to automate and, in contrast with more disruptive expectations, projects a gradual impact on the job market over several years.
How AI Is Helping Us Learn About Birds. Machine learning is powering new insights into how birds migrate, and forecasts about where they'll go next.
The Techno-Industrial Revolution. The increasing sophistication of AI tooling and corporate use cases will lead to an increasing number of practical uses of the technology. The potential here can be viewed through the lens of how AI will increase margins significantly while lowering costs and improving process efficiency. This could open the door to entirely new approaches that weren't previously viable due to extremely narrow profit margins. A couple of these examples are examined in this article.
The path to profitability for AI in 2024. The emphasis of AI research has recently shifted from accuracy and breadth to efficiency and depth. AI's increasing energy consumption and NVIDIA's H100 sales demonstrate the industry's size. Research is now focused on smaller, more efficient models, such as Phi 2, and emphasizes sustainable economics from model architecture to deployment, all because investments expect profitability. AI's computational efficiency and energy efficiency are expected to increase with advancements in training, fine-tuning, and design. On-device features are a reflection of a larger movement towards more useful and sustainable AI applications.
How design drove $10M in preorders for Rabbit R1 AI hardware. In an expansive interview, Rabbit CEO Jesse Lyu shares how he collaborates with Teenage Engineering, why he didn’t want to make a phone, and how the R1’s retro-future design is key to the company’s strategy.
What’s next for robotaxis in 2024. In addition to restoring public trust, robotaxi companies need to prove that their business models can compete with Uber and taxis.
Google's Gemini Advanced: Tasting Notes and Implications. Google's recently released Gemini Advanced is a GPT-4-class model that behaves much like OpenAI's GPT-4. It excels at providing explanations and at fusing search with images.
Thesis on value accumulation in AI. This investor's perspective breaks down the layers of value that exist in AI today into three categories: AI-enhanced products (like all of you that use AI to improve your products), modeling and core (like OpenAI and Anthropic), and infrastructure layer (like cloud providers and chip makers).

meme-of-the-week

Back to index

ML news: Week 29 January - 4 February

Research

Link description
Matryoshka Representation Learning. OpenAI's new embeddings can be truncated to smaller sizes without losing much quality. This is thought to come from Matryoshka ("nesting doll") representation learning, a training strategy that learns features at multiple granularities; a small truncation sketch appears after this list.
Vivim: a Video Vision Mamba for Medical Video Object Segmentation. A new framework called Vivim efficiently processes lengthy video sequences for medical video object segmentation. In comparison to conventional techniques, Vivim provides faster and more accurate segmentation results by effectively compressing spatiotemporal data using the state space model methodology.
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities. This study presents a unique way to improve transformers by utilizing disparate input from many modalities, e.g., audio data to improve an image model. By connecting the transformers of two distinct modalities in a unique way, the Multimodal Pathway enables a target modality to profit from the advantages of another.
pix2gestalt: Amodal Segmentation by Synthesizing Wholes. A framework called Pix2Gestalt is intended for zero-shot amodal segmentation. When an item is partially occluded, it can rebuild its entire shape and look with great skill. Pix2Gestalt, which makes use of large-scale diffusion models, performs exceptionally well in difficult situations, such as producing artistic images that break convention.
Large-Vocabulary 3D Diffusion Model with Transformer. The variety of objects that can be generated in 3D poses a significant challenge. This study scales the system to a considerably larger range of objects per 3D category and uses a modified architecture to improve sampling efficiency.
SliceGPT: Compress Large Language Models by Deleting Rows and Columns. A promising compression approach that, importantly, works even on models as small as Phi-2. By deleting rows and columns of the weight matrices, it removes a substantial fraction of the parameters with minimal loss in quality at almost all scales.
Learning Universal Predictors. The process of teaching systems to learn from experience and swiftly adjust to new tasks is known as meta-learning. With artificial data produced by a Universal Turing Machine, this Google project enhances Meta-Learning and conducts both theoretical and experimental analysis of the outcomes.
CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion. CreativeSynth is an artistic picture editing technique that combines text and image inputs in a seamless manner. Its diffusion approach, which has specialized attention processes built in, allows for fine alteration of both style and content while maintaining the essential elements of the original artwork.
Annotated Hands for Generative Models. By adding three more channels to training photos for hand annotations, researchers have increased the capacity of generative models, such as GANs and diffusion models, to produce realistic hand images.
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. Many AI systems employ the concept of "up captioning" to enhance labels during training. This work from Apple rephrases C4 as instructions, Q&A pairs, and more in order to apply it to pre-training. The rephrasing step increased convergence by 10x, according to the study, making the model significantly more sample-efficient, albeit at the expense of the rephrasing step itself.
Continual Learning with Pre-Trained Models: A Survey. This work provides an extensive overview of the most recent developments in continuous learning, which is centered on continually adjusting to new information while preserving prior understanding.
MacGNN. The MAcro Recommendation Graph (MAG) and Macro Graph Neural Networks (MacGNN) are introduced in this research. These methods greatly reduce the number of nodes by assembling similar behavior patterns into macro nodes, which addresses the computational difficulty of Graph Neural Networks.
Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates. Our framework can support permitting, policy design, and use of machine learning in regulatory implementation problems.
Weaver: Foundation Models for Creative Writing. Weaver is a family of models trained specifically for creative writing. On a storytelling benchmark, the largest model (34B params) outperforms GPT-4.
Text Image Inpainting via Global Structure-Guided Diffusion Models. In this study, two datasets for handwritten words and scenes are introduced, along with a benchmark. With original, damaged, and assistant photos, the new Global Structure-guided Diffusion Model (GSDM) effectively recovers clean texts by making use of text structure. Both picture quality and identification accuracy demonstrate notable gains.
Multi-granularity Correspondence Learning from Long-term Noisy Videos. With Norton, the multi-granularity noisy correspondence problem in video-language studies is addressed, offering a novel strategy for enhancing long-term video comprehension.
GPAvatar: Generalizable and Precise Head Avatar from Image(s). With the use of a Multi Tri-planes Attention module and a dynamic point-based expression field, GPAvatar presents a novel technique for generating 3D head avatars from photos.
MobileDiffusion: Rapid text-to-image generation on-device. With certain architectural modifications, Google has demonstrated a latent consistency diffusion model that it trained for sub-second generation times on mobile devices.
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks. Shared Network Pre-training (SNP) enhances the joint learning of text and video. Compared to earlier models, this approach is more effective and adaptable and incorporates a novel technique called Significant Semantic Strengthening (S3) to improve comprehension of important terms in sentences.
Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation. An improved version of the Segment Anything Model (SAM) with a focus on hierarchical text segmentation is called Hi-SAM. Hi-SAM is an excellent text segmenter at several levels, ranging from strokes to paragraphs, and it can even analyze layouts.
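A tiny sketch of what the Matryoshka Representation Learning entry above enables in practice: because features are learned at nested granularities, an embedding can be truncated to its first k dimensions and re-normalized. The vectors and sizes below are random placeholders, not actual model outputs:

```python
# Truncate a Matryoshka-style embedding to a cheaper prefix and re-normalize it.
import numpy as np

def truncate_embedding(v, k):
    prefix = np.asarray(v)[:k]
    return prefix / (np.linalg.norm(prefix) + 1e-12)

full_a = np.random.randn(3072)   # stand-in for a full-size embedding
full_b = np.random.randn(3072)

small_a = truncate_embedding(full_a, 256)   # 12x smaller to store and compare
small_b = truncate_embedding(full_b, 256)
print("cosine similarity at 256 dims:", float(small_a @ small_b))
```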

News

Link description
Voltron Data acquires Claypot to unlock real-time AI with modular data systems. Today, San Francisco-based Voltron Data, a startup providing enterprises with a modular and composable approach to building systems for data analytics, confirmed to VentureBeat that it is acquiring real-time AI platform Claypot. The terms of the deal were not disclosed.
FTC investigating Microsoft, Amazon, and Google investments into OpenAI and Anthropic. The commission wants to understand the tangled web of investments between cloud providers and AI startups.
Google’s New AI Is Learning to Diagnose Patients. The DeepMind team turns to medicine with an AI model named AMIE
1/100th of the cost: CPU startup Tachyum claims that one of its processing units can rival dozens of Nvidia H200 GPUs — with a 99% saving that could turn the AI market on its head if true. The 5nm Prodigy processor can dynamically switch between AI, HPC, and cloud workloads and costs $23,000
ChatGPT is violating Europe’s privacy laws, Italian DPA tells OpenAI. OpenAI has been told it is suspected of violating European Union privacy law, following a multi-month investigation of its AI chatbot, ChatGPT, by Italy's data protection authority.
This whimsical clock is the playful gadget AI needs right now. The Poem/1 clock dreams up a new poem every minute to tell you the time. Do you need it? No. But you might want it.
iOS 17.4: Apple continues work on AI-powered Siri and Messages features, with help from ChatGPT. Apple is widely expected to unveil major new artificial intelligence features with iOS 18 in June. Code found by 9to5Mac in the first beta of iOS 17.4 shows that Apple is continuing to work on a new version of Siri powered by large language model technology, with a little help from other sources.
Opera to launch new AI-powered browser for iOS in Europe following Apple’s DMA changes. Opera revealed today that it will launch a new AI-powered browser built on its own engine for iOS in Europe. The Norway-based company announced the change following the news that Apple is going to allow alternative browser engines to run on iOS as a result of the requirements of the European Digital Markets Act (DMA).
Mistral CEO confirms ‘leak’ of new open source AI model nearing GPT-4 performance. The past few days have been a wild ride for the growing open source AI community — even by its fast-moving and freewheeling standards.
Microsoft LASERs away LLM inaccuracies. Microsoft’s LASER method seems counterintuitive, but it makes models trained on large amounts of data smaller and more accurate.
LLaVA-1.6: Improved reasoning, OCR, and world knowledge. The most recent iteration of the visual language model Llava features enhanced reasoning, global knowledge, and OCR. It complements Gemini in some duties. The model, code, and data will be made available by the Llava team.
ServiceNow’s statement on AI. ServiceNow, a company with a $150 billion market capitalization, revealed last week that, among all of its new product family launches, including its initial Pro SKU, its generative AI solutions generated the largest net-new ACV contribution for their first full quarter. It's exciting to see that enterprise-level AI applications are already contributing to significant revenue growth.
Bard’s latest updates: Access Gemini Pro globally and generate images. You can now generate images in Bard in English in most countries around the world, at no cost. This new capability is powered by our updated Imagen 2 model
Amazon debuts ‘Rufus,’ an AI shopping assistant in its mobile app. Amazon announced today the launch of an AI-powered shopping assistant it’s calling Rufus that’s been trained on the e-commerce giant’s product catalog as well as information from around the web.

Resources

Link description
imp-v1-3b. Another multimodal model trained using SigLIP and Phi-2. It is small enough to run on-device and delivers very promising performance.
WebDataset. WebDataset is a library for writing I/O pipelines for large datasets. Its sequential I/O and sharding features make it especially useful for streaming large-scale datasets to a DataLoader; a short usage sketch follows this list.
LLMs-from-scratch. An unfinished yet intriguing series of exercises to teach language model building from the beginning.
Exploring ColBERT with RAGatouille. For RAG applications, ColBERT is a great paradigm to embed queries and index data. This article runs some benchmarks and examines the method's underlying intuition.
mamba.rs. Inspired by efforts on the Llama models, this project uses pure Rust to run inference for Mamba on the CPU.
🦙 Code Llama. Code Llama is a code-specialized version of Llama 2 that was created by further training Llama 2 on its code-specific datasets, sampling more data from that same dataset for longer.
Eagle 7B : Soaring past Transformers with 1 Trillion Tokens Across 100+ Languages (RWKV-v5). A brand new era for the RWKV-v5 architecture and linear transformer has arrived - with the strongest multi-lingual model in open source today
InconsistencyMasks. A novel image segmentation technique called Inconsistency Masks (IM) that works even with sparse data. Tested on the ISIC 2018 dataset, the method outperforms conventional approaches and even surpasses models trained on fully labeled datasets.
distortion-generator. A novel image distortion technique that strikes a balance between privacy and accuracy in biometric systems, rendering facial photos unrecognizable to humans yet identifiable to AI.
TaskingAI. TaskingAI brings Firebase's simplicity to AI-native app development. The platform enables the creation of GPTs-like multi-tenant applications using a wide range of LLMs from various providers. It features distinct, modular functions such as Inference, Retrieval, Assistant, and Tool, seamlessly integrated to enhance the development process.
100x Faster Clustering with Lilac Garden. A difficulty in language model training is locating a sufficiently varied dataset. It is considerably more difficult to visualize this data. This useful tool facilitates data exploration to enhance filtering and overall quality through topic modeling and quick clustering.
float8_experimental. Lower-precision model training is faster and cheaper, but it is less stable. Quantized training has been the subject of several excellent recent studies. Building on those foundations, this repository offers float8 training through readable and hackable code.
Enchanted. Enchanted is an open-source, Ollama-compatible, elegant iOS/iPad mobile app for chatting with privately hosted models such as Llama 2, Mistral, Vicuna, Starling, and more. It's essentially ChatGPT app UI that connects to your private Ollama models. You can download Enchanted from the App Store or build yourself from scratch.
Introduction to point processing. Whether you are doing medical image analysis or using Photoshop, you are using point processing.
MF-MOS: A Motion-Focused Model for Moving Object Segmentation. A new model called MF-MOS makes use of LiDAR technology to more effectively identify moving objects during autonomous driving. Using residual maps for motion capture and range pictures for semantic guiding distinguishes motion from semantic information in a unique way.
Mctx: MCTS-in-JAX. Mctx is a library with a JAX-native implementation of Monte Carlo tree search (MCTS) algorithms such as AlphaZero, MuZero, and Gumbel MuZero. For computation speed up, the implementation fully supports JIT-compilation.
FireLLaVA: the first commercially permissive OSS LLaVA model. FireLLaVA is a new open vision-language model trained on commercially permissive data, so it can be used in commercial applications. It performs similarly to the original LLaVA, though not quite as well as LLaVA 1.5.
uAgents: AI Agent Framework. uAgents is a library developed by Fetch.ai that allows for the creation of autonomous AI agents in Python. With simple and expressive decorators, you can have an agent that performs various tasks on a schedule or takes action on various events.
teknium/OpenHermes-2.5. Some of the top open models available have been trained on data from OpenHermes-2.5. The collection contains more than one million high-quality data points and is now publicly available.
OLMo: Open Language Model. A State-Of-The-Art, Truly Open LLM and Framework
BAAI/bge-m3. A flexible embedding model that performs very well in multi-functionality (dense, multi-vector, and sparse retrieval), multi-linguistic (supporting more than 100 languages), and multi-granularity (managing inputs ranging from brief phrases to documents with up to 8192 tokens) is presented by the BGE-M3 project. It makes use of a hybrid retrieval pipeline, which leverages its simultaneous embedding and sparse retrieval capabilities, to combine several techniques and re-ranking for increased accuracy and generalization.
RAGs. Using natural language, users can develop RAG pipelines from data sources with the help of the Streamlit app RAGs. All users need to do is specify the parameters and tasks they require from their RAG systems. You can query the RAG, and it will respond to inquiries about the information.
GPT Newspaper. GPT Newspaper project, an innovative autonomous agent designed to create personalized newspapers tailored to user preferences. GPT Newspaper revolutionizes the way we consume news by leveraging the power of AI to curate, write, design, and edit content based on individual tastes and interests.
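A short usage sketch for the WebDataset entry above, assuming image/label pairs have already been packed into tar shards with `.jpg` and `.cls` entries (the shard pattern below is a placeholder):

```python
# Stream samples sequentially from sharded tar files with WebDataset.
import webdataset as wds

shards = "data/shard-{000000..000009}.tar"   # placeholder shard pattern
dataset = (
    wds.WebDataset(shards)
    .decode("pil")              # decode image entries with PIL
    .to_tuple("jpg", "cls")     # yield (image, label) pairs
)

for image, label in dataset:
    print(image.size, label)
    break

# For training, the dataset can be wrapped in a torch DataLoader for multi-worker streaming.
```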

Perspectives

Link description
Many AI Safety Orgs Have Tried to Criminalize Currently-Existing Open-Source AI. Numerous teams are attempting to address the difficulties posed by the quickly developing field of artificial intelligence.
AlphaFold found thousands of possible psychedelics. Will its predictions help drug discovery? Researchers have doubted how useful the AI protein-structure tool will be in discovering medicines — now they are learning how to deploy it effectively.
Reaching carbon neutrality requires energy-efficient training of AI. Artificial intelligence (AI) models have achieved remarkable success, but their training requires a huge amount of energy.
What will robots think of us? Two recent science fiction novels humorously illustrate the importance of correct robot mental models.
What Can be Done in 59 Seconds: An Opportunity (and a Crisis). AI can already complete many tasks in under a minute, so businesses and employees will need to emphasize using AI for good rather than harm.
The American Dynamism 50: AI. This list of 50 companies, compiled by a16z, addresses some of the most important issues facing the US in the areas of manufacturing, transportation, energy, and military. They're all utilizing AI to speed up their work in one way or another. This is an excellent insight if you're interested in practical uses of artificial intelligence.

meme-of-the-week

Back to index

ML news: Week 22 - 28 January

Research

Link description
OMG-Seg: Is One Model Good Enough For All Segmentation?. OMG-Seg can handle over ten different segmentation tasks in one framework, including image-level and video-level segmentation tasks, interactive segmentation, and open-vocabulary segmentation. To our knowledge, this is the first model to unify these four directions.
Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation. BriVIS, an approach that enhances open-vocabulary Video Instance Segmentation (VIS), was created by researchers. BriVIS achieves a more precise alignment between text and video by preserving the context of object motions across video frames through the use of a method known as Brownian Bridges.
Encoder-minimal and Decoder-minimal Framework for Remote Sensing Image Dehazing. A novel framework called RSHazeNet was created to eliminate haze from remote-sensing photos. The tool makes use of cutting-edge modules to enhance image comprehension and detail preservation, improving clarity and analytical use.
Supervised Fine-tuning in turn Improves Visual Foundation Models. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing such as instruction tuning, we explore the potential of fine-grained SFT in enhancing the generation of vision foundation models after their pretraining. Thus a two-stage method ViSFT (Vision SFT) is proposed to unleash the fine-grained knowledge of vision foundation models.
Group Anything with Radiance Fields. Hierarchical grouping in 3D by training a scale-conditioned affinity field from multi-level masks
DiverseEvol. We introduce DiverseEvol, an efficient instruction-tuning method that allows the model itself to iteratively sample training subsets to improve its own performance, without any external supervision from humans or more advanced LLMs.
Unleashing the Power of Large-Scale Unlabeled Data. Depth Anything is trained on 1.5M labeled images and 62M+ unlabeled images jointly, providing the most capable Monocular Depth Estimation (MDE)
Prompt Highlighter: Interactive Control for Multi-Modal LLMs. By enabling users to highlight specific portions of prompts, researchers present the "Prompt Highlighter," a technique that transforms text production in multi-modal language models.
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer. A novel generative model called MM-Interleaved is very good at handling and producing interleaved image-text data.
Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation. A different preference optimization method is now being used in machine translation. For this job, it is more data-efficient than DPO. Crucially, the goal prevented the model from suggesting correct but inadequate translations, allowing it to perform competitively on WMT.
WARM: On the Benefits of Weight Averaged Reward Models. In RLHF, reward models are used to approximate human preferences, but the model being aligned frequently "hacks the reward" and degrades. WARM averages the weights of several reward models that are linearly mode-connected; the resulting aligned model is preferred 79% of the time over one aligned with a single reward model. Even if model merging amounts to little more than regularization, it has proven to be an effective stage in the general language model training pipeline and performs surprisingly well in general models.
Benchmarking Large Multimodal Models against Common Corruptions. This technical study introduces MMCBench, a new benchmark created to evaluate large multimodal models' (LMMs) consistency and dependability on a variety of tasks, including text-to-image and speech-to-text. It covers more than 100 well-known models with the goal of helping readers better comprehend how various AI systems function in practical situations.
Predicting multiple conformations via sequence clustering and AlphaFold2. AlphaFold2 has revolutionized structural biology by accurately predicting single structures of proteins. However, a protein’s biological function often depends on multiple conformational substates, and disease-causing point mutations often cause population changes within these substates
HEDNet: A Hierarchical Encoder-Decoder Network for 3D Object Detection in Point Clouds. HEDNet is a novel encoder-decoder network that aims to improve autonomous cars' ability to recognize 3D objects by tackling the problem of sparse point distribution in 3D situations.
Prompt Pool based Class-Incremental Continual Learning for Dialog State Tracking. This project proposes a novel prompt pool approach to recording the status of dialogs that do not need task IDs during testing, allowing it to adjust to changing user requirements.
DittoGym: Learning to Control Soft Shape-Shifting Robots. A major problem with soft robotics is the wide control space. In this study, a simulator with a variety of tasks for handling soft objects that resemble "dittos" is introduced. It includes several powerful baselines, visualization, and utilities.
SGTR+: End-to-end Scene Graph Generation with Transformer. A novel technique that researchers have created speeds up and improves the efficiency of the scene graph creation process. Their transformer-based approach aims to enhance the model's comprehension and interconnection of many parts in a picture, resulting in enhanced performance on complex tasks.
DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data. Based on how similar two photographs are to one another, image similarity systems provide a score. This study builds upon earlier approaches, mainly by using artificial intelligence and human preferences.
SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation. A model called SegMamba is intended for 3D medical image segmentation. In comparison to the Transformer architecture, it provides a more effective option.
SFC: Shared Feature Calibration in Weakly Supervised Semantic Segmentation. To improve semantic segmentation, researchers have created the Shared Feature Calibration (SFC) technique.

News

Link description
OpenAI’s Sam Altman Is Raising Money to Set Up AI Chip Factories. A new report reveals that OpenAI CEO Sam Altman is gearing up to raise money to set up his own network of AI chip factories.
Google DeepMind scientists in talks to leave and form AI startup. A pair of scientists at Google's artificial intelligence subsidiary DeepMind is in talks with investors to form an AI startup in Paris, Bloomberg News reported on Friday, citing people familiar with the conversations.
The AI phones are coming. We’re tired of tapping through apps on our phones all day. Can Samsung show us an AI tool to save us?
How Microsoft found a potential new battery material using AI. Advances in AI and high-performance computing are changing the way scientists look for new battery materials.
Google will pitch Bard Advanced as providing ‘complex, better responses’. At the start of December, Google said Gemini Ultra would launch in early 2024 and be available in “Bard Advanced.” When it launches, Google will position Bard Advanced as providing “complex, better responses.”
Stability AI unveils smaller, more efficient 1.6B language model as part of ongoing innovation. Stability AI, the vendor that is perhaps best known for its stable diffusion text to image generative AI technology, today released one of its smallest models yet, with the debut of Stable LM 2 1.6B.
Tesla finally releases FSD v12, its last hope for self-driving. Tesla has finally started releasing its FSD Beta v12 update to customers, which is sort of its last hope to deliver on its self-driving promises.
Code LoRA From Scratch. LoRA, which stands for Low-Rank Adaptation, is a popular technique for fine-tuning LLMs more efficiently. Instead of adjusting all the parameters of a deep neural network, LoRA updates only a small set of low-rank matrices. This Studio explains how LoRA works by coding it from scratch, an excellent exercise for looking under the hood of an algorithm; a minimal LoRA layer sketch appears after this list.
Microsoft’s Nadella Wants Stability at OpenAI, Not Control. In the midst of regulatory reviews in the EU and the UK, Microsoft CEO Satya Nadella is happy with the current condition of Microsoft's cooperation with OpenAI, emphasizing stability above control. He highlights both Microsoft's substantial funding in OpenAI and their own autonomous AI research.
ElevenLabs Releases New Voice AI Products and Raises $80M Series B. The funding will strengthen its position in voice AI research and product development.
Google Chrome gains AI features, including a writing helper, theme creator, and tab organizer. Google’s Chrome web browser is getting an infusion of AI technology in the latest release. The company announced today it’s soon adding a trio of new AI-powered features to Chrome for Mac and Windows, including a way to smartly organize your tabs, customize your theme, and get help when writing things on the web — like forum posts, online reviews, and more.
Anthropic researchers find that AI models can be trained to deceive. Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems — and terrifyingly, they’re exceptionally good at it.
Google shows off Lumiere, a space-time diffusion model for realistic AI videos. Lumiere is a space-time diffusion model proposed by researchers from Google, the Weizmann Institute of Science, and Tel Aviv University for realistic video generation.
Adept Fuyu-Heavy: A new multimodal model. Adept Fuyu-Heavy is a new multimodal model designed specifically for digital agents. In particular, Fuyu-Heavy scores higher on the MMMU benchmark than even Gemini Pro.
Report: Apple Making ‘Significant’ Push to Bring AI to iPhones. Apple is reportedly making a major push to bring artificial intelligence (AI) to the iPhone.
Hugging Face and Google partner for open AI collaboration. Today, we are thrilled to announce our strategic partnership with Google Cloud to democratize good machine learning. We will collaborate with Google across open science, open source, cloud, and hardware to enable companies to build their own AI with the latest open models from Hugging Face and the latest cloud and hardware features from Google Cloud.
OpenAI's New embedding models and API updates. We are launching a new generation of embedding models, new GPT-4 Turbo and moderation models, new API usage management tools, and soon, lower pricing on GPT-3.5 Turbo.
Announcing Qdrant's $28M Series A Funding Round. The firm behind the vector database, which powers some of ChatGPT and X's "More like this," has secured funds to enhance its corporate solutions and extend its Rust-based vector store.
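To make the LoRA description in the "Code LoRA From Scratch" entry above concrete, here is a minimal sketch of a LoRA-wrapped linear layer (my own simplification of the general idea, not the Studio's code):

```python
# A frozen linear layer plus a trainable low-rank update B @ A, scaled by alpha / r.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # freeze the pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the trainable low-rank correction.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)   # only the two small LoRA matrices are trainable
```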

Resources

Link description
nanotron. The objective of this library is to provide easily distributed primitives in order to train a variety of models efficiently using 3D parallelism.
DataTrove. DataTrove is a library to process, filter, and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.
CaptionIMG. A simple Python program for manually captioning your images (or other file types) so you can use them for AI training. The author uses it for DreamBooth training (Stable Diffusion).
AI Toolkit. AI Toolkit is a header-only C++ library that provides tools for building the brain of your game's NPCs.
Face Mixer Diffusion. This piece demonstrates how to clone faces in photos using diffusion. Although there are other methods for creating deep fakes, diffusion is intriguing since it allows for the necessary inpainting of other image elements.
Self-Rewarding Language Model. Implementation of the training framework proposed in the Self-Rewarding Language Model, from MetaAI
snorkelai/Snorkel-Mistral-PairRM-DPO. A powerful new Mistral tune that builds a DPO-compatible dataset by cleverly using weak supervision and synthetic data. The described procedure can be iterated many times for a broad range of enterprise use cases.
nanoColBERT. ColBERT is a powerful late-interaction model that can perform both retrieval and reranking.
RPG-DiffusionMaster. RPG is a powerful training-free paradigm that can utilize proprietary MLLMs (e.g., GPT-4, Gemini-Pro) or open-source local MLLMs (e.g., miniGPT-4) as the prompt reception and region planner with our complementary regional diffusion to achieve SOTA text-to-image generation and editing. Our framework is very flexible and can generalize to arbitrary MLLM architectures and diffusion backbones.
Matrix Multiplication: Optimizing the code from 6 hours to 1 sec. A brief read about matrix multiplication optimizations particular to certain hardware and a generic procedure to accelerate AI programs.
SyncTalk: Mastering Realism in Talking Head Videos. A significant advancement in realistic talking head videos is SyncTalk. It solves earlier problems with lip motions, expressions, and facial identity synchronization.
Hallucination Leaderboard. Public LLM leaderboard computed using Vectara's Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document. We plan to update this regularly as our model and the LLMs get updated over time.
Embedding English Wikipedia in under 15 minutes. Modal provides a serverless solution for organizations grappling with scaling workloads. Modal’s technology enables rapid scaling across many GPUs, which we can use to run large-scale workloads, such as generating embeddings for a massive text dataset, at lightning speed.
Concrete Steps to Get Started in Transformer Mechanistic Interpretability. Neel Nanda is one of the founders of mechanistic interpretability (MI), and this is his guide to getting started in the field. It contains two hundred concrete open-ended questions. MI is the study of a language model's learned weights and activations, down to examining individual neurons. Although the field is still young, it is accessible because it does not demand a lot of compute.
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation. SDD contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in the evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation, and music-language retrieval.
DiffMoog: A Modular Differentiable Commercial-like Synthesizer. This repo contains the implementation of DiffMoog, a differential, subtractive, modular synthesizer, incorporating standard architecture and sound modules commonly found in commercial synthesizers.
TensorDict. TensorDict is a dictionary-like class that inherits properties from tensors, such as indexing, shape operations, casting to device, and point-to-point communication in distributed settings. Its main purpose is to make code bases more readable and modular by abstracting away tailored operations; a brief usage sketch appears after this list.
Evaluation Metrics for LLM Applications In Production. How to measure the performance of LLM applications without ground truth data.
Asynchronous Local-SGD Training for Language Modeling. This repository contains a Colab notebook that presents a minimal toy example replicating the observed optimization challenge in asynchronous Local-SGD. The task is to perform classification on a mixture of mixtures of Gaussian data.
SpeechGPT: Speech Large Language Models. A novel speech synthesis model called SpeechGPT-Gen effectively manages the intricacies of language and voice traits.
LLM Steer. A Python module to steer LLM responses towards a certain topic or subject and to enhance capabilities (e.g., making the model answer tricky logical puzzles correctly more often). A practical tool for activation engineering that adds steering vectors to different layers of a large language model; it is meant to be used with the Transformers library. A generic sketch of the underlying idea appears after this list.
RoMa: a lightweight library to deal with 3D rotations in PyTorch. RoMa (which stands for Rotation Manipulation) provides differentiable mappings between 3D rotation representations, mappings from Euclidean to rotation space, and various utilities related to rotations. It is implemented in PyTorch and aims to be an easy-to-use and reasonably efficient toolbox for machine learning and gradient-based optimization.
AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agent. AgentBoard is a benchmark designed for multi-turn LLM agents, complemented by an analytical evaluation board for detailed model assessment beyond final success rates. The repository reports the main performance of different LLMs across various environments; check the Results section for details.
makeMoE: Implement a Sparse Mixture of Experts Language Model from Scratch. This blog walks through implementing a sparse mixture-of-experts language model from scratch. It is inspired by, and largely based on, Andrej Karpathy's project 'makemore' and borrows a number of reusable components from that implementation; a compact MoE-layer sketch appears after this list.
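A brief usage sketch for the TensorDict entry above (a minimal example based on the library's documented interface; double-check the current API):

```python
# Batch several tensors behind one dictionary that indexes, slices, and moves as a unit.
import torch
from tensordict import TensorDict

td = TensorDict(
    {"obs": torch.randn(4, 3, 84, 84), "reward": torch.zeros(4, 1)},
    batch_size=[4],
)

sample = td[0]          # index every entry at once
half = td[:2]           # slice every entry at once
td_cpu = td.to("cpu")   # move or cast the whole structure
print(sample["obs"].shape, half.batch_size)
```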
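The LLM Steer entry above relies on activation engineering; the sketch below illustrates the general idea with a plain forward hook on GPT-2. This is not llm_steer's API, and the random steering vector and layer index are placeholders; in practice the vector is derived from contrasting activations for the concept you want to amplify.

```python
# Add a fixed "steering vector" to the hidden states leaving one transformer block.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                    # small model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

steer = torch.randn(model.config.n_embd) * 0.5   # placeholder direction

def add_steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steer                      # broadcast over batch and positions
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[6].register_forward_hook(add_steer)  # steer a middle block
ids = tok("The weather today is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()                                  # detach the hook when done
```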
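Finally, a compact sketch of the sparse mixture-of-experts block that the makeMoE entry above builds up (my own simplified version, not the blog's code): a router picks the top-k experts per token and mixes their outputs with the normalized router weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                           # x: (tokens, d_model)
        logits = self.router(x)                     # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)  # route each token to its top-k experts
        weights = F.softmax(weights, dim=-1)        # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

moe = SparseMoE()
print(moe(torch.randn(10, 64)).shape)               # torch.Size([10, 64])
```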

Perspectives

Link description
Text-to-Video: The Task, Challenges and the Current State. Text-to-video is next in line in the long list of incredible advances in generative models. How do these models work, how do they differ from text-to-image models, and what kind of performance can we expect from them?
My AI Timelines Have Sped Up (Again). In light of developments in scaling up models, the author updated their forecasts for the AI timetable. As of right now, they predict that artificial general intelligence will be achieved with a 10% probability by 2028 and a 50% likelihood by 2045. The efficacy of massive language models and the knowledge that numerous intelligent capabilities may arise at scale are credited with these changes.
Should The Future Be Human?. Elon Musk and Larry Page have a deep disagreement over the possible risks associated with artificial intelligence. Page has called Musk a "speciesist" for favoring humans over digital life forms, which has caused a gap in their friendship. This demonstrates the necessity for careful and deliberate development of AI technology and reflects the larger discussion on the influence of AI, which includes worries about consciousness, individuation, art, science, philosophy, and the potential for mergers between humans and AI.
Computers make mistakes and AI will make things worse — the law must recognize that. A tragic scandal at the UK Post Office highlights the need for legal change, especially as organizations embrace artificial intelligence to enhance decision-making.
Google AI has better bedside manner than human doctors — and makes better diagnoses. Researchers say their artificial intelligence system could help to democratize medicine.
Tech developers must respect equitable AI access. We argue for a legal framework to ensure equitable access to artificial intelligence (AI) tools, such as ChatGPT, to avoid limiting their benefits to a privileged few
Seven technologies to watch in 2024. Advances in artificial intelligence are at the heart of many of this year’s most exciting areas of technological innovation
If AI Were Conscious, How Would We Know?. When discussing AI consciousness, references to Searle's Chinese Room Thought Experiment and the Turing Test are frequently made. The former examines whether an AI's conduct can be distinguished from that of a human, while the latter contends that exterior behavior is insufficient to demonstrate consciousness. Given that our knowledge of consciousness in AI is mostly derived from functionalist theories and human experiences, this argument emphasizes how difficult it is to define and identify consciousness in AI.
AI today and trends for an AI future. A survey of experts on: How are early adopters using AI today? Where is AI going in 2024?

meme-of-the-week

Back to index

ML news: Week 15 - 21 January

Research

Link description
I am a Strange Dataset: Metalinguistic Tests for Language Models. An example of a self-referential challenge phrase is "the last word in this sentence is." This kind of phrase is extremely difficult for language models to handle. This work presents a dataset and some assessments aimed at enhancing the metalinguistic capabilities of language models.
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models. PixArt has been a complementary line of work to the well-known Stable Diffusion family of image generation models. This work adds ControlNet-style conditioning and latent consistency models to improve control and speed up generation.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. Anthropic has published some intriguing research in which a sleeper phrase designed to induce a particular response is used to deliberately poison a language model. It discovered that this kind of model could not be "aligned" with the robust system that it utilized for its production models. In other words, once the model was poisoned, negative behavior could not be undone with the resources available today.
PALP: Prompt Aligned Personalization of Text-to-Image Models. Right now, Dreambooth is the most effective way to customize an image model. Prompt alignment is composable and significantly increases adherence to the prompt.
INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning. We introduce a novel instruction tuning dataset, INTERS, encompassing 21 tasks across three fundamental IR categories: query understanding, document understanding, and query-document relationship understanding. The data are derived from 43 distinct datasets with manually written templates.
Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach. An efficient, ConvFormer-based approach to image super-resolution.
HiCMAE. A revolutionary self-supervised learning framework called HiCMAE was created to improve AVER or Audio-Visual Emotion Recognition. This method leverages large-scale pre-training on unlabeled audio-visual data to get over data scarcity issues.
Language Enhanced Multi-modal Grounding Model. A novel end-to-end multimodal grounding model called LEGO exhibits sophisticated comprehension and grounding skills across several modalities, including pictures, sounds, and videos.
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks. Challenging data has long been assumed to be necessary to solve challenging issues, yet this data is noisy and difficult to identify. This work demonstrates that models may be made far more capable of generating solutions to difficult situations by fine-tuning them on related but easy data. A further piece of evidence to back up fine-tuning is that it elicits information rather than imparts it.
Mutual Distillation Learning For Person Re-Identification. By merging two distinct approaches, researchers have created a revolutionary method called Mutual Distillation Learning For Person Re-identification (MDPR) that improves person re-identification.
Large language models help computer programs to evolve. A branch of computer science known as genetic programming has been given a boost with the application of large language models that are trained on the combined intuition of the world’s programmers. comment here.
Solving olympiad geometry without human demonstrations. Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning. Blog post from DeepMind.
Fast and Expressive LLM Inference with RadixAttention and SGLang. LMSYS has released two new advances for language model inference. The first, RadixAttention, is a backend technique that reuses the KV cache to raise overall tokens-per-second throughput; the second, SGLang, is an embedded domain-specific language for prompting that makes parallel prompt execution possible.
Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities. The difficulty of creating Vision Foundation Models (VFMs) especially for autonomous driving is examined in this research. It offers insights into pre-training, task adaptability, and data preparation in AI by examining more than 250 research articles, showcasing state-of-the-art methods such as 3D Gaussian Splatting and NeRF.
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models. By concentrating on video tasks, DoraemonGPT, a novel artificial intelligence system built on huge language models, advances our comprehension of dynamic real-world events. For effective spatial-temporal querying, it transforms films into a symbolic memory. It also includes specialized tools and an innovative planner for handling challenging tasks.
Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering. AlphaCodium presents a new method to improve LLMs' code creation. As evidenced by the CodeContests dataset, this multi-stage, test-based iterative procedure greatly increases the accuracy of models such as GPT-4 in tackling complicated programming tasks.
Foundations of Vector Retrieval. Almost all of the information one may want to know about the current status of the vector retrieval area is covered in this enormous document. It will take some time to go through this important resource.
Learning to Follow Object-Centric Image Editing Instructions Faithfully. This study addresses issues such as ambiguous instructions and selectively selecting regions of the image to modify, hence enhancing the quality of photographs modified with natural language instructions.

News

Link description
OpenAI changes policy to allow military applications. In an unannounced update to its usage policy, OpenAI has opened the door to military applications of its technologies.
Using AI, MIT researchers identify a new class of antibiotic candidates. These compounds can kill methicillin-resistant Staphylococcus aureus (MRSA), a bacterium that causes deadly infections.
Microsoft wants to automatically launch its Copilot AI on some Windows 11 devices. You might see Copilot start automatically opening on Windows 11 soon, but only with certain display situations.
Microsoft launches Copilot Pro for $20 per month per user. Copilot Pro gives you the latest features and best models that Microsoft AI has to offer.
How OpenAI is approaching 2024 worldwide elections. We’re working to prevent abuse, provide transparency on AI-generated content, and improve access to accurate voting information.
Sakana AI raises $30m seed. In Tokyo, Sakana.ai is constructing a state-of-the-art research facility to create foundation models that are more compact and effective. David Ha and Llion Jones, two former Google researchers who are credited with innovations including Transformers, World Models, and LoRA, formed the business. To lead this initiative and establish Tokyo as a leader in AI, it has received a $30 million seed round from Jon Chu at Khosla Ventures and Brandon Reeves at Lux Capital.
Stable Code 3B: Coding on the Edge. Stable Code 3B is a 3 billion parameter Large Language Model (LLM), allowing accurate and responsive code completion at a level on par with models such as CodeLLaMA 7b that are 2.5x larger.
OpenAI announces team to build ‘crowdsourced’ governance ideas into its models. OpenAI says it wants to implement ideas from the public about how to ensure its future AI models “align to the values of humanity.”
OpenAI must defend ChatGPT fabrications after failing to defeat libel suit. ChatGPT users may soon learn whether false outputs will be allowed to ruin lives.
Samsung’s S24 and S24 Plus put new AI smarts in a polished package. The two smaller siblings of the Galaxy S24 Ultra are very similar-looking phones to last year’s devices, but they include new AI-powered features and a promise of seven years of software and security updates.
OpenAI announces first partnership with a university. OpenAI on Thursday announced its first partnership with a higher education institution.
Mark Zuckerberg’s new goal is creating artificial general intelligence. And he wants Meta to open source it. Eventually. Maybe.
8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2. 8bit in neural networks is not a new concept. However, shipping 8-bit models in the real world on a large scale is challenging.
Microsoft makes its AI-powered reading tutor free. Microsoft today made Reading Coach, its AI-powered tool that provides learners with personalized reading practice, available at no cost to anyone with a Microsoft account.
Ousted Twitter CEO Parag Agrawal is back with an AI startup; gets $30 mn in funding led by Khosla Ventures. Agrawal is back with an artificial intelligence (AI) startup that has already raised $30 million in funding that is led by Khosla Ventures.

Resources

Link description
Moore-AnimateAnyone. AnimateAnyone is a fantastic video control model that animates the person in the control image by using skeletal motion and an image as input. This code replicates that work in an open manner.
surya. Surya is a multilingual document OCR toolkit
David Attenborough narrates your life. Using a combination of GPT4-V, top-of-the-line text-to-speech, and some computer capture software, you can have someone like David Attenborough narrate everything that is happening in your life.
Create translations that follow your speech style. Meta has a new demo for seamless voice cloning and translation between languages. SeamlessExpressive is an AI model that aims to maintain expressive speech style elements in the translation
Vanna. Vanna is an MIT-licensed open-source Python RAG (Retrieval-Augmented Generation) framework for SQL generation and related functionality.
GRDBIS. Graph Relation Distillation for Efficient Biomedical Instance Segmentation
AQLM. Official PyTorch implementation for Extreme Compression of Large Language Models via Additive Quantization
RotationDrag. RotationDrag: Point-based Image Editing with Rotated Diffusion Features
AutoGGUF. GGUF is a file format that supports many quantization methods and is used to run models with llama.cpp. This notebook automates the quantization step; it may not work for every model, but it does for the majority (a sketch of running a quantized GGUF model appears after this list).
Listening with LLM. A write-up consolidating lessons on how to fine-tune Large Language Models (LLMs) to process audio, with the eventual goal of building and hosting an LLM that can describe human voices.
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding. Generating customized, stylized photos of a specific person is one of the most popular applications of generative image models. Previously this required DreamBooth or LoRA training; now, with just one picture and stacked ID embeddings, quality improves significantly while computing costs drop.
Content Consistent Super-Resolution. Improving the Stability of Diffusion Models for Content Consistent Super-Resolution
FilCo. This repository contains the code and data for the project Learning to Filter Context for Retrieval-Augmented Generation.
haiku_dpo. A dataset to help align models to write correct haikus.
sanity-checks-revisited. This repository contains the code and experiments for the paper Sanity Checks Revisited: An Exploration to Repair the Model Parameter Randomisation Test
MAGNeT. Masked Audio Generation using a Single Non-Autoregressive Transformer
Tiny Narrations. Tiny Narrations is a text-to-speech-read variant of the well-known (and compact) TinyStories dataset, synthesized with XTTS2 on the SF Compute H100 cluster.
Interconnects Tools for Multimodal Blogging! Python tools for easily turning your blog content into podcasts and YouTube videos.
ALMA: Advanced Language Model-based translator. ALMA (Advanced Language Model-based TrAnslator) is a many-to-many LLM-based translation model that adopts a new translation paradigm: it is first fine-tuned on monolingual data and then further optimized on high-quality parallel data. This two-step fine-tuning process yields strong translation performance (a minimal sketch appears after this list).
Privy. A privacy-first coding assistant.
UV-SAM: Adapting Segment Anything Model for Urban Village Identification. This work presents UV-SAM, a modified version of the Segment Anything Model (a vision foundation model) that can precisely locate urban village boundaries in satellite imagery. By integrating multiple image representations for accurate detection, UV-SAM offers an effective alternative to conventional field surveys.
ml-aim. We introduce AIM, a collection of vision models pre-trained with an autoregressive generative objective.
compose-and-conquer. Official implementation of Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis. Excels at placing objects in three-dimensional space.
Vlogger. Vlogger is a generic AI system for generating a minute-level video blog (i.e., vlog) from a user description. Unlike short clips of a few seconds, a vlog often contains a complex storyline with diverse scenes, which is challenging for most existing video generation approaches.
trapped-in-texture-bias. This is the official code release for the paper Trapped in Texture Bias? A Large-scale Comparison of Deep Instance Segmentation.
MegaDolphin-120b. MegaDolphin-2.2-120b is a transformation of Dolphin-2.2-70b
TACO (Topics in Algorithmic COde generation dataset). TACO is a dataset focused on algorithmic code generation, designed to provide a more challenging training set and evaluation benchmark for code generation models.
AlphaFold found thousands of possible psychedelics. Will its predictions help drug discovery? Researchers have doubted how useful the AI protein-structure tool will be in discovering medicines — now they are learning how to deploy it effectively.
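
Expanding on the "David Attenborough narrates your life" entry above, here is a hypothetical sketch of the capture, describe, and speak loop using the OpenAI Python SDK (v1+) with an `OPENAI_API_KEY` in the environment. The model name, voice, prompt, and 30-second interval are illustrative assumptions; the original project wires these pieces differently.

```python
# Hypothetical narration loop: grab the screen, ask a vision model for a
# nature-documentary-style description, then synthesize speech.
import base64, io, time
from PIL import ImageGrab          # screen capture (Windows/macOS; X11 on Linux)
from openai import OpenAI          # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def capture_frame_b64() -> str:
    frame = ImageGrab.grab()                       # grab the current screen
    buf = io.BytesIO()
    frame.convert("RGB").save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode()

def narrate_once() -> None:
    image_b64 = capture_frame_b64()
    chat = client.chat.completions.create(
        model="gpt-4-vision-preview",  # GPT-4V model name at the time; swap for a current vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Narrate what is happening on this screen in the "
                         "style of a nature documentary, in two sentences."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    script = chat.choices[0].message.content
    speech = client.audio.speech.create(model="tts-1", voice="onyx", input=script)
    speech.write_to_file("narration.mp3")          # play this file however you like

while True:
    narrate_once()
    time.sleep(30)  # narrate roughly every 30 seconds
```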
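The Vanna entry above describes retrieval-augmented SQL generation. The sketch below shows the general pattern (index schema DDL and example queries, retrieve the most relevant pieces, build a prompt for an LLM); it is not Vanna's API, and the toy character-hashing "embedding" and the stubbed-out LLM call are placeholders.

```python
# Minimal RAG-for-SQL pattern: index schema + examples, retrieve, prompt.
from dataclasses import dataclass
import numpy as np

@dataclass
class Document:
    text: str
    embedding: np.ndarray

def embed(text: str) -> np.ndarray:
    # Placeholder embedding (bag of characters); a real system would call a
    # sentence-embedding model here.
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# "Training" in the RAG sense: index schema DDL and known-good example queries.
knowledge = [
    Document(t, embed(t))
    for t in [
        "CREATE TABLE orders (id INT, customer_id INT, total DECIMAL, created_at DATE)",
        "CREATE TABLE customers (id INT, name TEXT, country TEXT)",
        "-- Example: monthly revenue\nSELECT strftime('%Y-%m', created_at), SUM(total) FROM orders GROUP BY 1",
    ]
]

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    scores = [float(q @ d.embedding) for d in knowledge]
    top = np.argsort(scores)[::-1][:k]
    return [knowledge[i].text for i in top]

def ask(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Schema and examples:\n{context}\n\nWrite SQL for: {question}"
    # A real framework would send this prompt to an LLM and run the returned
    # SQL; here we just return the prompt to show what the model would see.
    return prompt

print(ask("Total revenue per customer country"))
```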
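On the consumption side of the AutoGGUF entry, the snippet below shows roughly how a GGUF-quantized model is loaded and run with the llama-cpp-python bindings. The model path is a placeholder for whatever file the notebook produces, and the parameter values are assumptions.

```python
# Running a GGUF-quantized model locally (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,        # context window; choose to match the model and your RAM
)

out = llm(
    "Q: Name three quantization formats supported by llama.cpp. A:",
    max_tokens=64,
    stop=["Q:"],       # stop before the model starts a new question
)
print(out["choices"][0]["text"])
```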
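The ALMA entry describes a two-step recipe: continued training on monolingual text, then fine-tuning on a small amount of high-quality parallel data. Below is a minimal sketch of that recipe with Hugging Face transformers; the base model, toy data, and prompt format are assumptions, and this is not ALMA's released training code.

```python
# Two-step fine-tuning sketch: stage 1 continues causal-LM training on
# monolingual text, stage 2 fine-tunes on parallel pairs formatted as prompts.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"  # assumption: any decoder-only LM
tok = AutoTokenizer.from_pretrained(base_model)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

def tokenize(batch):
    enc = tok(batch["text"], truncation=True, padding="max_length", max_length=512)
    # A real run would mask the padded label positions with -100.
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]
    return enc

def train_stage(texts, output_dir):
    ds = Dataset.from_dict({"text": texts}).map(
        tokenize, batched=True, remove_columns=["text"])
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=1, logging_steps=10)
    Trainer(model=model, args=args, train_dataset=ds).train()

# Stage 1: monolingual text in the languages you want to translate into.
train_stage(["Ein Beispielsatz auf Deutsch.", "Une phrase d'exemple en français."],
            "stage1_monolingual")

# Stage 2: a small set of high-quality parallel pairs, formatted as prompts.
pairs = [("Good morning.", "Guten Morgen."), ("Thank you.", "Merci.")]
train_stage([f"Translate English to the target language:\n{src}\n{tgt}"
             for src, tgt in pairs], "stage2_parallel")
```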

Perspectives

Link description
The Case for Cyborgs. Augmenting human intelligence beyond AI will take us much further than creating something new
Past, Present, and Future of AI with Vijay Pande. A forty-minute discussion about AI, with an outlook on the future.
AI Will Transform the Global Economy. Let’s Make Sure It Benefits Humanity. AI will affect almost 40 percent of jobs around the world, replacing some and complementing others. We need a careful balance of policies to tap its potential
AI is Not the Solution to All Our Educational Challenges. Empowering Students with an Immersive Mindset for Navigating an Unpredictable World
The Lazy Tyranny of the Wait Calculation. The "wait calculation" idea suggests holding off on certain undertakings, such as launching a space mission to Barnard's Star, until technology has advanced enough to save considerable time and effort. This strategy must be weighed against the unpredictability of technological progress and the learning you forgo by waiting.
What counts as plagiarism? Harvard president’s resignation sparks debate. Allegations against Claudine Gay have left researchers arguing over academic standards and practices.
‘Set it and forget it’: automated lab uses AI and robotics to improve proteins. A self-driving lab system spent half a year engineering enzymes to work at higher temperatures.
The consciousness wars: can scientists ever agree on how the mind works? There are dozens of theories of how the brain produces conscious experience, and a new type of study is testing some of them head-to-head.
Centres of Excellence in AI for global health equity — a strategic vision for LMICs. We propose that Centres of Excellence should be established in low- and middle-income countries (LMICs) to enable artificial intelligence (AI) to deliver equity in health care.
Does generative AI help academics to do more or less?. UK academics use generative artificial intelligence (AI) in their work mainly because it improves task efficiency, saves time and labor, and boosts competitiveness
Evaluations Are All We Need. This essay examines the difficulties in assessing LLMs and contrasts them with assessments of employees conducted by humans. It addresses the challenge of gauging the practicality and intelligence of LLMs, emphasizing the shortcomings of existing assessment techniques and the demand for more efficient ones.
The Road To Honest AI. Recent studies suggest two strategies for regulating AI honesty: identifying and modifying honesty-related vectors inside the model, and asking unrelated questions to detect lying tendencies from inconsistencies in the model's responses.

meme-of-the-week

Back to index

ML news: Week 8 - 14 January

Research

Link description
GUESS:GradUally Enriching SyntheSis for Text-Driven Human Motion Generation. GUESS is a text-to-human-motion framework that gradually abstracts intricate human poses at multiple levels, yielding a more stable and compact synthesis of motion from text.
Learning to Prompt with Text Only Supervision for Vision-Language Models. This project presents a technique to keep the generalization capabilities of CLIP-like vision-language models while adapting them for different tasks. Prompts are learned from LLM data, so labeled images are not necessary.
LLaVA-ϕ: Efficient Multi-Modal Assistant with Small Language Model. In this paper, we introduce LLaVA-ϕ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model, Phi-2, to facilitate multi-modal dialogues.
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs. We introduce V*, an LLM-guided visual search mechanism that employs the world knowledge in LLMs for efficient visual querying. When combined with an MLLM, this mechanism enhances collaborative reasoning, contextual understanding, and precise targeting of specific visual elements.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. The DeepSeek LLM was one of the best coding models available last year, coming close to GPT-3.5 on several benchmarks (despite probably being around three times larger). A technical report has been made public with details on model training, token counts, model architecture, and other topics.
Denoising Vision Transformers. Vision Transformers (ViTs) have taken over the vision community, but their embeddings occasionally exhibit grid-like artifacts, which makes practitioners reluctant to use them for downstream tasks. This study proposes a positional-embedding fix that resolves the problem and yields a 25%+ performance gain on downstream vision tasks.
FED-NeRF: Achieve High 3D Consistency and Temporal Coherence for Face Video Editing on Dynamic NeRF. Researchers combine GAN-NeRF for 3D consistency with a new stabilizer for smooth temporal coherence to build a face video editing architecture. The technique works well for video editing because it keeps viewpoints consistent and makes frame transitions smooth.
A Minimaximalist Approach to Reinforcement Learning from Human Feedback. Self-Play Preference Optimization (SPO), a less complex alignment method than conventional RLHF, has been presented by Google researchers. Using game theory, the researchers were able to develop single-player self-play dynamics that provide good performance and are resilient to noisy preferences.
Mixtral of Experts. We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts).
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation. The constraints of existing single-criterion measures have been addressed by researchers with the development of a new assessment metric for text-to-3D generative models. This sophisticated technique compares 3D objects and generates prompts using GPT-4V. It is very compatible with human tastes and provides flexibility by adjusting to different user-specified requirements.
Self-emerging Token Labeling. Researchers have made a substantial advance for Vision Transformers (ViTs) with a novel self-emerging token labeling (STL) framework that improves the robustness of Fully Attentional Network (FAN) models. In this method, a FAN token labeler is first trained to produce relevant patch token labels, and a FAN student model is then trained on them.
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning. We propose a Multi-disciplinary Collaboration (MC) framework. The framework works in five stages: (i) expert gathering: gathering experts from distinct disciplines according to the clinical question; (ii) analysis proposition: domain experts put forward their own analysis with their expertise; (iii) report summarization: compose a summarized report on the basis of a previous series of analyses; (iv) collaborative consultation: engage the experts in discussions over the summarized report. The report will be revised iteratively until an agreement from all the experts is reached; (v) decision making: derive a final decision from the unanimous report.
DiffBody: Diffusion-based Pose and Shape Editing of Human Images. This study presents a one-shot approach to human image editing that allows for substantial body form and position modifications without compromising the subject's identification.
LLaMA Beyond English: An Empirical Study on Language Capability Transfer. Our evaluation results demonstrate that comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality.
Masked Audio Generation using a Single Non-Autoregressive Transformer. Most audio generation methods produce sound with diffusion or an autoregressive model. This study relies on neither multiple stages nor autoregressive decoding; rather, it employs a masked language model on top of audio tokens.
TechGPT-2.0: A large language model project to solve the task of knowledge graph construction. TechGPT-2.0 improves on big language models for particular applications, such as building knowledge graphs. With its emphasis on relationship triple extra