Learn how multimodal AI merges text, image, and audio for smarter models
Updated Jan 21, 2025 - Jupyter Notebook
Neocortex Unity SDK for Smart NPCs and Virtual Assistants
AI-powered tool to turn long videos into short, viral-ready clips. Combines transcription, speaker diarization, scene detection & 9:16 resizing — perfect for creators & smart automation.
Enterprise-ready solution leveraging multimodal Generative AI (Gen AI) to enhance existing or new applications beyond text—implementing RAG, image classification, video analysis, and advanced image embeddings.
#3 Winner of Best Use of Zoom API at Stanford TreeHacks 2024! An AI-powered meeting assistant that captures video, audio and textual context from Zoom calls using multimodal RAG.
VLDBench: A large-scale benchmark for evaluating Vision-Language Models (VLMs) and Large Language Models (LLMs) on multimodal disinformation detection.
🔥 The first survey on bridging VLMs and synthetic data: 125 papers read and the full survey written in just 10 days.
A hands-on collection of experimental AI mini-projects exploring large language models, multimodal reasoning, retrieval-augmented generation (RAG), reinforcement learning, and real-world applications in finance, eKYC, and voice interfaces.
A demo multimodal AI chat application built with Streamlit and Google's Gemini model. Features include: secure Google OAuth, persistent data storage with Cloud SQL (PostgreSQL), and intelligent function calling. Includes a persona-based newsletter engine to deliver personalized insights.
Gallery showcasing AI-generated images and videos created using the Nova model
This repository contains code for fine-tuning Google's PaliGemma vision-language model on the Flickr8k dataset for image captioning tasks
Generative AI (Gen AI) is a branch of artificial intelligence that creates new content such as text, images, audio, or code using models like GPT or Gemini. It powers applications like AI chatbots, image generation tools, and creative assistants across various industries.
Lab website
AI F&B Service & Menu Training Assistant powered by Gemini & Google Cloud
Apsara 2.5: Evolution from Langchain to Google Gemini API with multimodal capabilities, URL context analysis, and integrated tools for chat, voice, and visual interactions.
This course teaches you to integrate text, images, and videos into applications using Gemini's state-of-the-art multimodal models. Learn advanced prompting techniques, cross-modal reasoning, and how to extend Gemini's capabilities with real-time data and API integration.
Multi-modal AI system for diagnosing respiratory diseases using Vision Transformers and BERT.
AI-based product condition detection using BLIP-2 + FastAPI + Phi-4 (Ollama)
🤖🤖 Gemini-Powered AI Chatbot 🤖🤖 This is a Streamlit-based AI chatbot powered by Google Gemini models (1.5 Pro & 1.5 Flash). The chatbot supports both text and image input, making it capable of handling multimodal queries. It's perfect for experimenting with Google's generative AI capabilities through a clean, interactive web interface.