Lists (32)
Alfred
AnomalyDetection
Avatar
👨🏻💻CodingTools
📷CV
🕸DeepLearning
Diffusion
🌏FQ
🤖GPT
Graph
🧑🏫 Guide Book
🦍Large Models
📋List
LLM Agents
LLM Apps
🗃️LLM Memory
LLM Tools
💻Mac
🤖MachineLearning
🛠MiscTools
📝NLP
OCR
📖PaperCode
📱 Phone
🔬ResearchTool
📶Router
🔭ScientificTools
🗄Server
🗣️Speech
📚Study
🖥Windows
Zotero
Starred repositories
PDF craft can convert PDF files into various other formats. This project will focus on processing PDF files of scanned books. The project has just started.
Paddle Multimodal Integration and eXploration, supporting mainstream multi-modal tasks, including end-to-end large-scale multi-modal pretrain models and diffusion model toolbox. Equipped with high …
Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
Toolkit for linearizing PDFs for LLM datasets/training
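Bailing (百聆) is a GPT-4o-style voice conversation bot built on ASR + LLM + TTS. It integrates strong large models such as DeepSeek R1, reaches latency as low as 800 ms, runs on low-spec machines such as a Mac, and supports interruption.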
Convert any PDF into a podcast episode!
Python tool for converting files and office documents to Markdown.
OCR & Document Extraction using vision models
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audi…
The most powerful and modular diffusion model GUI, API and backend with a graph/nodes interface.
Multilingual large voice generation model with full-stack support for inference, training, and deployment.
ChatTTS 2K Speaker Stability Score 🥇 & Categorized by Gender and Age 👧 & Online Audio Preview 🔈
🍦 Speech-AI-Forge is a project developed around TTS generation model, implementing an API Server and a Gradio-based WebUI.
Finetune Llama 3.3, DeepSeek-R1, Gemma 3 & Reasoning LLMs 2x faster with 70% less memory! 🦥
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
[CVPR 2023] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
[CVPR 2025] EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
Get your documents ready for gen AI
A generative speech model for daily dialogue.
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
Detect and extract tables to markdown and csv
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
Label Studio is a multi-type data labeling and annotation tool with standardized output format
Unofficial code for augment-XY-CUT in XYLayoutLM
A faster LayoutReader model based on LayoutLMv3 that sorts OCR bboxes into reading order.
Use LLMs to dig out what you care about from massive amounts of information and a variety of sources daily.
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
OCR, layout analysis, reading order, table recognition in 90+ languages