## [OmniParse](https://github.com/adithya-s-k/omniparse)
Seamlessly ingest any data and get structured, actionable output.

![OmniParse](https://raw.githubusercontent.com/adithya-s-k/omniparse/main/docs/assets/hero_image.png)
[![GitHub Stars](https://img.shields.io/github/stars/adithya-s-k/omniparse?style=social)](https://github.com/adithya-s-k/omniparse/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/adithya-s-k/omniparse?style=social)](https://github.com/adithya-s-k/omniparse/network/members)
[![GitHub Issues](https://img.shields.io/github/issues/adithya-s-k/omniparse)](https://github.com/adithya-s-k/omniparse/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/adithya-s-k/omniparse)](https://github.com/adithya-s-k/omniparse/pulls)
[![License](https://img.shields.io/github/license/adithya-s-k/omniparse)](https://github.com/adithya-s-k/omniparse/blob/main/LICENSE)

| Original PDF                                                                                                                                                                               | OmniParse-API                                                                                                                                                                           | PyPDF                                                                                                                                                               |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [![Original PDF](https://github.com/adithya-s-k/marker-api/raw/master/data/images/original\_pdf.png)](https://github.com/adithya-s-k/marker-api/blob/master/data/images/original\_pdf.png) | [![OmniParse-API](https://github.com/adithya-s-k/marker-api/raw/master/data/images/marker\_api.png)](https://github.com/adithya-s-k/marker-api/blob/master/data/images/marker\_api.png) | [![PyPDF](https://github.com/adithya-s-k/marker-api/raw/master/data/images/pypdf.png)](https://github.com/adithya-s-k/marker-api/blob/master/data/images/pypdf.png) |

## Features
✅ Completely local, no external APIs  
✅ Supports 10+ file types  
✅ Convert documents, multimedia, and web pages to high-quality structured markdown  
✅ Table extraction, image extraction/captioning, audio/video transcription, web page crawling  
✅ Easily deployable using Docker and SkyPilot  
✅ Colab friendly  

### Problem Statement:
It's challenging to process data as it comes in different shapes and sizes. OmniParse aims to be an ingestion/parsing platform where you can ingest any type of data, such as documents, images, audio, video, and web content, and get the most structured and actionable output that is GenAI (LLM) friendly.

## Coming Soon
⭐ Dynamic chunking and structured data extraction based on a specified schema  
🛠️ One magic API: just feed in your file, prompt for what you want, and we will take care of the rest  
🔧 Dynamic model selection and support for external APIs  
📄 Batch processing for handling multiple files at once  
🦙 New open-source model to replace Surya OCR and Marker  

Final goal: replace all the different models currently in use with a single multimodal model that can parse any type of data and return the data you need.

📄 - [Documentation](https://docs.cognitivelab.in/) \
Created by [Adithya](https://x.com/adithya_s_k).

In [None]:
## Clone the repository

!git clone https://github.com/adithya-s-k/omniparse.git
%cd omniparse
%pwd

In [None]:
## Install dependencies

%pip install -e .

In [None]:
# Update and install necessary packages
!apt-get update && apt-get install -y --no-install-recommends \
    wget \
    curl \
    unzip \
    git \
    libgl1 \
    libglib2.0-0 \
    gnupg2 \
    ca-certificates \
    apt-transport-https \
    software-properties-common \
    libreoffice \
    ffmpeg \
    git-lfs \
    xvfb \
    && ln -sf /usr/bin/python3 /usr/bin/python \
    && curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash \
    && wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && echo "deb http://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google-chrome.list \
    && apt-get update \
    && apt-get install -y python3-packaging \
    && apt-get install -y --no-install-recommends google-chrome-stable \
    && rm -rf /var/lib/apt/lists/*

# Download and install ChromeDriver
!CHROMEDRIVER_VERSION=$(curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE) && \
    wget -N https://chromedriver.storage.googleapis.com/$CHROMEDRIVER_VERSION/chromedriver_linux64.zip -P /tmp && \
    unzip /tmp/chromedriver_linux64.zip -d /tmp && \
    mv /tmp/chromedriver /usr/local/bin/chromedriver && \
    chmod +x /usr/local/bin/chromedriver && \
    rm /tmp/chromedriver_linux64.zip

# Set environment variables
import os
os.environ['CHROME_BIN'] = '/usr/bin/google-chrome'
os.environ['CHROMEDRIVER'] = '/usr/local/bin/chromedriver'
os.environ['DISPLAY'] = ':99'
os.environ['DBUS_SESSION_BUS_ADDRESS'] = '/dev/null'
os.environ['PYTHONUNBUFFERED'] = '1'

print("✅ Set up complete")
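
Before starting the server, it can help to confirm that the Chrome and ChromeDriver binaries really exist at the paths exported above. The following is a minimal sketch, not part of OmniParse; `check_executables` is a hypothetical helper name, and the fallback paths simply repeat the values set in this cell.

```python
import os


def check_executables(paths):
    """Return {name: bool}, True when the path is an existing executable file."""
    return {
        name: os.path.isfile(path) and os.access(path, os.X_OK)
        for name, path in paths.items()
    }


if __name__ == "__main__":
    # Fallbacks repeat the values assigned to os.environ in the setup cell.
    expected = {
        "CHROME_BIN": os.environ.get("CHROME_BIN", "/usr/bin/google-chrome"),
        "CHROMEDRIVER": os.environ.get("CHROMEDRIVER", "/usr/local/bin/chromedriver"),
    }
    for name, ok in check_executables(expected).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If either entry is reported as missing, rerun the installation commands and check their output for errors.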

### Using Cloudflare tunnels (Recommended)
Once the server is running and the Cloudflare tunnel is up, open `/docs` on the printed URL to browse all the API endpoints.
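
The endpoints can also be listed programmatically instead of through the `/docs` page. The sketch below assumes the server publishes an OpenAPI document at `/openapi.json`, which is the default for FastAPI-style apps that serve `/docs`; this notebook does not confirm that path, so treat it as an assumption. `list_endpoints` and `fetch_spec` are hypothetical helper names. Because the server cell blocks while it runs, execute this from a separate Python session.

```python
import json
import urllib.request


def list_endpoints(spec):
    """Return sorted (METHOD, path) pairs from an OpenAPI spec dictionary."""
    http_methods = {"get", "post", "put", "patch", "delete", "head", "options"}
    pairs = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method.lower() in http_methods:
                pairs.append((method.upper(), path))
    return sorted(pairs)


def fetch_spec(base_url):
    """Download the OpenAPI document (assumed to live at /openapi.json)."""
    url = base_url.rstrip("/") + "/openapi.json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Replace with the trycloudflare.com URL printed by the tunnel cell.
    base_url = "http://127.0.0.1:8000"
    try:
        for method, path in list_endpoints(fetch_spec(base_url)):
            print(method, path)
    except OSError as exc:
        print("Could not reach the server:", exc)
```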

In [None]:
!wget https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
!dpkg -i cloudflared-linux-amd64.deb

import subprocess
import threading
import time
import socket
import urllib.request

def iframe_thread(port):
  # Poll until the OmniParse server accepts connections on the given port.
  while True:
      time.sleep(0.5)
      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      result = sock.connect_ex(('127.0.0.1', port))
      sock.close()  # close on every iteration, including the successful one
      if result == 0:
        break
  print("\nOmniParse API finished loading, trying to launch cloudflared (if it gets stuck here cloudflared is having issues)\n")

  # Start the tunnel and scan its stderr log for the public URL.
  p = subprocess.Popen(["cloudflared", "tunnel", "--url", "http://127.0.0.1:{}".format(port)], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  for line in p.stderr:
    l = line.decode()
    if "trycloudflare.com " in l:
      print("This is the URL to access OmniParse API:", l[l.find("http"):], end='')
    #print(l, end='')


threading.Thread(target=iframe_thread, daemon=True, args=(8000,)).start()

!python server.py --host 127.0.0.1 --port 8000 --documents --media --web


### Forward using localtunnel
Use this only if the Cloudflare tunnel is not working properly.

In [None]:
!npm install -g localtunnel

import subprocess
import threading
import time
import socket
import urllib.request

def iframe_thread(port):
  # Poll until the OmniParse server accepts connections on the given port.
  while True:
      time.sleep(0.5)
      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      result = sock.connect_ex(('127.0.0.1', port))
      sock.close()  # close on every iteration, including the successful one
      if result == 0:
        break
  print("\nOmniParse finished loading, trying to launch localtunnel (if it gets stuck here localtunnel is having issues)\n")

  # localtunnel asks visitors for this machine's public IP as a password.
  print("The password/endpoint IP for localtunnel is:", urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip("\n"))
  p = subprocess.Popen(["lt", "--port", "{}".format(port)], stdout=subprocess.PIPE)
  for line in p.stdout:
    print(line.decode(), end='')


threading.Thread(target=iframe_thread, daemon=True, args=(8000,)).start()

!python server.py --host 127.0.0.1 --port 8000 --documents --media --web
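
Both tunnel cells poll port 8000 in an unbounded loop, so a server that fails to start leaves the helper thread waiting forever. A minimal sketch of a bounded variant follows; `wait_for_port` and its `timeout` and `interval` parameters are hypothetical names, not part of OmniParse.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=0.5):
    """Poll until host:port accepts TCP connections; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # The context manager closes the socket after every attempt.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(interval)
            if sock.connect_ex((host, port)) == 0:
                return True
        time.sleep(interval)
    return False
```

Inside `iframe_thread`, the `while True` loop could then be replaced by a check such as `if not wait_for_port('127.0.0.1', port): return`, so the tunnel is only launched once the server is actually reachable.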