# Build, host, and serve a 100% local OCR Vision AI using open source technologies

## Objective
Build a 100% local OCR Vision AI from open source code, with a fully automated workflow.

## What you'll learn
Hands-on experience building a GenAI RAG-based app that runs either 100% locally or against hosted APIs, using open source tools or API services.

## Tools

	Programming: Python 3.12+
	LLM: Gemini, Llama 3.2, OpenAI ChatGPT, or Anthropic
	Embeddings: mxbai-embed-large or any embedding model
	Vector DB: ChromaDB, SQLite, or any vector DB of your choice
	GUI App: Taipy, Open WebUI, or Flutter
	IDE: Jupyter Lab
	OCR: Tesseract, PyPDF, Oracle, Azure Document Services, or any API-based vision AI

## Next Steps
	Agent-based framework implementation, enterprise ready
	- LangChain | LangGraph or
	- swarms or
	- crewAI or
	- Microsoft AutoGen | Semantic Kernel


## Disclaimer
In this blog post, I demonstrate one automation use case I have been working on.

These use cases and models work best when trained on in-house data. However, training such models is a rigorous task that requires significant computing hours and resources.

Using off-the-shelf language models such as ChatGPT and Llama 3.2 is a more accessible starting point.

*These examples are not meant for production*, but they still showcase the capabilities of the language models.

## Process Flow

```mermaid
stateDiagram-v2
        direction LR
        [*] --> User_Query
        User_Query --> Conversation_AI_Agent
        Conversation_AI_Agent --> SQL_DB
        SQL_DB --> Conversation_AI_Agent
        Conversation_AI_Agent --> RAG_VectorDB
        RAG_VectorDB --> Conversation_AI_Agent
        [*] --> File_Drop
        File_Drop --> PyPDF
        File_Drop --> Tesseract
        File_Drop --> AzureDocumentService
        File_Drop --> OracleVisionAI
        File_Drop --> OtherVisionAPI
        PyPDF --> QC
        Tesseract --> QC
        AzureDocumentService --> QC
        OracleVisionAI --> QC
        OtherVisionAPI --> QC
        QC --> QC_AI_Agent
        QC_AI_Agent --> QC
        QC_AI_Agent --> RAG_VectorDB
        QC_AI_Agent --> SQL_DB
        QC_AI_Agent --> [*]
        Conversation_AI_Agent --> [*]

%% Define classes for coloring
    classDef red fill:#ff8,stroke:#333,stroke-width:2px;
    classDef green fill:#8fa,stroke:#333,stroke-width:2px;
    classDef blue fill:#8af,stroke:#333,stroke-width:2px;
    classDef orange fill:#f92,stroke:#333,stroke-width:2px;
    classDef brown fill:#e6f,stroke:#333,stroke-width:2px;
    classDef neil fill:#1ff,stroke:#333,stroke-width:2px;

    %% Apply classes to states
    class User_Query green
    class Conversation_AI_Agent orange
    class Tesseract brown
    class SQL_DB blue
    class File_Drop green
    class RAG_VectorDB blue
    class PyPDF brown
    class QC_AI_Agent orange
    class QC red
    class AzureDocumentService brown
    class OracleVisionAI brown
    class OtherVisionAPI brown
```
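
The file-drop branch of the diagram can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the function name `pick_ocr_backend` and the extension-to-backend mapping are hypothetical, and the backend labels simply reuse the state names from the diagram.

```python
from pathlib import Path

# Hypothetical mapping from file extension to an OCR state in the diagram.
BACKENDS = {
    ".pdf": "PyPDF",
    ".png": "Tesseract",
    ".jpg": "Tesseract",
    ".jpeg": "Tesseract",
}

def pick_ocr_backend(file_path, default="OtherVisionAPI"):
    """Return the name of the OCR backend that should handle a dropped file."""
    suffix = Path(file_path).suffix.lower()
    return BACKENDS.get(suffix, default)

for name in ["report.pdf", "AAPL.PNG", "scan.tiff"]:
    print(name, "->", pick_ocr_backend(name))
```

Whatever backend is chosen, its output would then go to the QC step shown in the diagram.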

# Code

## Activate AI on File Drop - Setup CRON job

In [None]:
# Create a shell script, OCR_script.sh, that checks for the file
# and runs the Python script when the file is found.
# When you save the script, the shebang line below must be the first line of the file.

#!/bin/bash

WATCHED_DIR="/path/to/watched/directory"
FILE_NAME="yourfile.txt"
PYTHON_SCRIPT="/path/to/your/OCR_Agent.py"

if [ -f "$WATCHED_DIR/$FILE_NAME" ]; then
    # Quote the variable so paths containing spaces still work
    python3 "$PYTHON_SCRIPT"
    # Optionally, move or delete the file after processing
    mkdir -p "$WATCHED_DIR/processed"
    mv "$WATCHED_DIR/$FILE_NAME" "$WATCHED_DIR/processed/"
fi


In [None]:
# Make the Shell Script Executable:
# $ chmod +x /path/to/your/OCR_script.sh

In [None]:
# Set Up the Cron Job: Open the crontab editor:
# $ crontab -e

# Add this line to the crontab; the five asterisks run the script every minute
# * * * * * /path/to/your/OCR_script.sh

In [None]:
# OCR_Agent.py
with open("/path/to/output/file.txt", "w") as f:
    f.write("File has been processed.")
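
If you prefer to keep the file check in Python rather than in a shell script, the same check-process-move step could look roughly like the sketch below. The function name `process_dropped_files` and its `handler` callback are hypothetical; cron (or any scheduler) would still be responsible for calling it periodically. The demo runs in a temporary directory so nothing on disk is touched.

```python
import shutil
import tempfile
from pathlib import Path

def process_dropped_files(watched_dir, handler, pattern="*"):
    """Run handler on each matching file, then move it into processed/."""
    watched = Path(watched_dir)
    processed = watched / "processed"
    processed.mkdir(parents=True, exist_ok=True)
    handled = []
    for path in sorted(watched.glob(pattern)):
        if not path.is_file():
            continue  # skip the processed/ folder and any other directory
        handler(path)
        shutil.move(str(path), str(processed / path.name))
        handled.append(path.name)
    return handled

# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "yourfile.txt").write_text("hello")
    first_pass = process_dropped_files(demo_dir, lambda p: print("processing", p.name))
    second_pass = process_dropped_files(demo_dir, lambda p: None)

print(first_pass, second_pass)
```

Moving each file after handling it means a second run finds nothing left to process, which is the same guarantee the `mv` in the shell script provides.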

# OCR - read file content

## Approach 1: PyPDF

In [None]:
# !pip install pypdf
# A GitHub "blob" URL returns an HTML page, so request the raw file and follow redirects with -L
# !curl -L -O https://github.com/AmitXShukla/RPA/raw/main/SampleData/The%20Ultimate%20Guide%20to%20Data%20Wrangling%20with%20Python%20-%20Rust%20Polars%20Data%20Frame.pdf

In [None]:
from pypdf import PdfReader
import ollama

reader = PdfReader("../downloads/Python - understanding functions.pdf")
number_of_pages = len(reader.pages)
# Image-only pages may yield no text, so fall back to an empty string
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:2155])

def get_completion(prompt):
    # Send the prompt to the local Ollama server and return only the generated text
    output = ollama.generate(
        model="llama3.1",
        prompt=f"""answer this question : {prompt}"""
        )
    return output["response"]  # type: ignore

completion = get_completion(
    f"""Here is a local guide: <guide>{text}</guide>    

Please do the following:
1. Summarize the abstract about Python args 
and keyword args understanding at a kindergarten reading level. (In <kindergarten_abstract> tags.)
2. Write the Methods section as a recipe from the Moosewood Cookbook. (In <moosewood_methods> tags.)
"""
)
print(completion)
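
Because the prompt asks the model to wrap each answer in tags, the reply can be split back into its parts with a small parser. This is a sketch, assuming the model actually follows the tag instructions (which is not guaranteed); `extract_tag` is a hypothetical helper and `sample_reply` is a made-up reply used only to exercise it.

```python
import re
from typing import Optional

def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the content of the first <tag>...</tag> pair, or None if absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None

# Made-up model reply, used only to exercise the parser.
sample_reply = (
    "<kindergarten_abstract>Args are like a bag of toys.</kindergarten_abstract>\n"
    "<moosewood_methods>\nStir the arguments gently.\n</moosewood_methods>"
)
print(extract_tag(sample_reply, "kindergarten_abstract"))
print(extract_tag(sample_reply, "moosewood_methods"))
```

In the notebook you would pass `completion` instead of `sample_reply`; a `None` result signals that the model did not produce the requested tag.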

## Approach 2: Tesseract

To read text from images with Tesseract OCR in Python, we can use the pytesseract library, a Python wrapper for the Tesseract OCR engine. The Tesseract binary must be installed separately:

[download tesseract here](https://tesseract-ocr.github.io/tessdoc/#binaries)

*Note that Tesseract OCR is not perfect and may not extract text accurately from every image.*

In [None]:
# py -m pip install pytesseract Pillow

In [None]:
from PIL import Image
img = Image.open('../downloads/AAPL.png')
img.show()

# Make sure the tesseract executable is on your PATH.
# shutil.which returns its full path, or None if it cannot be found.
import shutil
shutil.which("tesseract")

In [None]:
import pytesseract
from PIL import Image

##############################################################################
# Only needed if tesseract is not on your PATH; adjust the path for your machine
pytesseract.pytesseract.tesseract_cmd = r'C:\amit.la\WIP\RPA\downloads\ts\tesseract.exe'
##############################################################################

def read_image_text(image_path):
    """
    Reads text from an image file using Tesseract OCR.

    Args:
        image_path (str): The file path to the input image.

    Returns:
        str: The extracted text from the image.
    """
    # Load the image file
    image = Image.open(image_path)

    # Use Tesseract OCR to extract the text from the image
    text = pytesseract.image_to_string(image)

    return text

# Example usage
image_path = "../downloads/AAPL.png"
text = read_image_text(image_path)
print(text)

In [None]:
images = {
        "AAPL": "../downloads/AAPL.png",
        "ORCL": "../downloads/ORCL.png",
        "TSLA": "../downloads/TSLA.png",
        "GOOG": "../downloads/GOOG.png",
        "MSFT": "../downloads/MSFT.png"
    }

# automate reading images and creating text from these images
# you can further store these texts into a database

for key,value in images.items():
    # print(key, value)
    text = read_image_text(value)
    print(text)
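
As the comment above suggests, the extracted text can be stored in a database. The sketch below uses SQLite with an in-memory database; the table name `ocr_text`, its columns, and the helper `save_ocr_text` are assumptions made for this example, and the sample string stands in for real OCR output.

```python
import sqlite3

def save_ocr_text(conn, ticker, source_path, text):
    """Insert or update the OCR text for one image, keyed by ticker."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_text ("
        "ticker TEXT PRIMARY KEY, source_path TEXT, content TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO ocr_text (ticker, source_path, content) VALUES (?, ?, ?)",
        (ticker, source_path, text),
    )
    conn.commit()

# In-memory database for the demo; use a file path to persist the data.
demo_conn = sqlite3.connect(":memory:")
save_ocr_text(demo_conn, "AAPL", "../downloads/AAPL.png", "sample OCR text")
save_ocr_text(demo_conn, "AAPL", "../downloads/AAPL.png", "updated OCR text")
stored = demo_conn.execute("SELECT ticker, content FROM ocr_text").fetchall()
print(stored)
```

Inside the loop above, the call would be `save_ocr_text(conn, key, value, text)`. Using `INSERT OR REPLACE` keeps one row per ticker when the same image is processed again.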

## Approach 3: using Microsoft Document Services API

In [None]:
"""
This code sample shows Prebuilt Read operations with the Azure Form Recognizer client library. 
The async versions of the samples require Python 3.6 or later.

To learn more, please visit the documentation - Quickstart: Form Recognizer Python client library SDKs
https://learn.microsoft.com/azure/applied-ai-services/form-recognizer/quickstarts/get-started-v3-sdk-rest-api?view=doc-intel-3.1.0&pivots=programming-language-python
"""

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

"""
Remember to remove the key from your code when you're done, and never post it publicly. For production, use
secure methods to store and access your credentials. For more information, see 
https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-security?tabs=command-line%2Ccsharp#environment-variables-and-application-configuration
"""
endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

def format_bounding_box(bounding_box):
    if not bounding_box:
        return "N/A"
    return ", ".join(["[{}, {}]".format(p.x, p.y) for p in bounding_box])

def analyze_read():
    # sample document
    formUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf"

    document_analysis_client = DocumentAnalysisClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    
    poller = document_analysis_client.begin_analyze_document_from_url(
            "prebuilt-read", formUrl)
    result = poller.result()

    print ("Document contains content: ", result.content)
    
    for idx, style in enumerate(result.styles):
        print(
            "Document contains {} content".format(
                "handwritten" if style.is_handwritten else "no handwritten"
            )
        )

    for page in result.pages:
        print("----Analyzing Read from page #{}----".format(page.page_number))
        print(
            "Page has width: {} and height: {}, measured with unit: {}".format(
                page.width, page.height, page.unit
            )
        )

        for line_idx, line in enumerate(page.lines):
            print(
                "...Line # {} has text content '{}' within bounding box '{}'".format(
                    line_idx,
                    line.content,
                    format_bounding_box(line.polygon),
                )
            )

        for word in page.words:
            print(
                "...Word '{}' has a confidence of {}".format(
                    word.content, word.confidence
                )
            )

    print("----------------------------------------")


if __name__ == "__main__":
    analyze_read()
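
The per-word confidence values printed above can double as a simple quality signal for the QC step. The sketch below works on plain `(content, confidence)` pairs shaped like `word.content` and `word.confidence`; the sample pairs are made up, and the 0.8 threshold is an arbitrary assumption to tune for your documents.

```python
from statistics import mean

def low_confidence_words(words, threshold=0.8):
    """Return the (content, confidence) pairs whose confidence is below threshold."""
    return [(content, conf) for content, conf in words if conf < threshold]

# Made-up pairs shaped like word.content / word.confidence from the Read result.
sample_words = [("Invoice", 0.99), ("T0tal", 0.42), ("2024", 0.95)]

flagged = low_confidence_words(sample_words)
average_confidence = mean(conf for _, conf in sample_words)
print(flagged)
print(round(average_confidence, 2))
```

With a real result, the pairs could be built as `[(w.content, w.confidence) for page in result.pages for w in page.words]`.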


## Approach 4: using Oracle Vision API
## Approach 5: using Other Vision API
## Approach 6*: future - using ollama llama3.2 multimodal

# QC Agent to validate OCR Quality

A proper enterprise QC agent interface needs a more sophisticated agent-based framework. 
We'll revisit this topic in future blogs, where we'll explore agents in greater detail.

	Agent-based framework implementation, enterprise ready
	- LangChain | LangGraph or
	- swarms or
	- crewAI or
	- Microsoft AutoGen | Semantic Kernel

For the time being, we'll use a straightforward GenAI chat interface: we pass the OCR output to the LLM and ask whether it looks like a valid document extract.

In [None]:
import ollama

data = "" # data read from OCR

def get_completion(prompt):
    output = ollama.generate(
        model = "llama3.2",
        prompt = prompt
        )
    return output["response"]  # type: ignore

completion = get_completion(
    f"""Here is text received from an OCR reader: <ocr_text>{data}</ocr_text>

Please assess whether this text appears to be a reasonable extract from a document.
Reply with VALID or INVALID on the first line, followed by a one-sentence reason.
"""
)
print(completion)
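
Before spending an LLM call, a cheap rule-based check can catch obviously broken OCR output. The sketch below is a heuristic of my own choosing, not part of any library: `ocr_sanity_score` is a hypothetical helper that measures the share of non-whitespace characters that are letters or digits, and both sample strings are made up.

```python
def ocr_sanity_score(text: str) -> float:
    """Fraction of non-whitespace characters that are letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0  # empty or whitespace-only input is treated as unusable
    return sum(c.isalnum() for c in chars) / len(chars)

# Made-up samples: one clean line and one with typical OCR noise.
clean_sample = "Quarterly report page one of three"
noisy_sample = "Qu@rt~rly r#p%rt p^ge ||| *&"

print(ocr_sanity_score(clean_sample))
print(ocr_sanity_score(noisy_sample) < ocr_sanity_score(clean_sample))
```

A score well below that of known-good documents suggests re-running OCR with another backend before asking the LLM for a verdict. Where to set that cutoff is something to determine from your own data.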

# Push data to SQL DB and RAG Vector DB

In [None]:
import sqlite3

# Connect to the database (or create it)
conn = sqlite3.connect('example.db')
cursor = conn.cursor()

# Create a table
cursor.execute('''
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER
)
''')
conn.commit()

# Insert a record
cursor.execute('''
INSERT INTO users (name, age) VALUES (?, ?)
''', ('Alice Wonder', 30))
conn.commit()

# Retrieve records
cursor.execute('SELECT * FROM users')
rows = cursor.fetchall()
for row in rows:
    print(row)

# Close the connection
conn.close()

In [None]:
conn = sqlite3.connect('example.db')
cursor = conn.cursor()

def get_user(employee):
    # Use a parameterized query instead of string formatting to avoid SQL injection.
    # The prefix match lets "Alice" find the stored record "Alice Wonder".
    cursor.execute("SELECT * FROM users WHERE name LIKE ?", (f"{employee}%",))
    return cursor.fetchall()

print(get_user("Alice"))

# Keep the connection open: get_user is called again by the chat tools below.
# conn.close()
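
The cells above cover the SQL side only. To give a feel for the vector-DB side, here is a toy, dependency-free stand-in: it "embeds" text as word counts and retrieves by cosine similarity. This is explicitly not ChromaDB and not mxbai-embed-large; a real setup would replace `embed` with an embedding model and the dictionary with a vector store. All names and sample documents are made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_match(query, documents):
    """Return the id of the document most similar to the query."""
    q = embed(query)
    return max(documents, key=lambda doc_id: cosine(q, embed(documents[doc_id])))

# Made-up documents standing in for stored OCR text.
docs = {
    "AAPL": "apple quarterly revenue iphone sales",
    "TSLA": "tesla vehicle deliveries and battery production",
}
best = top_match("iphone revenue", docs)
print(best)
```

The retrieved document text would then be placed in the prompt, which is the retrieval step of RAG.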

# Create a chat with tools/function calling to query the SQL and RAG DBs

In [None]:
import ollama

tools = [{
      'type': 'function',
      'function': {
        'name': 'get_current_weather',
        'description': 'Get the current weather for a city',
        'parameters': {
          'type': 'object',
          'properties': {
            'city': {
              'type': 'string',
              'description': 'The name of the city',
            },
          },
          'required': ['city'],
        },
      },
    },
    {
      'type': 'function',
      'function': {
        'name': 'get_user',
        'description': 'Get the current age of employee',
        'parameters': {
          'type': 'object',
          'properties': {
            'employee': {
              'type': 'string',
              'description': 'The name of the employee',
            },
          },
          'required': ['employee'],
        },
      },
    },
  ]

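# Placeholder: get_current_weather is not defined elsewhere in this notebook,
# so this stub stands in for a real weather API call.
def get_current_weather(city):
    return f"Weather data for {city} is not available in this demo."

# generic function that calls the appropriate tool based on the tool name
def process_tool_call(tool_name, tool_input):
    if tool_name == "get_current_weather":
        return get_current_weather(tool_input["city"])
    if tool_name == "get_user":
        return get_user(tool_input["employee"])
    return f"Unknown tool: {tool_name}"
  
# print(process_tool_call('get_current_weather', {'city': 'Los Angeles CA'}))
print(process_tool_call('get_user', {'employee': 'Alice'}))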
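
As the number of tools grows, the chain of `if` statements can be replaced by a lookup table. The sketch below is one possible design, not the approach used in the rest of this notebook: `dispatch_tool`, `TOOL_REGISTRY`, and the two `demo_*` functions are hypothetical and return canned data so the example runs without a database or weather service.

```python
def demo_weather(city):
    """Hypothetical stub; a real version would call a weather API."""
    return f"(demo) no live weather data for {city}"

def demo_user(employee):
    """Hypothetical stub; a real version would query the users table."""
    return [(1, f"{employee} Wonder", 30)]

# Keys match the tool names declared in the tools list.
TOOL_REGISTRY = {
    "get_current_weather": demo_weather,
    "get_user": demo_user,
}

def dispatch_tool(tool_name, tool_input, registry=TOOL_REGISTRY):
    """Look up the tool by name and call it with the model-supplied arguments."""
    func = registry.get(tool_name)
    if func is None:
        return f"Unknown tool: {tool_name}"
    return func(**tool_input)

print(dispatch_tool("get_user", {"employee": "Alice"}))
print(dispatch_tool("get_stock_price", {"ticker": "AAPL"}))
```

Unpacking `tool_input` with `**` works because the argument names in each tool schema match the parameter names of the Python function.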

In [None]:
response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 
            'how old is Alice?'}],
        # provide the tool definitions to the model
        tools=tools # type: ignore
    )

# response; tool_calls is missing when the model answers without calling a tool
tool_calls = response["message"].get("tool_calls") or []
print("\nInitial Response:")
if tool_calls:
    print(f"Tool called: {tool_calls[0]}")
    print(f"Tool name: {tool_calls[0]['function']['name']}")
    print(f"Tool param: {tool_calls[0]['function']['arguments']}")
print(f"Stop Reason: {response['done_reason']}")
print(f"Content: {response['message']['content']}")

In [None]:
response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 
            'What is the weather in Los Angeles CA today?'}],
        # provide a weather checking tool to the model
        tools=tools # type: ignore
    )

# response; tool_calls is missing when the model answers without calling a tool
tool_calls = response["message"].get("tool_calls") or []
print("\nInitial Response:")
if tool_calls:
    print(f"Tool called: {tool_calls[0]}")
    print(f"Tool name: {tool_calls[0]['function']['name']}")
    print(f"Tool param: {tool_calls[0]['function']['arguments']}")
print(f"Stop Reason: {response['done_reason']}")
print(f"Content: {response['message']['content']}")

In [None]:
def chatBot(user_message):
    print(f"\n{'='*50}\nUser Message: {user_message}\n{'='*50}")
    messages = [{'role': 'user', 'content': user_message}]
    response = ollama.chat(
        model='llama3.2',
        messages=messages,
        # provide the weather and employee lookup tools to the model
        tools=tools # type: ignore
    )
    tool_calls = response["message"].get("tool_calls") or []
    print("\nInitial Response:")
    print(f"Tool calls: {tool_calls}")
    print(f"Stop Reason: {response['done_reason']}")
    print(f"Content: {response['message']['content']}")

    # Check for tool calls directly rather than relying on done_reason
    if tool_calls:
        # keep the assistant turn that requested the tools in the history
        messages.append(response["message"])
        for call in tool_calls:
            tool_name = call["function"]["name"]
            tool_input = call["function"]["arguments"]
            tool_result = process_tool_call(tool_name, tool_input)
            print(f"Tool Result: {tool_result}")
            messages.append({"role": "tool", "content": str(tool_result)})

        # send the tool results back so the model can write the final answer
        response = ollama.chat(model='llama3.2', messages=messages)
        print(response["message"]["content"])
    return response

In [None]:
chatBot("How is the weather in San Francisco today?")
# chatBot("How old is my employee name Alice?")

# Build a Chat App

## Build, host, and serve a GenAI RAG app using llama3.2, Ollama, ChromaDB, SQLite, Taipy, and Open WebUI

For a walkthrough, see this [X article](https://x.com/ashuklax/status/1854404956075217330) or this YouTube [tutorial](https://www.youtube.com/playlist?list=PLp0TENYyY8lF8EsgtfDoPkuAgxc-lcwbd).