# Welcome to :snowflake: OCR and RAG with Snowflake Notebooks :notebook:

Transform your images into searchable text and build a question-answering system using OCR and RAG capabilities in [Snowflake Notebooks](https://docs.snowflake.com/LIMITEDACCESS/snowsight-notebooks/ui-snowsight-notebooks-about)! ⚡️

This notebook demonstrates how to build an end-to-end application that:
1. Performs OCR on images using Tesseract
2. Stores and indexes the extracted text
3. Creates a question-answering interface using Snowflake's Cortex capabilities and Streamlit


## Before you start 🛑

Use the [quickstart](URL HERE) guide to prepare and load images into staging. 

## Setting Up Your Environment 🎒

First, we'll import the required packages and set up our Snowflake session. The notebook uses several key packages:
- `streamlit`: For creating the interactive web interface
- `tesserocr`: For optical character recognition
- `pandas`: For data manipulation
- `PIL`: For image processing
- Snowpark packages for interacting with Snowflake


In [None]:
# Import python packages
import streamlit as st
import tesserocr
import io
import pandas as pd
from PIL import Image

# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.types import StringType, StructField, StructType, IntegerType
from snowflake.snowpark.files import SnowflakeFile
from snowflake.core import CreateMode
from snowflake.core.table import Table, TableColumn
from snowflake.core.schema import Schema
from snowflake.core import Root

session = get_active_session()
session.use_schema("ocr_rag")
root = Root(session)
database = root.databases[session.get_current_database()]

## Creating the Schema and Table Structure 🏗️

Next, we'll create the table that stores the processed documents in the `ocr_rag` schema. The `docs_chunks_table` will store:
- File paths and URLs for our images
- Extracted text chunks from OCR
- Vector embeddings for semantic search


In [None]:
docs_chunks_table = Table(
    name="docs_chunks_table",
    columns=[TableColumn(name="relative_path", datatype="string"),
            TableColumn(name="file_url", datatype="string"),
            TableColumn(name="scoped_file_url", datatype="string"),
            TableColumn(name="chunk", datatype="string"),
            TableColumn(name="chunk_vec", datatype="vector(float,768)")]
)
database.schemas["ocr_rag"].tables.create(docs_chunks_table, mode=CreateMode.or_replace)

## OCR Processing with Tesseract 📸

Now we'll create a User-Defined Table Function (UDTF) that:
1. Reads images from Snowflake storage
2. Processes them with Tesseract OCR
3. Returns the extracted text

This function handles binary image data and can process multiple image formats.


In [None]:
session.sql("DROP FUNCTION IF EXISTS IMAGE_TEXT(VARCHAR)").collect()

class ImageText:
    def process(self, file_url: str):
        with SnowflakeFile.open(file_url, 'rb') as f:
            buffer = io.BytesIO(f.readall())
        image = Image.open(buffer)
        text = tesserocr.image_to_text(image)
        yield (text,)  # Return the full OCR text

output_schema = StructType([StructField("full_text", StringType())])

session.udtf.register(
    ImageText,
    name="IMAGE_TEXT",
    is_permanent=True,
    stage_location="@ocr_rag.images_to_ocr",
    schema="ocr_rag",
    output_schema=output_schema,
    packages=["tesserocr", "pillow","snowflake-snowpark-python"],
    replace=True
)
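
Because the handler is a plain Python class, its `process` contract (one input row in, zero or more output tuples out) can be exercised locally before registration. The sketch below is a hypothetical stand-in, not the registered UDTF: `LocalImageText`, `read_bytes`, and `ocr` are illustrative names, and the file reader and OCR engine are injected so that nothing depends on Snowflake or Tesseract.

```python
from typing import Callable, Iterator, Tuple


class LocalImageText:
    """Hypothetical local stand-in that mirrors the shape of the UDTF handler."""

    def __init__(self, read_bytes: Callable[[str], bytes],
                 ocr: Callable[[bytes], str]):
        # Both callables are assumptions for illustration: in the notebook,
        # SnowflakeFile does the reading and tesserocr does the OCR.
        self.read_bytes = read_bytes
        self.ocr = ocr

    def process(self, file_url: str) -> Iterator[Tuple[str]]:
        data = self.read_bytes(file_url)
        # A UDTF yields tuples that match the output schema
        # (here: a single string column).
        yield (self.ocr(data),)


# Fake reader and fake OCR engine so the example runs anywhere.
fake_files = {"scan_001.png": b"hello world"}
handler = LocalImageText(
    read_bytes=lambda url: fake_files[url],
    ocr=lambda data: data.decode("utf-8").upper(),
)
rows = list(handler.process("scan_001.png"))
print(rows)
```

Injecting the reader and the OCR callable keeps the row-in, tuples-out logic testable on its own; the registered function still needs the real packages listed in `packages=[...]`.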

## Processing Images and Extracting Text 🔄

Let's process our staged images through the OCR function. This query will:
1. Read all images from our stage
2. Run OCR on each image
3. Return the extracted text along with file information


In [None]:
SELECT 
    relative_path, 
    file_url, 
    build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path) AS scoped_file_url,
    ocr_result.full_text
FROM 
    directory(@ocr_rag.images_to_ocr),
    TABLE(IMAGE_TEXT(build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path))) AS ocr_result;

## Text Processing and Vectorization 🔤

Now we'll process the extracted text by:
1. Splitting it into manageable chunks
2. Creating vector embeddings using Snowflake Cortex
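3. Storing the results for efficient retrieval

The query below reads the OCR cell's output through the `{{run_through_files_to_ocr}}` reference, so the previous SQL cell must be named `run_through_files_to_ocr`.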


In [None]:
INSERT INTO docs_chunks_table (relative_path, file_url, scoped_file_url, chunk, chunk_vec)
SELECT 
    relative_path, 
    file_url,
    scoped_file_url,
    chunk.value::STRING AS chunk,
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', chunk.value::STRING) AS chunk_vec
FROM
    {{run_through_files_to_ocr}},
    LATERAL FLATTEN(SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(full_text, 'none', 4000, 400)) chunk;
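
To build intuition for the `4000` (chunk size) and `400` (overlap) arguments, here is a simplified sliding-window splitter in plain Python. It is not the algorithm Cortex uses (the recursive splitter prefers to break on natural separators); `sliding_window_chunks` is a hypothetical helper that only shows how chunk size and overlap interact.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghij" * 1000  # 10,000 characters
chunks = sliding_window_chunks(sample, chunk_size=4000, overlap=400)
print([len(c) for c in chunks])
```

The last 400 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.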

## Building the Question-Answering System 🤖

Finally, we'll create our QA system that uses:
- Vector similarity search to find relevant context
- Mistral-7b model for generating answers
- Streamlit for the user interface

Key parameters:
- `num_chunks`: Number of context chunks provided (default: 3)
- `model`: Language model used (default: "mistral-7b")
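
The retrieval step ranks stored chunks by the cosine similarity between the question embedding and each chunk embedding. A minimal sketch of that ranking in plain Python, using toy 3-dimensional vectors in place of the 768-dimensional embeddings; `cosine_similarity`, `top_k`, and the sample data are illustrative assumptions, not part of the notebook.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]],
          k: int) -> List[Tuple[str, float]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for real model output.
toy_chunks = [
    ("invoice total", [1.0, 0.0, 0.0]),
    ("shipping address", [0.0, 1.0, 0.0]),
    ("invoice date", [0.9, 0.1, 0.0]),
]
query = [1.0, 0.0, 0.0]
results = top_k(query, toy_chunks, k=2)
print([text for text, _ in results])
```

Here `k` plays the role of `num_chunks`: raising it adds lower-ranked chunks to the context, which can help recall but also adds noise to the prompt.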


In [None]:
num_chunks = 3  # Number of chunks provided as context; adjust it to see how it affects accuracy
model = "mistral-7b"  # The language model used to generate answers
def create_prompt(myquestion):
    # Retrieve the chunks most similar to the question from the table created earlier.
    cmd = """
     with results as
     (SELECT RELATIVE_PATH,
       VECTOR_COSINE_SIMILARITY(chunk_vec,
                SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)) as similarity,
       chunk
     from ocr_rag.docs_chunks_table
     order by similarity desc
     limit ?)
     select chunk, relative_path from results
     """
    df_context = session.sql(cmd, params=[myquestion, num_chunks]).to_pandas()

    # Concatenate every retrieved chunk into the prompt context.
    prompt_context = "\n".join(df_context['CHUNK'])
    prompt_context = prompt_context.replace("'", "")
    # The source file of the top-ranked chunk is offered to the user as a link.
    relative_path = df_context['RELATIVE_PATH'].iloc[0]
    prompt = f"""
      'You are an expert assistant extracting information from the context provided.
       Answer the question based on the context. Be concise and do not hallucinate.
       If you don't have the information, just say so.
      Context: {prompt_context}
      Question:
       {myquestion}
       Answer: '
       """
    # A single pre-signed URL is enough, so no FROM clause is needed.
    cmd2 = f"select GET_PRESIGNED_URL(@ocr_rag.images_to_ocr, '{relative_path}', 360) as URL_LINK"
    df_url_link = session.sql(cmd2).to_pandas()
    url_link = df_url_link['URL_LINK'].iloc[0]

    return prompt, url_link, relative_path
def complete(myquestion, model_name):
    prompt, url_link, relative_path = create_prompt(myquestion)
    cmd = "select SNOWFLAKE.CORTEX.COMPLETE(?, ?) as response"

    df_response = session.sql(cmd, params=[model_name, prompt]).collect()
    return df_response, url_link, relative_path
def display_response (question, model):
    response, url_link, relative_path = complete(question, model)
    res_text = response[0].RESPONSE
    st.markdown(res_text)

    display_url = f"Link to [{relative_path}]({url_link}) that may be useful"
    st.markdown(display_url)
#Main code
st.title("Asking Questions to Your Scanned Documents with Snowflake Cortex:")
docs_available = session.sql("ls @ocr_rag.images_to_ocr").collect()
question = st.text_input("Enter question", placeholder="What are my documents about?", label_visibility="collapsed")
if question:
    display_response (question, model)
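
Prompt assembly is plain string handling, so it can be checked without a Snowflake session. A minimal sketch, assuming the retrieved chunks arrive as a list of strings; `build_prompt` is a hypothetical helper and the sample chunks are made up for illustration.

```python
from typing import List


def build_prompt(chunks: List[str], question: str) -> str:
    """Join all retrieved chunks into one context block and wrap it in instructions."""
    context = "\n".join(chunks)
    return (
        "You are an expert assistant extracting information from the context provided.\n"
        "Answer the question based on the context. Be concise and do not hallucinate.\n"
        "If you don't have the information, just say so.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer: "
    )


prompt = build_prompt(
    ["Invoice 42 is due on 1 May.", "Total: 100 EUR."],
    "When is invoice 42 due?",
)
print(prompt)
```

Keeping this step in a pure function makes it easy to confirm that every retrieved chunk actually reaches the model.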

## Performance Tips 🚀

To get the best results from your OCR and RAG system:

1. Fine-tune your parameters:
   - Adjust `num_chunks` based on your document length and complexity
   - Experiment with different chunk sizes for optimal context
   - Monitor response quality and adjust as needed

2. Optimize image processing:
   - Ensure good image quality for better OCR results
   - Consider batch processing for large image sets
   - Pre-process images if needed (contrast, resolution)

3. Manage resources efficiently:
   - Monitor memory usage with large document sets
   - Use appropriate vector similarity thresholds
   - Consider indexing strategies for large collections
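
The similarity-threshold tip in item 3 can be sketched as a post-filter on retrieved rows, so that weakly related chunks never reach the prompt. This assumes the rows are `(chunk, similarity)` pairs; `filter_by_similarity` and the `0.75` cutoff are illustrative, not recommended values.

```python
from typing import List, Tuple


def filter_by_similarity(rows: List[Tuple[str, float]],
                         min_similarity: float,
                         max_chunks: int) -> List[Tuple[str, float]]:
    """Keep at most `max_chunks` rows whose similarity meets the threshold, best first."""
    kept = [row for row in rows if row[1] >= min_similarity]
    kept.sort(key=lambda row: row[1], reverse=True)
    return kept[:max_chunks]


# Made-up retrieval results for illustration.
retrieved = [("chunk A", 0.91), ("chunk B", 0.62), ("chunk C", 0.80), ("chunk D", 0.78)]
selected = filter_by_similarity(retrieved, min_similarity=0.75, max_chunks=2)
print(selected)
```

A suitable threshold depends on the embedding model and the documents, so it is worth tuning against a handful of known questions.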


## Security Considerations 🔒

Important security aspects to keep in mind:

1. Access Control:
   - Ensure proper permissions on image storage
   - Manage API access appropriately
   - Monitor usage patterns and access logs

2. Data Privacy:
   - Consider sensitivity of extracted text
   - Implement appropriate data retention policies
   - Handle user queries securely


## Going Further 🌟

Consider extending the application with:

1. Enhanced Features:
   - Multiple language support
   - Custom OCR pre-processing
   - Additional document formats

2. UI Improvements:
   - Enhanced visualization of results
   - Better context highlighting
   - User feedback mechanisms

3. Model Enhancements:
   - Custom embedding models
   - Fine-tuned language models
   - Improved prompt engineering

For more information on Snowflake's AI capabilities, visit the [Snowflake documentation](https://docs.snowflake.com/).
