<a href="https://colab.research.google.com/github/AreebAhmad-02/Embedding-Models-Finetuning/blob/main/OCR_of_Scanned_PDF_Using_Google_VIsion_API_and_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Google Vision API for OCR of PDF

### Using the Google Vision API to read scanned pages from a PDF

In [1]:
!pip install google-cloud-storage



In [2]:
!pip install --upgrade google-cloud-vision



In [3]:
!pip freeze


In [8]:
import os
# Replace 'YOUR_SERVICE_ACCOUNT_KEY.json' with the actual path to your service account key.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/image-project-123-494614a0d191.json"

The code below uses the Google Cloud Vision API to asynchronously extract text from a PDF document stored in Google Cloud Storage (GCS). The `async_detect_document` function sets the document type (PDF), groups the output into batches of 100 pages each, sets the GCS location for the JSON output files that hold the detected text and its location, and starts the asynchronous request.
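As a rough sketch of what the batch size implies, assuming each JSON output file holds at most `batch_size` pages, the number of output files to expect can be estimated as follows. The helper name `expected_output_files` is hypothetical and is not part of the Vision client library.

```python
import math

def expected_output_files(total_pages, batch_size=100):
    """Estimate how many JSON output files a PDF of total_pages produces.

    Assumption: every output file holds up to batch_size pages.
    """
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    return math.ceil(total_pages / batch_size)

# A 250-page PDF with the batch size used in this notebook.
print(expected_output_files(250, batch_size=100))
```

Comparing this estimate with the number of files actually listed in the destination prefix is a quick way to notice leftovers from earlier runs.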



In [5]:
import json
import re
from google.cloud import vision
from google.cloud import storage

In [14]:
#Supported mime_types are: 'application/pdf' and 'image/tiff'
def async_detect_document(gcs_source_uri, gcs_destination_uri):

  #Because we are reading from a PDF document, we set this variable
  #to application/pdf.

  #You could also do the same operation with images which you might have
  #pre-processed to make them easier to read.

  mime_type='application/pdf'

  ##Batch size determines how many PDF pages worth of data will go in
  ##each file of text locations

  batch_size = 100

  #We are using a tool which annotates where there is text in an image/pdf
  client = vision.ImageAnnotatorClient()

  feature = vision.Feature(
      type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

  #Here, we tell the Cloud Vision API that our source type is mime_type
  #(a PDF) and where that PDF is found: gcs_source_uri

  gcs_source = vision.GcsSource(uri=gcs_source_uri)
  input_config = vision.InputConfig(
      gcs_source=gcs_source, mime_type=mime_type)

  #This chunk of code says we will be creating JSON files with
  #batch_size (100) pages worth of annotation data each, and that
  #we will put these files in gcs_destination_uri
  gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)

  output_config = vision.OutputConfig(
      gcs_destination=gcs_destination, batch_size=batch_size)

  #we are making an asynchronous request using the input and output
  #configurations we just set up in the last two chunks

  async_request = vision.AsyncAnnotateFileRequest(
      features=[feature], input_config=input_config,
      output_config=output_config)

  #The operation we will be running will asynchronously batch-annotate files
  #using the client and async_request we set up earlier

  operation = client.async_batch_annotate_files(
      requests=[async_request])

  #The request has been submitted at this point, so all that is left
  #is to wait for the operation to finish.

  #We are setting a timeout so that this call doesn't block forever
  #if the task we assigned is too big or something goes wrong.

  print('Waiting for the operation to finish.')
  operation.result(timeout=420)


In [15]:
# async_detect_document("gs://cloud-vision-read/SalesTax_LegalText.pdf","gs://cloud-vision-read/ocr results/")
async_detect_document("gs://cloud-vision-read/Changes to Business Taxes - Resubmission.pdf","gs://cloud-vision-read/ocr results/")

Waiting for the operation to finish.


After text extraction, the **write_to_text** function retrieves the generated JSON output files. It iterates through each file (one per batch), extracts the full text of every page in that batch, prints it, and writes it to a local file named `batch<n>.txt`.
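Objects listed by prefix come back ordered by name, which stops matching page order once page numbers differ in digit count. The sketch below orders output files by their first page number. It assumes the files follow a naming pattern like `output-1-to-100.json`; that pattern and the helper names `first_page` and `sort_by_page` are assumptions to adapt, not something this notebook verifies.

```python
import re

# Assumed naming convention for the output files, e.g.
# "<prefix>output-1-to-100.json"; adjust the pattern if yours differ.
_PAGE_RANGE = re.compile(r"output-(\d+)-to-(\d+)\.json$")

def first_page(blob_name):
    """Return the first page number encoded in a file name, or None."""
    found = _PAGE_RANGE.search(blob_name)
    return int(found.group(1)) if found else None

def sort_by_page(blob_names):
    """Keep only names that match the pattern, ordered by first page."""
    matched = [name for name in blob_names if first_page(name) is not None]
    return sorted(matched, key=first_page)

names = [
    "ocr results/output-1001-to-1100.json",
    "ocr results/output-101-to-200.json",
    "ocr results/",
    "ocr results/output-1-to-100.json",
]
print(sort_by_page(names))
```

The same key function can be applied to blob objects with `sorted(blobs, key=lambda b: first_page(b.name))` after filtering out names that do not match.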

**Key Points:**

- The code assumes the input PDF is located at `gcs_source_uri` and that the output JSON files are written to `gcs_destination_uri` in GCS.
- The batch size of 100 can be adjusted based on your document size and processing needs.

Note: This explanation focuses on the core functionality. Cloud Vision's response contains more detailed information, such as bounding boxes and confidence scores for the detected text, which you can explore further in the API documentation.
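To illustrate the extra detail mentioned in the note, here is a minimal sketch that walks the pages/blocks/paragraphs/words/symbols hierarchy of one page response and yields each word with its confidence and bounding box. The helper name `iter_words` and the sample dictionary are hypothetical; the sample is hand-written to mimic the JSON layout and is not real API output.

```python
def iter_words(page_response):
    """Yield (text, confidence, bounding_box) for each word in one page response."""
    annotation = page_response.get("fullTextAnnotation", {})
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word's text is the concatenation of its symbols.
                    text = "".join(s.get("text", "") for s in word.get("symbols", []))
                    yield text, word.get("confidence"), word.get("boundingBox")

# Hand-written sample shaped like one entry of response["responses"].
sample = {
    "fullTextAnnotation": {
        "pages": [{
            "blocks": [{
                "paragraphs": [{
                    "words": [
                        {"confidence": 0.99,
                         "boundingBox": {"normalizedVertices": [{"x": 0.1, "y": 0.1}]},
                         "symbols": [{"text": "T"}, {"text": "a"}, {"text": "x"}]},
                        {"confidence": 0.42,
                         "symbols": [{"text": "C"}, {"text": "O"}]},
                    ]
                }]
            }]
        }]
    }
}

# Collect words whose confidence falls below an arbitrary threshold.
low_confidence = [text for text, conf, _ in iter_words(sample)
                  if conf is not None and conf < 0.8]
print(low_confidence)
```

Flagging low-confidence words like this can help decide which pages need manual review before further processing.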

In [16]:
def write_to_text(gcs_destination_uri):
    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()
    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    if match is None:
      raise ValueError(f"Not a gs://bucket/prefix URI: {gcs_destination_uri}")
    bucket_name = match.group(1)
    print(bucket_name)
    prefix = match.group(2)
    print(prefix)
    bucket = storage_client.get_bucket(bucket_name)
    # List objects with the given prefix and keep only the files
    # (names with an extension). Building a new list avoids removing
    # items from a list while iterating over it, which skips entries.
    blob_list = [blob for blob in bucket.list_blobs(prefix=prefix)
                 if '.' in blob.name]
    print('Output files:')
    print(len(blob_list))
    print(blob_list)
    for n, output in enumerate(blob_list):

      # Process one output file from GCS.
      # Since we specified batch_size=100, each file contains the
      # responses for up to 100 pages of the input file.
      print("output----------------", output.name)
      try:
        response = json.loads(output.download_as_bytes())
      except ValueError as e:
        print(f"Error decoding JSON: {e}")
        break

      # Make a file to write the contents of this batch in.
      # The with-block closes the file even if an error occurs.
      with open("batch{}.txt".format(n), "w") as file:
        # There is one response per page of the input file.
        for page_response in response['responses']:
          # A page may come back without a fullTextAnnotation
          # (for example, when no text was detected); skip it.
          annotation = page_response.get('fullTextAnnotation')
          if annotation is None:
            continue
          # Here we print the full text of the page.
          # The response contains more information:
          # annotation/pages/blocks/paragraphs/words/symbols
          # including confidence scores and bounding boxes
          print('Full text:\n')
          print(annotation['text'])
          file.write(annotation['text'])


In [17]:
write_to_text("gs://cloud-vision-read/ocr results/")

Streaming output truncated to the last 5000 lines.
0.372% for taxable gross receipts between $100,000,000.01 and $150,000,000
0.557% for taxable gross receipts between $150,000,000.01 and $250,000,000
0.743% for taxable gross receipts between $250,000,000.01 and $500,000,000
0.929% for taxable gross receipts between $500,000,000.01 and $1,000,000,000
1.115% for taxable gross receipts over $1,000,000,000
55
55
Full text:

SANT CO
7025 MAY 10 PM 1:30
(b) "Category 1 Business Activities" means one or more of the business activities described in
NAICS codes 42 (Wholesale Trade), 44 and 45 (Retail Trade), 532 (Rental and Leasing Services),
71 (Arts, Entertainment, and Recreation), 722 (Food Services and Drinking Places), 811 (Repair and
Maintenance), 812 (Personal and Laundry Services) but not including 812930 (Parking Lots and
Garages), and 813 (Religious, Grantmaking, Civic, Professional, and Similar Organizations).
(c) The amount of taxable gross receipts from Category 1 Bu

# Preprocessing
Remove lines that carry no content, such as the stray line numbers in this OCR excerpt:
"age 2(2) To continue in effect the existing tax at the existing 0.5% rate to fund the
2
3
5
6
7
8
9
10
11
12
13"
The function below removes every line that contains only digits.


In [None]:
text_file_path = "/content/batch1.txt"

In [19]:
import re

def remove_number_lines(input_file, output_file):
  """
  This function removes lines containing only numbers from a text file.

  Args:
      input_file: Path to the text file to process.
      output_file: Path to write the filtered text.
  """
  with open(input_file, 'r') as f:
    lines = f.readlines()
    print("lines before preprocessing, including junk lines", len(lines))

  # Each line still ends with '\n'; `$` also matches just before a
  # trailing newline, so digit-only lines are matched as intended.
  filtered_lines = [line for line in lines if not re.match(r"^\d+$", line)]
  print("lines kept after filtering", len(filtered_lines))

  with open(output_file, 'w') as f:
    f.writelines(filtered_lines)




In [22]:
# Example usage
input_file = "/content/batch2.txt"
output_file = f"{input_file}preprocessed_file.txt"
remove_number_lines(input_file, output_file)

print(f"Lines containing only numbers removed and saved to {output_file}")

lines before preprocessing, including junk lines 308
lines kept after filtering 304
Lines containing only numbers removed and saved to /content/batch2.txtpreprocessed_file.txt
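The same filter can also be written against a string instead of files, which makes it easier to try on small samples. This is a sketch rather than the notebook's method: the name `drop_number_only_lines` is hypothetical, and unlike the file-based function it also drops digit-only lines that are padded with spaces.

```python
import re

def drop_number_only_lines(text):
    """Return text without the lines that consist solely of digits.

    Assumption: surrounding whitespace is ignored, so " 3 " is dropped too.
    """
    kept = [line for line in text.splitlines()
            if not re.fullmatch(r"\d+", line.strip())]
    return "\n".join(kept)

sample = "(2) To continue in effect the existing tax\n2\n 3 \n0.5% rate\n13"
print(drop_number_only_lines(sample))
```

Lines that mix digits with other characters, such as `0.5% rate`, are kept because the pattern must match the whole line.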


In [None]:
## reading specific lines
with open("your_text.txt", 'r') as f:
  # Read the file once. Each readline()/readlines() call consumes the
  # file, so repeated calls would continue where the previous one stopped.
  lines = f.readlines()

# Access the first line
first_line = lines[0]
print(f"First line: {first_line}")

# Access the 10th line (assuming it exists)
tenth_line = lines[9]
print(f"Tenth line: {tenth_line}")

# Access lines 5 to 10 (assuming they exist)
lines_5_to_10 = lines[4:10]
print(f"Lines 5 to 10:\n{lines_5_to_10}")
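For large files, a range of lines can be read without loading the whole file by slicing the file iterator with `itertools.islice`. The sketch below uses an in-memory `io.StringIO` as a stand-in for an open text file; the helper name `read_line_range` is hypothetical.

```python
import io
from itertools import islice

def read_line_range(lines, start, stop):
    """Return lines start..stop (1-based, inclusive) from an iterable of lines."""
    return [line.rstrip("\n") for line in islice(lines, start - 1, stop)]

# Stand-in for an open text file with twelve lines.
demo = io.StringIO("".join(f"line {i}\n" for i in range(1, 13)))
print(read_line_range(demo, 5, 10))
```

With a real file, pass the open file object in place of `demo`, for example inside a `with open(path) as f:` block.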

After preprocessing, the cleaned text is saved to the output file.

# Chunking Using LlamaIndex
