<a href="https://colab.research.google.com/github/Fuenfgeld/LLM-Utility-Cookbook/blob/main/ScanToText.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PDFScan to Text
This notebook uses Python to apply Optical Character Recognition (OCR) to image and PDF files. It clones the repository with the sample files, installs the required libraries, extracts text from an image, converts a PDF to images, and runs OCR on those images.

### Git Clone
We clone the GitHub repository that contains the sample data and files used in this notebook.

In [None]:
!git clone https://github.com/Fuenfgeld/LLM-Utility-Cookbook.git

### Install Required Libraries
In this section, we install the required libraries for OCR. We use Tesseract for OCR and Poppler-utils for converting PDFs to images.

In [None]:
!sudo apt-get install -y tesseract-ocr
!sudo apt-get install -y poppler-utils

In [None]:
%pip install pytesseract
%pip install pdf2image
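
Before moving on, it can help to confirm that the system binaries are actually reachable. The sketch below is an addition to the notebook: it uses the standard library's `shutil.which` to look for `tesseract` and for `pdftoppm`, which ships with Poppler-utils and is called by pdf2image. The helper name `missing_tools` is hypothetical.

```python
import shutil

def missing_tools(tools):
    """Return the names from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# tesseract comes from tesseract-ocr; pdftoppm comes from poppler-utils.
missing = missing_tools(["tesseract", "pdftoppm"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All OCR tools found.")
```

If a tool is reported as missing, rerunning the install cells above is the first thing to try.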

In [None]:
import pytesseract
from PIL import Image

### Upload File (Optional)
In case you want to upload a file from your local computer, uncomment and run this cell. This will prompt you to select a file from your local filesystem.

In [None]:
#from google.colab import files
#uploaded = files.upload()

### Image to Text
For a single image, no conversion is needed. We open the file with PIL and pass it to `pytesseract.image_to_string`, which returns the recognized text as a string.

In [None]:
imagePath = '/content/LLM-Utility-Cookbook/data/DocImage.png'

In [None]:
extractedText = pytesseract.image_to_string(Image.open(imagePath))

In [None]:
extractedText
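
Tesseract's defaults do not suit every scan. As a sketch under assumptions (not part of the original notebook), the helper below passes a language code and a page segmentation mode through pytesseract's `lang` and `config` parameters. The function names `build_config` and `ocr_image` are hypothetical, and any language other than `eng` needs its Tesseract language pack installed separately.

```python
def build_config(psm=3, oem=3):
    """Build a Tesseract command-line option string.

    psm: page segmentation mode (3 = fully automatic, 6 = single uniform block of text).
    oem: OCR engine mode (3 = default, based on what is available).
    """
    return f"--oem {oem} --psm {psm}"

def ocr_image(image_path, lang="eng", psm=3):
    """Run OCR on one image file with an explicit language and segmentation mode."""
    # Imported inside the function so build_config stays usable on its own.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang, config=build_config(psm=psm))
```

For example, `ocr_image(imagePath, psm=6)` treats the page as a single block of text, which can help on simple single-column scans.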

### PDF to Image to Text
For PDFs, the process is a bit different. Since OCR engines typically work on images, we first convert the PDF to images. Each page of the PDF is converted into a separate image. Then, we apply the OCR engine to each image to extract the text.

In [None]:
from pdf2image import convert_from_path
import os

### Convert PDF to Images
We convert each page of the PDF into a separate image using the pdf2image library. These images are saved in a specified output directory.

In [None]:
pdfPath = '/content/LLM-Utility-Cookbook/data/ScanPDF.pdf'
outputDirPath = '/content/pdfImages'
os.makedirs(outputDirPath, exist_ok=True)

# convert_from_path returns one PIL image per PDF page
images = convert_from_path(pdfPath)
for i, image in enumerate(images):
  image.save(os.path.join(outputDirPath, 'output' + str(i) + '.jpg'), 'JPEG')
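
Saving the pages to disk is handy for inspecting them, but it is not strictly required: `convert_from_path` returns PIL images, and `pytesseract.image_to_string` accepts those directly. A minimal sketch of that shortcut follows. The helper name `ocr_pages` is hypothetical, and the OCR function is passed in as a parameter so the helper does not depend on a particular engine.

```python
from typing import Callable, Dict, Iterable

def ocr_pages(pages: Iterable, ocr: Callable[[object], str]) -> Dict[int, str]:
    """Run `ocr` on each page image and key the text by 1-based page number."""
    return {number: ocr(page) for number, page in enumerate(pages, start=1)}
```

With the names from this notebook, the call would be `ocr_pages(convert_from_path(pdfPath), pytesseract.image_to_string)`. This also skips the lossy JPEG step in between.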

### Extract Text from Images
We iterate through each image that resulted from the PDF conversion and extract text using Tesseract. The text is saved in a dictionary with the image's filename as the key for easy lookup.

In [None]:
imagesToProcess = os.listdir(outputDirPath)
extractedTextPages = {}

for tempFileName in imagesToProcess:
  tempPath = os.path.join(outputDirPath, tempFileName)
  # Open each page image and store its OCR text under the filename
  extractedTextPages[tempFileName] = pytesseract.image_to_string(Image.open(tempPath))

In [None]:
extractedTextPages
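
`os.listdir` returns filenames in arbitrary order, and a plain alphabetical sort would place `output10.jpg` before `output2.jpg`. To reassemble the document in page order, one option is to sort by the number embedded in each filename. The sketch below assumes the `output<i>.jpg` naming used above; `page_index` and `join_pages` are hypothetical helper names.

```python
import re

def page_index(file_name):
    """Extract the page number from a name such as 'output12.jpg'."""
    match = re.search(r"(\d+)", file_name)
    if match is None:
        raise ValueError(f"No page number in {file_name!r}")
    return int(match.group(1))

def join_pages(text_by_file, separator="\n\n"):
    """Concatenate page texts in numeric page order."""
    ordered = sorted(text_by_file, key=page_index)
    return separator.join(text_by_file[name] for name in ordered)

# Small stand-in for extractedTextPages, deliberately out of order.
example = {"output10.jpg": "last", "output0.jpg": "first", "output2.jpg": "middle"}
print(join_pages(example, separator=" | "))  # first | middle | last
```

Calling `join_pages(extractedTextPages)` would then give the whole document as one string.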

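The conversion cell can be rerun as is, because `os.makedirs` is called with `exist_ok=True`. To start from a clean folder, for example so that page images from a previous PDF are not picked up by the OCR loop, delete it first with the cell below.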

In [None]:
!rm -r /content/pdfImages
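
The shell command above reports an error when the folder does not exist yet. As an alternative sketch in plain Python, `shutil.rmtree` with `ignore_errors=True` removes the folder if it is present and does nothing otherwise. The helper name `reset_dir` is hypothetical.

```python
import os
import shutil

def reset_dir(dir_path):
    """Delete `dir_path` with all its contents, then recreate it empty."""
    shutil.rmtree(dir_path, ignore_errors=True)
    os.makedirs(dir_path, exist_ok=True)
```

In this notebook the call would be `reset_dir(outputDirPath)`, run before the PDF conversion cell.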