Colab script for converting PDF documents into .txt format.
It requires a fresh install of Anaconda in Colab, plus a number of other packages. The setup effectively builds a new virtual machine just for use with Tesseract. Once everything is built, converting a PDF into text is generally quick. This might work better in a Docker container with more capacity, or on a local machine with a GPU and plenty of RAM. I try to scale my projects around the capabilities of Colab.
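As a rough illustration of the conversion step itself, here is a minimal sketch of a PDF-to-text pipeline. It is not this project's actual script. It assumes the `pdftoppm` (from poppler-utils) and `tesseract` command-line tools are already installed and on the PATH, and it calls them through `subprocess` rather than through the Anaconda environment described above. The function names (`pdftoppm_cmd`, `tesseract_cmd`, `page_number`, `pdf_to_txt`) and the defaults of 300 DPI and English are hypothetical choices made for the example.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_cmd(pdf_path, out_prefix, dpi=300):
    """Build the command that renders each PDF page to <out_prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def tesseract_cmd(image_path, lang="eng"):
    """Build the command that OCRs one image and prints the text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """Extract the page number from a file name such as 'page-07.png'."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def pdf_to_txt(pdf_path, txt_path=None, dpi=300, lang="eng"):
    """Render a PDF to page images, OCR each page, and write one .txt file.

    Assumption: pdftoppm and tesseract are installed and on the PATH.
    """
    pdf_path = Path(pdf_path)
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")

    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdftoppm_cmd(pdf_path, prefix, dpi), check=True)

        # Sort numerically so page 10 does not come before page 2.
        images = sorted(Path(tmp).glob("page-*.png"), key=page_number)

        pages = []
        for image in images:
            result = subprocess.run(
                tesseract_cmd(image, lang),
                check=True,
                capture_output=True,
                text=True,
            )
            pages.append(result.stdout)

    # Separate pages with a form feed so page boundaries stay recoverable.
    txt_path.write_text("\n\f\n".join(pages), encoding="utf-8")
    return txt_path
```

Calling `pdf_to_txt("report.pdf")` would write `report.txt` next to the input file. Rendering at a higher DPI usually helps OCR accuracy at the cost of speed and memory, which matters on a constrained runtime like Colab.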