<a href="https://colab.research.google.com/github/Digital-Huge-Manitees/Digital_Huge_Manitees/blob/main/OCR_text_analysis_on_Colab_v68.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OCR and text analysis on pdf files
This notebook performs Optical Character Recognition (OCR) with the Pytesseract library, a wrapper for Google's Tesseract OCR engine. Tesseract builds on a long history of text recognition work that began at Hewlett-Packard. Once it was made public, Google developed it further for several years, adding language data for over 100 languages, which are listed in the documentation: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html<br>
## Input
You will need high-quality PDF files as input. They can be multi-page, and they can be digital, scanned image only, or searchable. This notebook is intended to produce usable text from the scanned-image PDF files commonly found in archives. 

## Output
This notebook will output a .txt file of all text in the document. Text can be incomplete due to damage on the original document, unusual typefaces, or scan quality issues. 

## References:
https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/
This notebook was adapted based on the above reference tutorial. 

## Dependencies:
If you decide to run this notebook locally by downloading it to your computer, you will need to create an environment with these dependencies. Read the documentation carefully, as they must be installed in a particular order. 
- pytesseract - https://pypi.org/project/pytesseract/
- tesseract - https://github.com/tesseract-ocr/tesseract
- pdf2image - https://github.com/Belval/pdf2image
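For a local setup, it can help to confirm that each piece is visible to Python before running anything else. The sketch below is not part of the original notebook. It assumes the Tesseract executable is named `tesseract` and that pdf2image relies on Poppler's `pdftoppm` being on your PATH; `check_dependencies` is a hypothetical helper name.

```python
import importlib.util
import shutil


def check_dependencies():
    """Return a dict mapping each dependency name to True if it was found."""
    status = {}
    # Python packages: find_spec returns None when a package is not installed
    for module in ("pytesseract", "pdf2image", "PIL"):
        status[module] = importlib.util.find_spec(module) is not None
    # Command-line tools: shutil.which returns None when a tool is not on the PATH
    for tool in ("tesseract", "pdftoppm"):
        status[tool] = shutil.which(tool) is not None
    return status


for name, found in check_dependencies().items():
    print(name, "found" if found else "MISSING")
```

Anything reported as MISSING needs to be installed before the OCR cells will run.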


## Step 1: Link to your Google Drive account and create a new working directory
Wait, what?! You don't have a Google account? You'll need to create a Google account here: https://support.google.com/accounts/answer/27441?hl=en

### You will need to make sure you are signed into your Google account through your browser. Google Chrome works great for this. 

#### Step 1b: 
**Then - save a copy of this notebook (which is on GitHub) to your Google Drive.** Changes will be saved to your own copy. 

#### Step 1c: 
When you run the following cell, a dialog box will pop up asking you to select your Google account and confirm that you approve access. Once this is complete, go to your Google Drive, and you should see a folder called Colab Notebooks and, within it, OCR_Project_Folder. 
All of your work will be saved to this new folder, which will be your working directory. 




In [None]:
#mount google drive here
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import os 

# Set your working directory to a folder in your Google Drive. 
# the base Google Drive directory
root_dir = "/content/drive/My Drive/"
# choose where you want your project files to be saved
project_folder = "Colab Notebooks/OCR_Project_Folder/"

def create_and_set_working_directory(project_folder):
  # check if your project folder exists. if not, it will be created,
  # along with any missing parent folders such as "Colab Notebooks".
  if not os.path.isdir(root_dir + project_folder):
    os.makedirs(root_dir + project_folder)
    print(root_dir + project_folder + ' did not exist but was created.')

  # change the OS to use your project folder as the working directory
  os.chdir(root_dir + project_folder)

  # show me the current working directory
  print('\nYour working directory was changed to ' + root_dir + project_folder)

create_and_set_working_directory(project_folder)

#source: https://robertbrucecarter.com/writing/2020/06/setting-your-working-directory-to-google-drive-in-a-colab-notebook/

Mounted at /content/drive

Your working directory was changed to /content/drive/My Drive/Colab Notebooks/OCR_Project_Folder/


###  Step 1d: Now, you have a working directory!
- [ ] Go back to GitHub (https://github.com/Digital-Huge-Manitees/Digital_Huge_Manitees) and download the first pdf file from the archive folder. 
- [ ] Then, go to your Google Drive and locate the OCR_Project_Folder you created in the cells above. 
- [ ] Move the pdf file to this Google folder (your working directory). 
- You will repeat this process with every PDF you want to analyze. 

## Step 2: Install dependencies
The following cell installs the libraries needed by this notebook.

In [None]:
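#install tesseract, poppler, pdf2image, Pillow and pytesseract
# -y answers "yes" to apt's confirmation prompt, which a notebook cell cannot answer
!apt install -y tesseract-ocr
!apt install -y libtesseract-dev
!apt-get install -y poppler-utils
!pip install pdf2image
!pip install Pillow
!pip install pytesseract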

# Import libraries
import pytesseract
from PIL import ImageEnhance, ImageFilter, Image
import sys
from pdf2image import convert_from_path
import os
from google.colab import files
from tqdm import tqdm

Reading package lists... Done
Building dependency tree       
Reading state information... Done
tesseract-ocr is already the newest version (4.00~git2288-10f4998a-2).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
libtesseract-dev is already the newest version (4.00~git2288-10f4998a-2).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
poppler-utils is already the newest version (0.62.0-2ubuntu2.12).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use

### Optional: Change runtime type from CPU to GPU.
If you are processing a long multipage PDF or multiple PDFs, you may want more computational power. A GPU or TPU runtime may process faster, though the OCR step itself runs mainly on the CPU, so the gain may be modest. Heavy GPU usage may also require you to upgrade your account. 
You can do this from the Runtime dropdown menu in the top bar and select 'Change Runtime type'. 
The cell below confirms your change. 

In [None]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 10869559926890696348
 xla_global_id: -1, name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 14444920832
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 11636129454917731277
 physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"
 xla_global_id: 416903419]

## Step 3: Gather the PDF files and create the OsCaRizer function. 
The first cell below locates any PDF files you have placed in the working directory. 
The second cell creates the function that performs two actions:

1.   creating an image from each page of the PDF
2.   then rendering that image into a .txt file with the same name as the original input file.

 

In [None]:
"""
This should find all PDF files in working dir. 
HOWEVER, it will not find those nested in subdirectories
"""
path_of_the_directory = os.getcwd()
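ext = '.pdf'
pdf_list = []
# the loop variable is named file_name so that it does not overwrite
# the `files` module imported from google.colab
for file_name in os.listdir(path_of_the_directory):
    if file_name.endswith(ext):
        print(file_name)
        pdf_list.append(file_name)
print('done - move on to next cell')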



dalhousiegazette_volume54_issue20_december_6_1922.pdf
dalhousiegazette_volume54_issue21_january_10_1923.pdf
dalhousiegazette_volume54_issue22_january_17_1923.pdf
done - move on to next cell
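The listing cell above only looks at the top level of the working directory. If your PDFs sit in subfolders, a recursive search is one option. The sketch below is an assumption rather than part of the original workflow: `find_pdfs_recursive` is a hypothetical helper, and it has not been tried together with the OsCaRizer function.

```python
import os


def find_pdfs_recursive(top_dir):
    """Return sorted paths (relative to top_dir) of every .pdf below top_dir."""
    found = []
    for folder, _subfolders, filenames in os.walk(top_dir):
        for name in filenames:
            # lower() so that files ending in .PDF are also matched
            if name.lower().endswith(".pdf"):
                full_path = os.path.join(folder, name)
                found.append(os.path.relpath(full_path, top_dir))
    return sorted(found)


print(find_pdfs_recursive(os.getcwd()))
```

Unlike the cell above, this version also matches uppercase extensions such as `.PDF`.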


In [None]:
"""
create function to run in a loop
through the pdf list

"""
def OsCaRizer(pdf_file):

  # Build the full path of the PDF from the working directory
  PDF_file = os.path.join(path_of_the_directory, pdf_file)

  #suppress warnings about image size from PIL
  Image.MAX_IMAGE_PIXELS = None   # disables the warning
    
  # Store all the pages of the PDF in a variable
  pages = convert_from_path(PDF_file, 500)
    
  # Counter to store images of each page of PDF to image
  image_counter = 1
    
  # Iterate through all the pages stored above
  for page in pages:
    
      # Declaring filename for each page of PDF as JPG
      # For each page, filename will be:
      # PDF page 1 -> page_1.jpg
      # PDF page 2 -> page_2.jpg
      # PDF page 3 -> page_3.jpg
      # ....
      # PDF page n -> page_n.jpg
      filename = "page_"+str(image_counter)+".jpg"
        
      # Save the image of the page in system
      page.save(filename, 'JPEG')
    
      # Increment the counter to update filename
      image_counter = image_counter + 1

  # process the image files to text

  # Variable to get count of total number of pages
  filelimit = image_counter - 1

  # Create a text file named after the PDF to hold the output
  outfile = pdf_file.rsplit(".", 1)[0] + '.txt'

  # Open the file in append mode so that the text of every page
  # is added to the same file; the with-block closes it when done.
  # Note: re-running on the same PDF appends to an existing .txt file.
  with open(outfile, "a") as out:

      # Iterate from 1 to total number of pages
      for i in range(1, filelimit + 1):

          # The files to recognize text from are
          # page_1.jpg, page_2.jpg, ..., page_n.jpg
          filename = "page_" + str(i) + ".jpg"

          # Recognize the text in the image as a string using pytesseract
          text = str(pytesseract.image_to_string(Image.open(filename)))

          # Any string processing may be applied to text here.
          # In many PDFs, a word that does not fit at the end of a line
          # is split with a hyphen, e.g. "GeeksF-" then "orGeeks".
          # To rejoin such words, we replace every '-\n' with ''.
          text = text.replace('-\n', '')

          # Finally, write the processed text to the file.
          out.write(text)
  """
  clean out page image files before next instance of loop
  """

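  # only the temporary page_<n>.jpg files are removed, so a PDF or .txt
  # that happens to have "page" in its name is left alone
  for name in os.listdir(path_of_the_directory):
    if name.startswith("page_") and name.endswith(".jpg"):
      os.remove(os.path.join(path_of_the_directory, name))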

  print('done - you should see a .txt file named after your .pdf file in your working directory')


## Step 4: Run the OsCaRizer function.
Each multipage PDF can take 3-5 minutes. As an example, a 7-page PDF takes about 4 minutes, so three such PDFs can take 12-15 minutes. 

You can interrupt the process at any time; just be aware that it takes a while. 
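Because each PDF takes several minutes, you may prefer to skip PDFs that already have a .txt file when you restart after an interruption. The sketch below is one possible approach, not part of the original notebook; `pending_pdfs` is a hypothetical helper that assumes each output file shares its name with the PDF.

```python
import os


def pending_pdfs(pdf_names, directory="."):
    """Return the PDFs in pdf_names that have no matching .txt in directory."""
    pending = []
    for pdf_name in pdf_names:
        # same naming rule as the notebook: swap the extension for .txt
        txt_name = pdf_name.rsplit(".", 1)[0] + ".txt"
        if not os.path.exists(os.path.join(directory, txt_name)):
            pending.append(pdf_name)
    return pending


print(pending_pdfs(["issue20.pdf", "issue21.pdf"]))
```

A PDF that was interrupted midway may already have a partial .txt file and would be skipped, so delete that file before restarting.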

In [None]:
for pdf_file in tqdm(pdf_list):
    OsCaRizer(pdf_file)
    print('the file ' + pdf_file + ' is complete')


 33%|███▎      | 1/3 [04:42<09:24, 282.14s/it]

done - you should see a .txt file named after your .pdf file in your working directory
the file dalhousiegazette_volume54_issue20_december_6_1922.pdf is complete


 67%|██████▋   | 2/3 [10:12<05:10, 310.47s/it]

done - you should see a .txt file named after your .pdf file in your working directory
the file dalhousiegazette_volume54_issue21_january_10_1923.pdf is complete


100%|██████████| 3/3 [15:12<00:00, 304.23s/it]

done - you should see a .txt file named after your .pdf file in your working directory
the file dalhousiegazette_volume54_issue22_january_17_1923.pdf is complete





### Woo-hoo! You should see a successfully created .txt file in your folder. 

You can download the .txt file to read in a text editor to check the quality. 
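Besides reading the file yourself, a rough numeric check can flag pages where the OCR struggled. The sketch below is only a heuristic under stated assumptions: it treats a token as a word if it contains only unaccented English letters (optionally followed by one punctuation mark), and `word_like_ratio` is a hypothetical helper name.

```python
import re


def word_like_ratio(text):
    """Return the share of whitespace-separated tokens that look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    # A token counts as word-like if it is letters only, optionally
    # followed by one punctuation mark such as a comma or full stop
    pattern = re.compile(r"^[A-Za-z]+[.,;:!?]?$")
    word_like = sum(1 for token in tokens if pattern.match(token))
    return word_like / len(tokens)


sample = "The Gazette was pub1ished ~~ weekly by the students."
print(round(word_like_ratio(sample), 2))
```

A low ratio suggests a poor scan or an unusual typeface. Numbers, dates, and hyphenated words also lower the score, so treat it as a hint rather than a measurement.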

## That's it!
You get the following from this notebook, which can be used for further analysis:
- The .txt file, which contains the text as interpreted by the OCR engine, unmodified apart from the rejoining of words hyphenated across line breaks. It preserves the original structure and may be useful in certain analyses. It has the same name as the original PDF file so that it can be retrieved later if needed. 
<br>
<br>
Further analysis can compare the terms found in this document with those in others and rank them by importance, for example with TF-IDF (see our Term Frequency and TFIDF notebook). Alternatively, the documents can be analyzed for topics, and those topics compared against other documents, using LDA with spaCy and Gensim (see our LDA notebook). 
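As a small taste of term-based analysis, the sketch below counts raw word frequencies using only the standard library. It is an illustration under simple assumptions (lowercased, unaccented English letters only) and is not the method used in the notebooks mentioned above; TF-IDF would additionally weight each term by how rare it is across documents. `term_frequencies` is a hypothetical helper name.

```python
from collections import Counter
import re


def term_frequencies(text, top_n=5):
    """Return the top_n most common lowercase words with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(top_n)


sample = "The students met. The students voted, and the council agreed."
print(term_frequencies(sample, top_n=2))
# prints [('the', 3), ('students', 2)]
```

Common words such as "the" dominate raw counts, which is one reason weighting schemes like TF-IDF are used.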

## **If you are ready to process another pdf** 
You should MOVE the .txt file you just created, along with the .pdf file, to a safe place (such as a different Google Drive directory), leaving your working directory empty. 
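If you would rather do this move from a cell than by hand, the sketch below shows one way. It is an assumption, not part of the original notebook: `archive_outputs` is a hypothetical helper, and the archive folder in the usage comment is only an example name.

```python
import os
import shutil


def archive_outputs(working_dir, archive_dir):
    """Move every .pdf and .txt file from working_dir into archive_dir."""
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(working_dir)):
        source = os.path.join(working_dir, name)
        # only plain files are moved; subfolders are left in place
        if os.path.isfile(source) and name.lower().endswith((".pdf", ".txt")):
            shutil.move(source, os.path.join(archive_dir, name))
            moved.append(name)
    return moved


# Example call (the folder name is hypothetical):
# archive_outputs(os.getcwd(), "/content/drive/My Drive/Colab Notebooks/OCR_Archive/")
```

The function returns the names it moved, so you can check that nothing was left behind.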

The image files are deleted after each processing loop, though the Colab file explorer takes a few seconds to update sometimes. 

**Download the next PDF file(s) from GitHub** (or wherever you may have them), and place the new PDF or PDFs in your working directory. 
