### Convert PDF to Image

If you are using Anaconda like me ([conda cheat sheet](https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf)):
- Use Anaconda Navigator to create a new virtual environment
- Run `conda env list` in a terminal to list all available environments
- Run `conda activate <env name>` to activate the environment
- Run `conda env list` again to confirm the environment is activated (the active one is marked with `*`)

Install dependencies:
- `pip install ipykernel`
- `pip install pdf2image`
- Download [poppler](https://github.com/oschwartz10612/poppler-windows/releases/) and unzip it, e.g. to `Downloads\poppler-XXX`; the `Library\bin` folder inside it is what `poppler_path` must point to
- `pip install matplotlib`
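
Before running the conversion cell, it can help to check that the dependencies above are actually visible from the active environment. The sketch below is only an illustration: `missing_modules` and `poppler_on_path` are hypothetical helper names, and the check assumes poppler exposes its usual `pdftoppm` executable. If poppler is not on `PATH`, passing `poppler_path` explicitly (as done in the next cell) still works.

```python
import importlib.util
import shutil


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def poppler_on_path():
    """Return True if poppler's pdftoppm executable is discoverable on PATH."""
    return shutil.which("pdftoppm") is not None


# Hypothetical usage: report what still needs installing in this environment.
print("missing modules:", missing_modules(["pdf2image", "matplotlib"]))
print("poppler on PATH:", poppler_on_path())
```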

Section specific references:
- [PDF Parsing](https://www.ismailmebsout.com/pdfs-parsing/)
- [Convert PDF to Image using Python](https://www.geeksforgeeks.org/convert-pdf-to-image-using-python/)
- [Poppler in path for pdf2image](https://stackoverflow.com/questions/53481088/poppler-in-path-for-pdf2image)
- [Unable to get page count. Is poppler installed in PATH?](https://github.com/Belval/pdf2image/issues/142)

In [None]:
# import packages
import matplotlib.pyplot as plt
from pdf2image import convert_from_path
from pdf2image.exceptions import (PDFInfoNotInstalledError, PDFPageCountError, PDFSyntaxError)

# read in and convert pdf to image
sample_filepath = "C:\\Users\\20jam\\Documents\\always-in-progress\\DSA3101 Data Science in Practice\\project - personal attempts\\original-data\\Tutorial 01\\ST2131_Tut1_T05_done.pdf" # change this
pages = convert_from_path(sample_filepath, poppler_path=r'C:\Users\20jam\Downloads\poppler-22.04.0\Library\bin') # change this

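# visualize page 0 (print(pages[0]) would only show the PIL object's repr)
plt.imshow(pages[0])
plt.axis('off')
plt.show()

# save pages (the modified-data folder must already exist)
for i, page in enumerate(pages):
    # forward slash avoids the invalid '\p' escape and also works on Windows
    page.save(f'modified-data/page{i}.jpg', 'JPEG')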

### Split Document into Handwritten & Typed Parts

Install dependencies:
- `pip install opencv-python` (this is the package that provides `import cv2`)
- `pip install pandas numpy`
- `pip install pandasql`
- `pip install ipython`

Section specific references:
- [Printed and handwritten text extraction from images using Tesseract and Google Cloud Vision API](https://medium.com/@derrickfwang/printed-and-handwritten-text-extraction-from-images-using-tesseract-and-google-cloud-vision-api-ac059b62a535)
- [HandwritingRecognition_GoogleCloudVision](https://github.com/DerrickFeiWang/HandwritingRecognition_GoogleCloudVision/blob/master/OCR_Printed%20and%20handwritten%20text%20extraction%20from%20images%20using%20Tesseract%20and%20Google%20Cloud%20Vision%20API_20200805.ipynb)

In [None]:
# import packages
import os, cv2
import pandas as pd
import pandasql as ps
from IPython.display import Image

# read in images
os.chdir(r'C:\Users\....') # change to folder path containing images
fileList = [x for x in os.listdir() if 'jpg'  in x.lower()]
print(fileList[:5])
Image(filename = fileList[0], width = 300)

In [None]:
# page segmentation
def findHorizontalLines(img):
    img = cv2.imread(img)
    # convert image to greyscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # binarize and invert so ink becomes white; with THRESH_OTSU the threshold
    # is chosen automatically and the value 30 is ignored
    thresh = cv2.threshold(gray, 30, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    # define the rectangular structuring element (line) to look for: width 200, height 1
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (200, 1))
    # find horizontal lines: opening keeps only strokes at least as wide as the kernel
    lineLocations = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=1)
    return lineLocations

img = fileList[0]
lineLocations = findHorizontalLines(img)
plt.figure(figsize=(24,24))
plt.imshow(lineLocations, cmap='Greys')
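
To make the morphological opening step less of a black box, here is a minimal pure-Python sketch of what opening with a `1 x k` horizontal kernel does to a binary image: it keeps only the pixels that belong to a horizontal run of at least `k` foreground pixels. `keep_long_horizontal_runs` is a hypothetical helper written for illustration, not an OpenCV function, and it ignores OpenCV's border handling, so results at the image edges may differ from `cv2.morphologyEx`.

```python
def keep_long_horizontal_runs(binary_rows, min_len):
    """Keep foreground pixels (1) only where they form a horizontal run
    of at least `min_len` pixels; everything else becomes 0.
    Illustrative stand-in for opening with a 1 x min_len kernel."""
    result = []
    for row in binary_rows:
        out = [0] * len(row)
        run_start = None
        # A trailing 0 acts as a sentinel so a run touching the right edge is closed.
        for x, value in enumerate(list(row) + [0]):
            if value and run_start is None:
                run_start = x
            elif not value and run_start is not None:
                if x - run_start >= min_len:
                    out[run_start:x] = [1] * (x - run_start)
                run_start = None
        result.append(out)
    return result


page = [
    [0, 1, 1, 0, 0, 0, 0, 0],  # short stroke of text: removed
    [1, 1, 1, 1, 1, 1, 1, 0],  # long ruled line: kept
    [0, 0, 1, 0, 1, 1, 1, 0],  # scattered strokes: removed
]
lines = keep_long_horizontal_runs(page, min_len=5)
row_sums = [sum(row) for row in lines]  # non-zero only on the row holding a line
```

Summing each row of the result gives a non-zero value only where a line was found, which is the same idea as `lineLocations.sum(axis=1)` in the next cell.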

In [None]:
# formatting lines
df_lineLocations = pd.DataFrame(lineLocations.sum(axis=1)).reset_index()
df_lineLocations.columns = ['rowLoc', 'LineLength']
df_lineLocations[df_lineLocations['LineLength'] > 0]

# flag rows that contain a detected line; .loc avoids pandas' chained-assignment warning
df_lineLocations['line'] = 0
df_lineLocations.loc[df_lineLocations['LineLength'] > 100, 'line'] = 1

df_lineLocations['cumSum'] = df_lineLocations['line'].cumsum()
df_lineLocations.head()


query = '''
select row_number() over (order by cumSum) as SegmentOrder
, min(rowLoc) as SegmentStart
, max(rowLoc) - min(rowLoc) as Height
from df_lineLocations
where line = 0
--and CumSum !=0
group by cumSum
'''
df_SegmentLocations  = ps.sqldf(query, locals())
df_SegmentLocations
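
The SQL query can be hard to read at first, so here is a small pure-Python sketch of the same grouping logic, assuming the input is just the list of 0/1 `line` flags, one per pixel row. `find_segments` is a hypothetical name used only for this illustration. As in the query, rows that are not lines are grouped by the running count of line rows seen so far, and `Height` is `max(rowLoc) - min(rowLoc)`.

```python
from itertools import groupby


def find_segments(line_flags):
    """Return (SegmentOrder, SegmentStart, Height) tuples for the stretches
    of rows between detected lines, mirroring the SQL query's grouping."""
    non_line_rows = []
    cum_sum = 0
    for row_loc, flag in enumerate(line_flags):
        cum_sum += flag                      # same as df['line'].cumsum()
        if flag == 0:                        # where line = 0
            non_line_rows.append((cum_sum, row_loc))

    segments = []
    grouped = groupby(non_line_rows, key=lambda item: item[0])  # group by cumSum
    for order, (_, group) in enumerate(grouped, start=1):
        locs = [row_loc for _, row_loc in group]
        segments.append((order, locs[0], locs[-1] - locs[0]))
    return segments


# Rows 3-4 and row 9 are lines; everything else is page content.
flags = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
segments_found = find_segments(flags)
```

Note that a line several pixels thick (rows 3 and 4 here) does not create an empty segment, because no non-line row falls between its rows.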

In [None]:
# crop image
def pageSegmentation1(img, w, df_SegmentLocations):
    img = cv2.imread(img) 
    im2 = img.copy()
    segments = []

    for i in range(len(df_SegmentLocations)):
        y = df_SegmentLocations['SegmentStart'][i]
        h = df_SegmentLocations['Height'][i]

        cropped = im2[y:y + h, 0:w] 
        segments.append(cropped)
        plt.figure(figsize=(8,8))
        plt.imshow(cropped)
        plt.title(str(i+1))        

    return segments

img = fileList[0]
w = lineLocations.shape[1]
segments = pageSegmentation1(img, w, df_SegmentLocations)

### Decode Typed Parts with Pytesseract

Install dependencies:
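- `pip install opencv-python pytesseract` (`re` is part of the Python standard library and needs no install)
- Install the Tesseract OCR engine itself; the code below expects it at `C:\Program Files\Tesseract-OCR\tesseract`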

Section specific references:
- [Printed and handwritten text extraction from images using Tesseract and Google Cloud Vision API](https://medium.com/@derrickfwang/printed-and-handwritten-text-extraction-from-images-using-tesseract-and-google-cloud-vision-api-ac059b62a535)
- [HandwritingRecognition_GoogleCloudVision](https://github.com/DerrickFeiWang/HandwritingRecognition_GoogleCloudVision/blob/master/OCR_Printed%20and%20handwritten%20text%20extraction%20from%20images%20using%20Tesseract%20and%20Google%20Cloud%20Vision%20API_20200805.ipynb)
- [Image Preprocessing for Pytesseract](https://www.youtube.com/watch?v=ADV-AjAXHdc)
- [OCR Python Textbook](https://github.com/wjbmattingly/ocr_python_textbook/blob/main/02_02_working%20with%20opencv.ipynb)

In [None]:
# import packages
import re
import cv2
import pytesseract
from pytesseract import Output

# tell pytesseract where the engine is installed
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

# extract text from an image segment (works on segments with two columns of content)
def extractTextFromImg(segment):
    text = pytesseract.image_to_string(segment, lang='eng')
    # drop any characters that cannot be represented in GBK
    text = text.encode("gbk", 'ignore').decode("gbk", "ignore")
    return text

# run OCR on two of the segments (preprocessing the images beforehand is optional)
segment = segments[1]
text = extractTextFromImg(segment)
print(text)
segment = segments[2]
text = extractTextFromImg(segment)
print(text)
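
Raw OCR output tends to contain stray whitespace, and Tesseract output often ends with a form-feed character. The sketch below shows one possible way to tidy the text with `re` before further processing. `tidy_ocr_text` is a hypothetical helper, and the exact cleaning rules are an assumption that should be adapted to the documents at hand.

```python
import re


def tidy_ocr_text(text):
    """Normalize whitespace in OCR output (illustrative rules only)."""
    text = text.replace("\x0c", "")         # remove form-feed page separators
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # allow at most one blank line in a row
    return text.strip()


sample = "Question  1 \n\n\n\nFind   P(A)\t now.\n\x0c"
cleaned = tidy_ocr_text(sample)
print(cleaned)
```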

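### Decode Handwritten Parts using Slicing & Breta

Section specific references:
- [SAHI: Slicing Aided Hyper Inference](https://github.com/obss/sahi)
- [Breta01/handwriting-ocr](https://github.com/Breta01/handwriting-ocr)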
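
The slicing idea is to cut a large page into overlapping windows so that small handwriting is processed at a workable scale. The sketch below only illustrates how such window coordinates could be computed; it is not SAHI's API, and `slice_boxes` with its parameters is a hypothetical helper. Each box could then be used to crop a page, e.g. `img[y0:y1, x0:x1]`, before passing the crop to a handwriting recognizer.

```python
def slice_boxes(width, height, slice_w, slice_h, overlap=0.25):
    """Return (x0, y0, x1, y1) boxes of overlapping windows covering an image.
    Windows at the right and bottom edges are shifted inward to stay in bounds."""
    step_x = max(1, int(slice_w * (1 - overlap)))
    step_y = max(1, int(slice_h * (1 - overlap)))
    boxes = []
    y = 0
    while True:
        y1 = min(y + slice_h, height)
        y0 = max(0, y1 - slice_h)
        x = 0
        while True:
            x1 = min(x + slice_w, width)
            x0 = max(0, x1 - slice_w)
            boxes.append((x0, y0, x1, y1))
            if x1 >= width:       # reached the right edge of the image
                break
            x += step_x
        if y1 >= height:          # reached the bottom edge of the image
            break
        y += step_y
    return boxes


# Hypothetical 1000 x 600 page cut into 512 x 512 windows with 25% overlap.
boxes = slice_boxes(1000, 600, 512, 512, overlap=0.25)
for box in boxes:
    print(box)
```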