# Processing the data from the book
I want to extract the text of the Cain's Jawbone paperback into text files so that I can use CTRL+F as I work through the pages and navigate my notes more easily. The plan is to photograph each page of the book, store the photos as `.jpg` files in a dedicated folder, and then use OCR to extract the text from them.

I will do this with the `pytesseract` library, largely following code from this [GeeksForGeeks](https://www.geeksforgeeks.org/how-to-extract-text-from-images-with-python/) tutorial.



In [9]:
# %pip install pytesseract
# %pip install pillow
%pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
     ------------------------------------- 250.0/250.0 kB 16.0 MB/s eta 0:00:00
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2
Note: you may need to restart the kernel to use updated packages.




In [2]:
from PIL import Image 
from pytesseract import pytesseract 
  
# Defining paths to tesseract.exe 
# and the image we would be using 
path_to_tesseract = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image_path = r"raw\archive\example_image.png"
  
# Opening the image & storing it in an image object 
img = Image.open(image_path) 
  
# Providing the tesseract executable 
# location to pytesseract library 
pytesseract.tesseract_cmd = path_to_tesseract 
  
# Passing the image object to image_to_string() function 
# This function will extract the text from the image 
text = pytesseract.image_to_string(img) 
  
# Displaying the extracted text 
print(text[:-1])

now children state should after above same long made such

point run take call together few being would walk give


In [3]:
import os
base_path = r"C:\Users\domjd\OneDrive\Documents\Projects\Cains Jawbone\data"
images = os.listdir(fr"{base_path}\raw")
images.remove('archive')

print(images)

['page_1.jpg', 'page_10.jpg', 'page_100.jpg', 'page_11.jpg', 'page_12.jpg', 'page_13.jpg', 'page_14.jpg', 'page_15.jpg', 'page_16.jpg', 'page_17.jpg', 'page_18.jpg', 'page_19.jpg', 'page_2.jpg', 'page_20.jpg', 'page_21.jpg', 'page_22.jpg', 'page_23.jpg', 'page_24.jpg', 'page_25.jpg', 'page_26.jpg', 'page_27.jpg', 'page_28.jpg', 'page_29.jpg', 'page_3.jpg', 'page_30.jpg', 'page_31.jpg', 'page_32.jpg', 'page_33.jpg', 'page_34.jpg', 'page_35.jpg', 'page_36.jpg', 'page_37.jpg', 'page_38.jpg', 'page_39.jpg', 'page_4.jpg', 'page_40.jpg', 'page_41.jpg', 'page_42.jpg', 'page_43.jpg', 'page_44.jpg', 'page_45.jpg', 'page_46.jpg', 'page_47.jpg', 'page_48.jpg', 'page_49.jpg', 'page_5.jpg', 'page_50.jpg', 'page_51.jpg', 'page_52.jpg', 'page_53.jpg', 'page_54.jpg', 'page_55.jpg', 'page_56.jpg', 'page_57.jpg', 'page_58.jpg', 'page_59.jpg', 'page_6.jpg', 'page_60.jpg', 'page_61.jpg', 'page_62.jpg', 'page_63.jpg', 'page_64.jpg', 'page_65.jpg', 'page_66.jpg', 'page_67.jpg', 'page_68.jpg', 'page_69.jpg',
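The listing above is in string order, so `page_10.jpg` and `page_100.jpg` come before `page_2.jpg`. If the pages need to be handled in numeric order, one option is to sort on the number embedded in each filename. This is a minimal sketch, not part of the original notebook: `page_number` is a hypothetical helper, and it assumes every file follows the `page_<n>.jpg` pattern.

```python
import re

def page_number(filename):
    """Return the integer in a name like 'page_12.jpg' (hypothetical helper)."""
    match = re.search(r"(\d+)", filename)
    if match is None:
        raise ValueError(f"No page number found in {filename!r}")
    return int(match.group(1))

# A few names taken from the listing above, in the order they were returned
sample = ['page_1.jpg', 'page_10.jpg', 'page_100.jpg', 'page_11.jpg', 'page_2.jpg']

# Sort by the numeric part instead of the raw string
sorted_sample = sorted(sample, key=page_number)
print(sorted_sample)
# ['page_1.jpg', 'page_2.jpg', 'page_10.jpg', 'page_11.jpg', 'page_100.jpg']
```

In the notebook the same key could be applied to the full list with `images.sort(key=page_number)`.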

In [13]:
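import numpy as np 
# %pip install opencv-python

import cv2
def process_image_for_text(path_to_image):
    """Preprocess a page photo with OpenCV and return the text Tesseract finds."""
    img = cv2.imread(path_to_image)
    # cv2.imread returns None (rather than raising) if the file cannot be read
    if img is None:
        raise FileNotFoundError(f"Could not read image: {path_to_image}")

    # Grayscale, sharpen, then binarise with Otsu's automatic threshold
    # (THRESH_BINARY_INV inverts the result: dark text becomes white on black)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    sharpen_kernel = np.array([[-1,-1,-1], [-1,9,-1], [-1,-1,-1]])
    sharpen = cv2.filter2D(gray, -1, sharpen_kernel)
    _, thresh = cv2.threshold(sharpen, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Further noise removal using morphological operations
    # (a 1x1 kernel leaves the image unchanged; a larger kernel is needed for any effect)
    kernel = np.ones((1, 1), np.uint8)
    processed_img = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
    
    # Save the processed image to a temporary file
    temp_filename = "temp_image.png"
    cv2.imwrite(temp_filename, processed_img)

    # Use Tesseract to extract text
    # (--oem 3: default OCR engine, --psm 6: treat the page as a single uniform block of text)
    custom_config = r'--oem 3 --psm 6'
    text = pytesseract.image_to_string(Image.open(temp_filename), lang='eng', config=custom_config)
        
    return text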

In [14]:
# Test function on page 1
test_text = process_image_for_text(r'raw\page_1.jpg')  # raw string: '\p' is not a valid escape sequence
print(test_text)

1 sit down alone at the appointed table and
take up my pen to give all whom it may con-
ce cern an exact account of what may happen. © ae
Call me nervous, call me fey, if you will; atleast .
this little pen, this mottled black and silver |
Aquarius, with its nib specially tempered tomy |
 erder in Amsterdam, is greedy. It has not had
| a much work since it flew so nimbly for the dead
old As I watch the sea, Casy Ferris passes
with down-dropped eyes. Of course, to-day is oe
a the day. Her father reminds me of a valetudi-
 narian walrus. But she has, I suppose, to have _
somebody. St. Lazarus-in-the-Chine is full,
so doubt, already. I think she is rash ; but it
__ is none of my business. Where about the graves
___ of the martyr s the whaups are crying, my heart
__temembers how. Strange that he comes into” Hes
__ My head so much to-day. I hope it’s over some
| pee. But all the nice gulls love a sailor Ugh,
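The page 1 output above shows recurring noise at the line edges (stray `|`, `_` and spaces), and the next cell reads `.txt` files from a `processed` folder. One way to get from the page images to those files is sketched below. It is an assumption rather than the code actually used for this project: `strip_margin_noise` and `ocr_folder_to_text` are hypothetical names, and the OCR step is passed in as a function so that `process_image_for_text` can be supplied in the notebook.

```python
import os

def strip_margin_noise(text):
    """Trim stray '|', '_' and whitespace from both ends of each line."""
    cleaned = [line.strip(" |_") for line in text.splitlines()]
    return "\n".join(cleaned)

def ocr_folder_to_text(image_dir, output_dir, ocr_function):
    """Run ocr_function on every .jpg in image_dir and save one .txt per page.

    ocr_function takes an image path and returns the extracted text.
    Returns the list of text file paths that were written.
    """
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for filename in sorted(os.listdir(image_dir)):
        # Skip anything that is not a .jpg, including sub-folders such as 'archive'
        if not filename.lower().endswith(".jpg"):
            continue
        text = ocr_function(os.path.join(image_dir, filename))
        output_name = os.path.splitext(filename)[0] + ".txt"
        output_path = os.path.join(output_dir, output_name)
        with open(output_path, "w", encoding="utf-8") as file:
            file.write(strip_margin_noise(text))
        written.append(output_path)
    return written
```

In the notebook this might be called as `ocr_folder_to_text(os.path.join(base_path, "raw"), os.path.join(base_path, "processed"), process_image_for_text)`. The files are written as UTF-8, so they should be read back with `encoding='utf-8'`.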



In [10]:
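# Collect the processed text files into an Excel sheet
import pandas as pd
import openpyxl

# Specify the directory where your .txt files are located
# (os.path.join avoids the invalid '\p' escape in the path string)
directory_path = os.path.join(base_path, "processed")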

# Collect one row per file, then build the DataFrame in a single step
# (DataFrame.append is deprecated and was removed in pandas 2.0)
rows = []

# Loop through each file in the directory
for filename in os.listdir(directory_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory_path, filename)
        
        # Read the content of the file
        with open(file_path, 'r') as file:  # , encoding='utf-8'
            content = file.read()
        
        # Store the data for this file
        rows.append({'File Name': filename, 'Content': content})

df = pd.DataFrame(rows, columns=['File Name', 'Content'])

# Save the DataFrame to an Excel file
excel_file_path = r'C:\Users\domjd\OneDrive\Documents\Projects\Cains Jawbone\notes.xlsx'
sheet_name = 'text_files'

# If the workbook already exists, open it in append mode and replace only this
# sheet so that any other sheets are preserved; the default write mode would
# overwrite the whole file. (if_sheet_exists requires pandas 1.3 or later.)
if os.path.exists(excel_file_path):
    writer_options = {'mode': 'a', 'if_sheet_exists': 'replace'}
else:
    writer_options = {'mode': 'w'}

with pd.ExcelWriter(excel_file_path, engine='openpyxl', **writer_options) as writer:
    df.to_excel(writer, sheet_name=sheet_name, index=False)

print(f"Excel sheet '{sheet_name}' updated in file: {excel_file_path}")



Excel sheet 'text_files' updated in file: C:\Users\domjd\OneDrive\Documents\Projects\Cains Jawbone\notes.xlsx

