##### Disclaimer: This project wasn't initially made to be flexible, so we didn't define functions/classes.

### Prep Google Cloud Vision stuff
Make sure the Vision API key file, `vision_key.json`, is in the same folder as this notebook. Importing `os` lets us work with environment variables: we set `GOOGLE_APPLICATION_CREDENTIALS` to the path of `vision_key.json`, which the Vision client reads when it is created.

In [1]:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] ='vision_key.json'
from google.cloud import vision
vision_client = vision.ImageAnnotatorClient()
image = vision.Image()
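
A slightly more defensive version of the setup above is sketched below. It resolves the key file to an absolute path and fails early if the file is missing, instead of waiting for the client to complain. The helper name `set_vision_credentials` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path

def set_vision_credentials(key_file='vision_key.json'):
    """Point GOOGLE_APPLICATION_CREDENTIALS at key_file (hypothetical helper)."""
    key_path = Path(key_file).resolve()
    if not key_path.is_file():
        raise FileNotFoundError(f'Service account key not found: {key_path}')
    # The variable holds the *path* to the key file, not its contents.
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = str(key_path)
    return key_path
```

Calling `set_vision_credentials()` before creating `vision.ImageAnnotatorClient()` would then replace the manual `os.environ` assignment.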

### Extract text from images
Prepare each page of readings `A` to `F` as a separate image whose file name is a positive integer (`1.jpg`, `2.jpg`, and so on). The images are then filed under folders `A` to `F`.<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1) The text of each page is extracted using `vision` methods and saved as a string in `pages`. Each reading's `pages` list is then appended to `references`. <a name="cite_ref-2"></a>[<sup>[2]</sup>](#cite_note-2)

In [None]:
import re 
import io
refnames = ['A','B','C','D','E','F']
references = []
for refname in refnames:
    images = r'C:\Users\johns\Documents\GitHub\image-text-extraction\images'
    images += '\\' + refname
    
    files = os.listdir(images) 
    image_count = len(files)  # assumes the folder holds only the page images
    
    pages = []
    for image in range(1,image_count+1):
        image_path = r'C:\Users\johns\Documents\GitHub\image-text-extraction\images' 
        image_path += '\\' + refname + '\\' + str(image) + '.jpg'

        with io.open(image_path, 'rb') as image_file:
            content = image_file.read()

        image = vision.Image(content=content)

        response = vision_client.text_detection(image=image)

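        # text_annotations is empty when no text is detected on the page
        annotations = response.text_annotations
        text = annotations[0].description if annotations else ''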
    
        pages.append(text)
    references.append(pages)
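
As an alternative to counting files and rebuilding each path by hand, the page images could be listed and ordered directly. The sketch below assumes every page is named `<integer>.jpg`, as described above; the helper name `list_page_images` is hypothetical.

```python
from pathlib import Path

def list_page_images(folder, suffix='.jpg'):
    """Return page image paths in `folder`, ordered by their integer file names."""
    pages = [p for p in Path(folder).iterdir()
             if p.suffix.lower() == suffix and p.stem.isdigit()]
    # Sort numerically so that 10.jpg comes after 9.jpg, not after 1.jpg.
    return sorted(pages, key=lambda p: int(p.stem))
```

Each returned path can be opened with `open(path, 'rb')` and passed to `vision.Image(content=...)` exactly as in the cell above. Stray files in the folder (anything that is not an integer-named `.jpg`) are skipped rather than counted as pages.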

Here, we simply join the pages of each reference into a single string and save it as a text file. <a name="cite_ref-3"></a>[<sup>[3]</sup>](#cite_note-3)

<b>Note to self</b>: Don't run the cell below. The extraction code above currently fails because of expired billing, so running it would overwrite the presaved text files.

In [None]:
'''
ref_final = []
for ref in references:
    oneString = ''
    for page in ref:
        oneString += page
    ref_final.append(oneString)
    
for index in range(0,len(refnames)):
    dirref = r'C:\Users\johns\Documents\GitHub\image-text-extraction\textfiles'
    dirref += '\\' + refnames[index] + '.txt'
    with open(dirref, 'w',encoding="utf-8") as f:
        f.write(ref_final[index].replace("\n", " "))
'''
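
The page-joining step can also be written with `str.join`, which avoids building the string one `+=` at a time. This is a minimal sketch on made-up sample data; `join_pages` and `sample_references` are hypothetical names, and nothing is written to disk.

```python
def join_pages(pages):
    """Concatenate page strings and replace newlines with spaces."""
    return ''.join(pages).replace('\n', ' ')

# Made-up sample data standing in for the real `references` list.
sample_references = [['first\npage ', 'second page'], ['only page']]
ref_final = [join_pages(pages) for pages in sample_references]
print(ref_final)  # ['first page second page', 'only page']
```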

### Comparing extracted files
1. Encode the strings to match in the text file `keys.txt`, one string per line. <a name="cite_ref-4"></a>[<sup>[4]</sup>](#cite_note-4)
2. Strip spaces and lowercase both sides before comparing strings (gets rid of possible hiccups).
3. We will probably not get an exact match either, since the professor may have typed the passages differently (e.g. extra punctuation, especially quotes and ellipses), so we compare only a fraction of each key string, taken <b>from its center</b>.
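
The steps above can be sketched on a toy example. The helper names `normalize` and `center_slice` are hypothetical, and the sample sentences are made up; the slice arithmetic mirrors the `tolerance = 0.2` logic used in the comparison cell.

```python
def normalize(text):
    """Drop spaces and lowercase the text, as in step 2."""
    return text.replace(' ', '').lower()

def center_slice(key, tolerance=0.2):
    """Return the middle `tolerance` fraction of `key`."""
    half = len(key) / 2
    start = int(half - (tolerance / 2) * len(key))
    last = int(half + (tolerance / 2) * len(key))
    return key[start:last]

# The key has extra quotes and an ellipsis that the reference lacks.
key = normalize('"The quick brown fox jumps over the lazy dog..."')
reference = normalize('He wrote: the quick brown fox jumps over the lazy dog.')
fragment = center_slice(key)
print(fragment, fragment in reference)  # xjumpsov True
```

The full key would not match because of the surrounding punctuation, but its central 20% does. A shorter fragment tolerates more differences at the edges, at the cost of more accidental matches.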

In [2]:
#we just use this for sorting the answers from first to last by order of keys
def recsort(list1,length,list2=None):
    if list2 is None:
        for index in range(0,length-1):
            if list1[index] > list1[index+1]:
                list1[index],list1[index+1] = list1[index+1],list1[index]
        if length != 0:
            recsort(list1,length-1)
    elif len(list2)!=len(list1):
        print("Invalid list lengths. Multiple lists should have the same length.")
    else:
        for index in range(0,length-1):
            if list1[index] > list1[index+1]:
                list1[index],list1[index+1] = list1[index+1],list1[index]
                list2[index],list2[index+1] = list2[index+1],list2[index]
        if length != 0:
            recsort(list1,length-1,list2)
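
`recsort` is a recursive bubble sort, so its recursion depth grows with the list length. For reference, the same paired ordering can be obtained with the built-in `sorted`. This is a sketch on made-up sample data, not the code that produced the results below.

```python
sample_index = [7, 1, 5]        # made-up sample data
sample_key = ['B', 'D', 'F']

# Sort the (index, key) pairs by index only; sorted() is stable,
# so pairs with equal indices keep their original order.
pairs = sorted(zip(sample_index, sample_key), key=lambda pair: pair[0])
sorted_index = [index for index, _ in pairs]
sorted_key = [key for _, key in pairs]
print(sorted_index, sorted_key)  # [1, 5, 7] ['D', 'F', 'B']
```

Unlike `recsort`, this builds new lists instead of sorting in place.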

In [5]:
refstrings = []

for refname in refnames:
    dirref = r'C:\Users\johns\Documents\GitHub\image-text-extraction\textfiles'
    dirref += '\\' + refname + '.txt'
    with open(dirref, 'r',encoding="utf-8") as f:
        refstrings.append(f.read())

In [6]:
keys = []
with open(r'C:\Users\johns\Documents\GitHub\image-text-extraction\textfiles\keys.txt', encoding = 'utf-8') as f:
    for row in f:
        keys.append(row)
        
import numpy as np

#cleaning of strings
new_refstrings = []
for refstring in refstrings:
    new_refstrings.append(refstring.replace(' ','').lower())
new_keys = []
for key in keys:
    new_keys.append(key.replace(' ','').lower())
    
#comparing of strings; very makeshift heuristic solution by comparing inner text of strings
tolerance = 0.2
ref_count = 0
answerkey,answerindex = [],[]
for refstring in new_refstrings:
    for index in range(0,len(keys)):
        start = int(len(new_keys[index])/2 - (tolerance/2)*len(new_keys[index]))
        last = int(len(new_keys[index])/2 + (tolerance/2)*len(new_keys[index]))
        if new_keys[index][start:last] in refstring:
            #print(new_keys[index][start:last]) #just useful for seeing how our key strings look
            answerkey.append(refnames[ref_count])
            answerindex.append(index+1)
    ref_count += 1

#just for sorting/beautifying answers
recsort(answerindex,len(answerindex),answerkey)

counter = 0
for i in answerkey:
    print(str(answerindex[counter])+": "+i)
    counter += 1
print("Total: ", (counter/85)*100, "%")

1: D
2: F
5: F
6: E
7: B
8: B
10: F
11: E
13: D
15: E
16: F
17: F
18: D
19: E
20: E
21: F
23: F
24: F
25: F
26: E
28: E
30: F
32: D
34: E
36: E
38: F
39: F
40: B
46: B
49: E
50: B
53: E
56: F
57: F
60: F
62: B
63: F
64: B
65: B
66: F
67: E
68: F
69: F
73: F
74: F
75: F
78: C
79: F
80: C
82: F
83: F
Total:  60.0 %


#### Footnotes:
<a name="cite_note-1"></a>[[1]](#cite_ref-1) Using absolute paths for file directories is VERY clunky, but the program tends to throw an error otherwise. Decided not to figure this out.<br>
<a name="cite_note-2"></a>[[2]](#cite_ref-2) Cell currently throws the error `PermissionDenied: 403 This API method requires billing to be enabled.` because of my expired billing account. <br>
<a name="cite_note-3"></a>[[3]](#cite_ref-3) Just because it's easier to think about. Might also help speed up the process (?) since there's only a single string per reference to work with. <br>
<a name="cite_note-4"></a>[[4]](#cite_ref-4) I did this manually...