<a href="https://colab.research.google.com/github/Rudelius/multi_pdf_search/blob/main/multi_pdf_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-PDF Searcher
An effective way to scan a large number of PDF files for information.

In [None]:
#@markdown # Tutorial
#@markdown Follow the steps A-C to use the program.
%%html
<div style="float:left;padding-right:20px;width:500px;height:520px">
  <h3>A. Uploading your files</h3>
  <p>
    1. Navigate to <a href="https://drive.google.com/drive/my-drive" target="_blank">drive.google.com/drive/my-drive</a>.<br>
    2. Press "New", create a folder, and name it <i>myfolder</i> or a name of your choice.<br>
    3. Upload your PDF files to the new folder.
  </p>
  <img src=https://i.ibb.co/m81ppJ1/Screenshot-2020-12-27-at-11-56-31.png width="500" style="vertical-align:top">
</div>
<div style="float:left;padding-right:20px;width:500px;height:520px">
  <h3>B. Loading your files</h3>
  <p>
    1. Move down to the next code block.<br>
    2. Replace the value of the <b>drive_folder</b> variable with the name of your folder, e.g. <i>myfolder</i>.<br>
    3. Hover over the code block with your mouse and click the "play" button to run.<br>
    4. Grant permission (see below), then paste the authorization code with "CTRL+V" and press "ENTER".<br>
    5. Wait for the code block to finish running (see the spinning animation).
  </p>
  <img src=https://i.ibb.co/cYVhNCX/Dec-27-2020-12-25-49.gif width="500" style="vertical-align:top;display:inline-block">
  <p>
    <br><b>P.S.</b><br>
    <em>Note that the program will only read and write to the folder specified.<br>
    Repeat this step if you want to load new files; you don't need to authorize again.<br>
    You can have multiple folders and just change the one you search in. </em>
  </p>
</div>
<div style="float:left;width:500px">
  <h3>C. Searching</h3>
  <p>
    1. Move down to the search code block.<br>
    2. Enter a word or sentence as the search phrase.<br>
    3. Select among the following options:
  </p>
  <table style="width:100%;border: 1px outset white;">
    <tr>
      <th>Option</th>
      <th>Description</th>
    </tr>
    <tr style ="background-color: gray">
      <td>case_sensitive</td>
      <td>Tick box if you want results to match the case in the search phrase.</td>
    </tr>
    <tr>
      <td>ignore_punctuation</td>
      <td>Tick box if you want to disregard apostrophes and punctuation in text when finding matches.</td>
    </tr>
    <tr style ="background-color: gray">
      <td>allow_partial_matches</td>
      <td>Tick box if you want to allow partial word matches, e.g. "pple computer" matching "Apple computers".</td>
    </tr>
    <tr>
      <td>output_sentence_word_limit</td>
      <td>Limit the number of words in the search result text. No limit if set to 0.</td>
    </tr>
  </table>
  <p>
    <br>
    4. Hover over the code block with your mouse and click the "play" button to run.
  </p>
  
</div>


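
The three tick-box options in the table above can be read as choices about how each word is normalized and compared. The sketch below is an illustration only, not the notebook's own code: `normalize` and `words_match` are hypothetical helper names, and it assumes that "punctuation" means the ASCII characters in Python's `string.punctuation`.

```python
import string


def normalize(word, case_sensitive=False, ignore_punctuation=False):
    """Return the form of `word` that is compared (hypothetical helper)."""
    if not case_sensitive:
        word = word.lower()
    if ignore_punctuation:
        # Assumption: only ASCII punctuation is stripped.
        word = word.translate(str.maketrans('', '', string.punctuation))
    return word


def words_match(search_word, text_word, case_sensitive=False,
                ignore_punctuation=False, allow_partial_matches=False):
    """Compare one search word with one word from the text."""
    a = normalize(search_word, case_sensitive, ignore_punctuation)
    b = normalize(text_word, case_sensitive, ignore_punctuation)
    # Partial matching uses substring containment, exact matching equality.
    return a in b if allow_partial_matches else a == b


# The trailing comma blocks an exact match unless punctuation is ignored.
print(words_match("leasing", "Leasing,"))
print(words_match("leasing", "Leasing,", ignore_punctuation=True))
print(words_match("pple", "Apple", allow_partial_matches=True))
```

With these assumptions the first call prints `False` and the other two print `True`, which is why ticking ignore_punctuation often helps when a word is followed by a comma or full stop.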


In [None]:
#@markdown # Enter the name or path to your drive folder
#@markdown See the image above to choose the correct name.
drive_folder = "myfolder" #@param {type:"string"}

"""Script to search through multiple PDF files with a key phrase."""
__author__ = "Johan Rudelius"

# Do all imports here
!pip install PyMuPDF
from google.colab import drive
import os
import glob
import functools 
import pandas as pd
import sys, fitz
#import nltk
#nltk.download('punkt')
#from nltk.tokenize import word_tokenize
%load_ext google.colab.data_table

# Mounts your own Google drive to the instance.
drive.mount('/content/drive')

# Creates a list of the PDF file paths.
folder_path = !cd drive/MyDrive && cd {drive_folder} && pwd
pdf_path_list = glob.glob(os.path.join(folder_path[0], "*.pdf"))

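# Reads pdf files and writes data to .txt files in drive_folder/results
results_path = os.path.join(folder_path[0], 'results')
os.makedirs(results_path, exist_ok=True)  # Create the folder if it is missing.
for pdf_path in pdf_path_list:
  # File name without directory and extension, e.g. "Byggmax 2017".
  fname = os.path.join(results_path,
                       os.path.splitext(os.path.basename(pdf_path))[0])
  with fitz.open(pdf_path) as doc, open(fname + ".txt", "wb") as out:
    for page in doc:
      # Gets plain text; get_text() replaces the deprecated getText().
      text = page.get_text().encode("utf8")
      out.write(text)
      out.write(bytes((12,)))  # Writes page delimiter (form feed 0x0C).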

print("The program will read from the folder: ", drive_folder)
print("Found the following files: \n", pdf_path_list)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
The google.colab.data_table extension is already loaded. To reload it, use:
  %reload_ext google.colab.data_table
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
The program will read from the folder:  myfolder
Found the following files: 
 ['/content/drive/MyDrive/myfolder/Byggmax 2017.pdf', '/content/drive/MyDrive/myfolder/Byggmax 2018.pdf', '/content/drive/MyDrive/myfolder/Byggmax 2011.pdf', '/content/drive/MyDrive/myfolder/Byggmax 2016.pdf', '/content/drive/MyDrive/myfolder/Byggmax 2015.pdf', '/content/drive/MyDrive/myfolder/Byggmax 2014.pdf', '/content/drive/MyDrive/myfolder/Byggmax 2013.pdf', '/content/drive/MyDrive/myfolder/Byggmax 2012.pdf', '/content/drive/MyDrive/myfolder/Byggmax 2019.pdf']
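
The loading cell stores each PDF as one text file in which every page is followed by a form feed character (0x0C), and the search cell splits on that character to recover the pages. The round trip can be sketched as follows. `write_pages` and `read_pages` are hypothetical names, the page texts are made up, and the sketch assumes the extracted text never contains a form feed of its own.

```python
import os
import tempfile

PAGE_DELIMITER = "\f"  # Form feed (0x0C), written after every page.


def write_pages(path, pages):
    """Write each page's text followed by the page delimiter."""
    with open(path, "wb") as out:
        for text in pages:
            out.write(text.encode("utf8"))
            out.write(PAGE_DELIMITER.encode("utf8"))


def read_pages(path):
    """Split on the delimiter; the final piece after the last one is empty."""
    with open(path, "r", encoding="utf8") as f:
        return f.read().split(PAGE_DELIMITER)[:-1]


# Made-up example: the second page has no text (e.g. an image-only page).
pages = ["First page text.", "", "Third page text."]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    write_pages(path, pages)
    restored = read_pages(path)

print(restored == pages)
```

Because the delimiter is written after every page, including the last, dropping the final split element keeps the page count intact and preserves empty pages, so page numbers in the results still line up with the PDF.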


In [None]:
#@markdown # Enter your search phrase
search_phrase = "leasing"  # @param {type:"string"}
#@markdown ### Search options:
case_sensitive = False  # @param {type:"boolean"}
ignore_punctuation = False  # @param {type:"boolean"}
allow_partial_matches = False  # @param {type:"boolean"}
output_sentence_word_limit = 21 #@param {type:"slider", min:0, max:100, step:1}

"""Script to search through multiple PDF files with a key phrase."""
__author__ = "Johan Rudelius"

# todo:   
#         !!! Test what happens without results folder in myFolder.
#         !!! Touch up documentation when finished.
#         !!  Add a save to csv / sheets function.
#         !!  Add some document data above with nice graphs. (table with nWords, nPages, etc.)
#         !   Make output prettier.
#         !   Add selector for number of sentences.
# doing:  
#         !!  Remove duplicate entries. (Add results to a set?). * Need to add an option to include multiple occurrences in one sentence.
# done:   
#         !!! Fix words that only partially match. (2020-12-27)
#         !!! Fix words with punctuation. (2020-12-27)
#         !!! Add some pictures and easy documentation. (2020-12-27)
#         !!  Add limit of X chars before / after. (2020-12-27)
# nodo:
#         !   Improve sentence detection https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt *little value for effort

import string
class DocumentSearchHelper:
  """Helper class that holds data representation and enables search.
  """

  sentence_word_limit = 11
  """Maximum sentence limit that will be output in results - class variable""" 

  def __init__(self, search_pages, search_phrase, case_sensitive,\
               ignore_punctuation, allow_partial_matches,\
               sentence_word_limit):
    """Constructor.
    Args:
        search_pages: an array of document page data in string format.
        search_phrase: string repr. of search phrase, whitespace optional.
        case_sensitive: boolean, if True then only match the exact case.
        ignore_punctuation: boolean, if True remove punctuation when matching.
        allow_partial_matches: boolean, if True, match with 'in' instead of '=='.
        sentence_word_limit: int, if 0 no limit, else limits output len.
    """
    self.pages = list(map(lambda x: self._fix_line_breaks(x).split(),\
                          search_pages))
    """A 2D list of text data [pages][words] - instance variable""" 

    if not case_sensitive:
      search_phrase = search_phrase.lower()
    if ignore_punctuation:
      search_phrase = search_phrase.translate(str.maketrans('', '',\
                                                            string.punctuation))
    self.search_phrase = search_phrase.split()
    """Search phrase split into list of words - instance variable""" 

    self.case_sensitive = case_sensitive
    self.ignore_punctuation = ignore_punctuation
    self.allow_partial_matches = allow_partial_matches
    self.sentence_word_limit = sentence_word_limit

  @staticmethod
  def _fix_line_breaks(text):
    """Concatenates hyphenated words broken up with '-' in a linebreak."""
    lines = text.split('\n')
    return functools.reduce(lambda a, b: a[0:-1] + b if a.endswith('-') \
                            else a + ' ' + b, lines)

  def _get_word(self, page, page_index, return_lower=False,\
                remove_punctuation=False):
    """Gets a word at the position, optionally as lower case and without 
    punctuation."""
    return_word = self.pages[page][page_index]
    if return_lower: 
      return_word = return_word.lower()
    if remove_punctuation:
      return_word = return_word.translate(str.maketrans('', '',\
                                                        string.punctuation))
    return return_word

  def _has_next_word(self, page, page_index):
    """Checks if there is a word succeeding the position."""
    if page_index + 1 < len(self.pages[page]):  # Next word in same page.
      return True
    # Last word in page: a later page with at least one word must exist.
    return any(self.pages[p] for p in range(page + 1, len(self.pages)))

  def _has_prev_word(self, page, page_index):
    """Checks if there is a word preceding the position."""
    if page_index > 0:  # Prev word in same page.
      return True
    # First word in page: an earlier page with at least one word must exist.
    return any(self.pages[p] for p in range(page))

  def _next_word(self, page, page_index):
    """Gets position of the next word from the position."""
    assert (self._has_next_word(page, page_index)),\
      "Out of bounds error, there's no next word."
    if page_index + 1 < len(self.pages[page]):
      return page, page_index + 1
    page += 1
    while not self.pages[page]:  # Skip pages without any words.
      page += 1
    return page, 0

  def _prev_word(self, page, page_index):
    """Gets position of the previous word from the position."""
    assert (self._has_prev_word(page, page_index)),\
      "Out of bounds error, there's no prev word."
    if page_index > 0:
      return page, page_index - 1
    page -= 1
    while not self.pages[page]:  # Skip pages without any words.
      page -= 1
    return page, len(self.pages[page]) - 1

  def _get_sentence(self, page, page_index):
    """Gets the surrounding sentence corresponding to the position."""
    sentence_breakers = ['.', '?', '!']
    sentence = self._get_word(page, page_index)
    i = self.sentence_word_limit - 1

    # Append preceding words.
    index = (page, page_index)
    while self._has_prev_word(index[0], index[1]) and not i == 0:
      index = self._prev_word(index[0], index[1])
      prev_word = self._get_word(index[0], index[1])
      if prev_word[-1] in sentence_breakers:
        break  # Naive sentence break check.
      sentence = prev_word + ' ' + sentence
      i -= 1
    if i == 0:
      sentence = '...' + sentence

    # Append subsequent words.
    index = (page, page_index)
    while self._has_next_word(index[0], index[1]) and not i == 0:
      index = self._next_word(index[0], index[1])
      next_word = self._get_word(index[0], index[1])
      sentence = sentence + ' ' + next_word
      if next_word[-1] in sentence_breakers:
        break  # Naive sentence break check.
      i -= 1
    if i == 0:
      sentence += '...'

    return sentence

  def _get_indexes(self):
    """Gets list of all search match positions in the document.
    Returns:
        A list of (page, page_index) tuples (locations of search results).
    """
    indexes = []
    for page in range(len(self.pages)):
      for pageIndex in range(len(self.pages[page])):
        # Check if the next sequence of words equals the search sequence.
        index = (page, pageIndex)
        match = False

        if self.allow_partial_matches:
          try_word = self._get_word(index[0], index[1], not self.case_sensitive,\
                                    self.ignore_punctuation)
          if self.search_phrase[0] in try_word:
            for _ in range(len(self.search_phrase) - 1):
              if not self._has_next_word(index[0], index[1]):
                break
              index = self._next_word(index[0], index[1])
              try_word = try_word + ' ' + self._get_word(index[0], index[1],\
                                                         not self.case_sensitive,\
                                                         self.ignore_punctuation)
            search_sent = functools.reduce(lambda a, b: a + ' ' + b,\
                                           self.search_phrase)
            if search_sent in try_word:
              match = True
        else:
          match = True  # Try to prove otherwise.
          for word_number, searchWord in enumerate(self.search_phrase):
            try_word = self._get_word(index[0], index[1],
                                      not self.case_sensitive,
                                      self.ignore_punctuation)
            if not try_word == searchWord:
              match = False
              break
            if word_number == len(self.search_phrase) - 1:
              break  # Whole phrase matched, no further word is needed.
            if not self._has_next_word(index[0], index[1]):
              match = False
              break
            index = self._next_word(index[0], index[1])

        if match:
          indexes.append((page, pageIndex))
    return indexes

  def get_search_results(self):
    """Gets list of all search matches and positions.
    Returns:
        A 2D list in format [[page, page_index, full_sentence]]
    """
    return list(map(lambda x: [x[0], x[1], self._get_sentence(x[0], x[1])],\
                    self._get_indexes()))

# Iterate over all text files.
txt_path_list = glob.glob(os.path.join(folder_path[0], "results", "*.txt"))
result_frames = []
for txt_path in txt_path_list:
  # Split text on page separator, except for last separator.
  with open(txt_path, "r", encoding="utf8") as f:
    search_pages = f.read().split('\f')[0:-1]

  # Use DocumentSearchHelper to extract search results, collect data frames.
  dsp = DocumentSearchHelper(search_pages, search_phrase, case_sensitive,
                             ignore_punctuation, allow_partial_matches,
                             output_sentence_word_limit)
  df2 = pd.DataFrame(dsp.get_search_results(),
                     columns=["Page", "PageIndex", "Text"])
  df2.insert(0, 'Document', os.path.splitext(os.path.basename(txt_path))[0])
  result_frames.append(df2)

# DataFrame.append was removed in pandas 2.0; concatenate once instead.
columns = ["Document", "Page", "PageIndex", "Text"]
df1 = (pd.concat(result_frames, ignore_index=True) if result_frames
       else pd.DataFrame(columns=columns))

df1

Unnamed: 0,Document,Page,PageIndex,Text
0,Byggmax 2017,36,645,The largest cost component is leasing costs re...
1,Byggmax 2017,58,598,IFRS 16 requires assets and liabilities attrib...
2,Byggmax 2017,58,695,The Group is working on conducting a complete ...
3,Byggmax 2017,58,722,There are primarily leasing contracts relating...
4,Byggmax 2017,58,770,Note 37 contains a specification regarding the...
...,...,...,...,...
390,Byggmax 2019,66,12,...Report Table of contents Business descripti...
391,Byggmax 2019,66,72,"...the sales transaction and the lease, the Gr..."
392,Byggmax 2019,66,106,"...are considered to be operating leases, and ..."
393,Byggmax 2019,80,327,...0.0 -2.2 Closing accumulated depreciation -...
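
The Page and PageIndex columns above identify the word position where each match starts. A simplified standalone sketch of that idea follows. It assumes case-insensitive exact word matching and 0-based positions; `find_phrase` is a hypothetical function, not part of `DocumentSearchHelper`, and the example pages are invented.

```python
def find_phrase(pages, phrase):
    """Return (page, word_index) for every place the phrase starts.

    pages: list of page texts. Matching is case-insensitive and exact
    per word (an assumption made for this sketch).
    """
    target = phrase.lower().split()
    if not target:
        return []
    # Flatten to (page, index, word) so a phrase can span a page break.
    flat = [(p, i, word.lower())
            for p, text in enumerate(pages)
            for i, word in enumerate(text.split())]
    hits = []
    for start in range(len(flat) - len(target) + 1):
        window = [word for _, _, word in flat[start:start + len(target)]]
        if window == target:
            hits.append(flat[start][:2])
    return hits


# Invented example: the first match continues onto the second page.
pages = ["The Group has leasing",
         "contracts for stores. Leasing contracts expire."]
print(find_phrase(pages, "leasing contracts"))
```

Flattening the pages first trades some memory for simpler logic: empty pages disappear on their own, and a phrase that ends on the last word of the document needs no special case.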


In [None]:
from IPython.display import Image
Image(url='https://upload.wikimedia.org/wikipedia/commons/e/e3/Animhorse.gif')

In [None]:
# To use when implementing save function.
drive.flush_and_unmount()
print('All changes made in this colab session should now be visible in Drive.')
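
The save function mentioned in the cell above is not implemented in this notebook. As a rough sketch of what it might look like, the block below writes result rows to a CSV file with Python's standard `csv` module. `save_results`, the file name `search_results.csv` and the example row are all hypothetical, and in the notebook the call would have to run before `drive.flush_and_unmount()`.

```python
import csv
import os
import tempfile


def save_results(rows, csv_path):
    """Write [document, page, page_index, text] rows to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf8") as f:
        writer = csv.writer(f)
        writer.writerow(["Document", "Page", "PageIndex", "Text"])
        writer.writerows(rows)


# Invented row; the comma in the text shows that quoting is handled.
rows = [["example-document", 0, 3, "Leasing costs, as reported."]]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "search_results.csv")
    save_results(rows, csv_path)
    # Read the file back to confirm what was stored.
    with open(csv_path, newline="", encoding="utf8") as f:
        saved = list(csv.reader(f))

print(saved)
```

When the results are already in a pandas data frame such as `df1`, calling its `to_csv` method with `index=False` would likely be the shorter route; the version above only avoids the pandas dependency.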