 *Artificial Intelligence for Vision & NLP* &nbsp; | &nbsp;  *ATU Donegal - MSc in Big Data Analytics & Artificial Intelligence*
 
 # Working with PDF Files

We often have to deal with PDF files. There are many [libraries in Python for working with PDFs](https://www.educative.io/courses/pdf-management-python/39Oy3MD180n), each with its own pros and cons; one of the most commonly used is `PyPDF2`. Note that the name is case sensitive, so make sure your capitalisation matches.

<!--<strong>Note:</strong> Make sure you are in the AI 2 virtual environment before you execute this command. Otherwise you will have problems when importing the `PyPDF2` library.-->

If you are using a local environment you will need to run the following command in the prompt:

    pip install PyPDF2
    
Keep in mind that not every PDF file can be read with this library. It may fail on PDFs that are too blurry (for example, poor-quality scans), use a special encoding, are encrypted, or were created with a program whose output doesn't work well with PyPDF2.

If you find yourself in this situation, try one of the other libraries linked above, but keep in mind that these may not work either. PDFs can be produced with many different parameters and settings, and the text may be stored as an image rather than as encoded text such as UTF-8.
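
One rough way to spot this situation, once text has been extracted from each page, is to check which pages came back empty. The sketch below is only an illustration: `find_empty_pages` and the sample strings are hypothetical and not part of PyPDF2, and an empty result does not prove that a page is an image.

```python
def find_empty_pages(page_texts):
    """Return the indices of pages whose extracted text is missing or blank."""
    return [
        index
        for index, text in enumerate(page_texts)
        if not (text or "").strip()
    ]


# Hypothetical extraction results: the second page yielded only whitespace
# and the third yielded nothing, as can happen with scanned pages.
sample_page_texts = ["ACT I, SCENE I", "  \n", ""]
empty_pages = find_empty_pages(sample_page_texts)
print(empty_pages)
```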

As far as PyPDF2 is concerned, it can only read the text from a PDF document; it won't be able to grab images or other media files from a PDF.

Copy the document `A_Midsummer_Night.pdf` into a folder called `NLP` on your Google Drive. Then we need to install PyPDF2 and mount our Google Drive to interact with the PDFs stored there:

In [None]:
!pip install PyPDF2
# Import the PyPDF2 library
# Be careful with spelling and capitalisation
import PyPDF2
from google.colab import drive
drive.mount('/content/gdrive')

## Reading a PDF File

First we open a PDF, then create a reader object for it. Notice how we use the binary read mode, `rb`, instead of just `r`.

In [None]:
# mode="rb" opens the file for reading in binary mode, because
# a PDF is a binary file and not a text file.
my_pdf_file = open("/content/gdrive/My Drive/NLP/A_Midsummer_Night.pdf", mode="rb")
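
To see why binary mode matters, here is a minimal sketch that uses a throwaway file instead of the course PDF, so it runs without Google Drive or PyPDF2. The file contents are a made-up stand-in: the `%PDF-` header that PDF files start with, followed by a few bytes that are not valid UTF-8 text.

```python
import os
import tempfile

# Hypothetical stand-in for a PDF: the "%PDF-" header followed by bytes
# that are not valid UTF-8 text (real PDFs contain binary data like this).
fake_pdf_bytes = b"%PDF-1.4\n\xff\xfe\x00binary stream data"

with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(fake_pdf_bytes)
    tmp_path = tmp.name

# Binary mode returns raw bytes, so nothing needs to be decoded.
with open(tmp_path, mode="rb") as f:
    header = f.read(5)
looks_like_pdf = header == b"%PDF-"

# Text mode tries to decode the bytes and fails on the binary part.
try:
    with open(tmp_path, mode="r", encoding="utf-8") as f:
        f.read()
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

os.remove(tmp_path)
print(looks_like_pdf, text_mode_failed)
```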

Then we initialise a `PdfReader` object. 

In [None]:
# Initialise a pdf reader object
pdf_reader = PyPDF2.PdfReader(my_pdf_file)

Now we can perform various tasks on the PDF file we've read into the PDF reader object.

In [None]:
len(pdf_reader.pages)

We can now read in text from specific pages. Let's read in the first page of the PDF:

In [None]:
# Indexing of pages in the pdf starts at 0
first_pdf_page = pdf_reader.pages[0]

In [None]:
# Extract the text from the first page
first_pdf_page.extract_text()

We can do this for any page we'd like to view. Let's look at the second page.

In [None]:
second_pdf_page = pdf_reader.pages[1]
# Extract the text from the second page
second_pdf_page.extract_text()

We can see that the text also includes the `\n` newline markers. If we want to see the text with real line breaks instead of the `\n` markers, we can wrap the previous command in a `print` call.

In [None]:
print(second_pdf_page.extract_text())
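
The difference comes from how a notebook echoes a value compared with how `print` displays it. The sketch below uses a short made-up string (`sample_text` is an assumption standing in for the result of `extract_text()`) to show both views.

```python
# Hypothetical sample standing in for the text returned by extract_text()
sample_text = "ACT I\nSCENE I. Athens.\nEnter THESEUS"

# What a notebook cell echoes: the repr, with escape sequences visible
echoed = repr(sample_text)

# What print() shows: each "\n" becomes a real line break
lines = sample_text.splitlines()

print(echoed)
print(sample_text)
```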

And we can store the contents of the text as a string.

In [None]:
my_pdf_text = second_pdf_page.extract_text()

Finally we must close the PDF file.

In [None]:
my_pdf_file.close()
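
An alternative to calling `close()` by hand is Python's `with` statement, which closes the file automatically when the block ends. The sketch below uses a throwaway file (an assumption, so it runs without Google Drive or PyPDF2); in the notebook, the reader would be created and the text extracted inside the block, while the file is still open.

```python
import os
import tempfile

# Throwaway binary file standing in for the course PDF (an assumption,
# so the sketch runs without Google Drive or PyPDF2).
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    tmp_path = tmp.name

with open(tmp_path, mode="rb") as pdf_file:
    # In the notebook, PyPDF2.PdfReader(pdf_file) and extract_text()
    # would be called here, while the file is still open.
    first_bytes = pdf_file.read(8)
    open_inside_block = not pdf_file.closed

# Leaving the block closes the file, even if an error was raised inside.
closed_after_block = pdf_file.closed

os.remove(tmp_path)
print(open_inside_block, closed_after_block)
```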

## Copying All Pages Into a String

So far we've looked at reading one page at a time. What if we want to get a copy of all the text from the PDF? We can quite easily use a `for` loop to do this:

In [None]:
# Open the PDF for extraction
pdf_file = open("/content/gdrive/My Drive/NLP/A_Midsummer_Night.pdf", mode="rb")

# Initialise an empty list to hold the text of each page
all_text = []

# Initialise a pdf reader object
pdf_document_reader = PyPDF2.PdfReader(pdf_file)

# Use a for loop to iterate through each page
# and append each page's text to the list
for page_counter in range(len(pdf_document_reader.pages)):
    current_page = pdf_document_reader.pages[page_counter]
    all_text.append(current_page.extract_text())

# Finally close the pdf file
pdf_file.close()

In [None]:
# Show the contents of all_text
all_text
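
The loop collects one string per page in a list. To end up with a single string, as the section title suggests, the page texts can be joined. This is a minimal sketch: `page_texts` is a made-up stand-in for the list built by the loop, and the blank-line separator between pages is just one possible choice.

```python
# Hypothetical page texts standing in for the list built by the loop;
# an empty string represents a page where no text could be extracted.
page_texts = ["ACT I\nSCENE I.", "", "Enter THESEUS and HIPPOLYTA"]

# Keep only non-empty pages and join them with a blank line between pages
full_text = "\n\n".join(text for text in page_texts if text)

print(len(page_texts), "pages,", len(full_text), "characters")
```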