# How to Read PDFs in Python
##### Author: Rachel Sacdalan
##### Created: 07/01/2022

## Table of Contents
1. [Set-Up](#setup)
2. [Code](#code)
3. [Next Steps](#nextSteps)
4. [Sources](#sources)

## Set-Up <a name="setup"></a>

Libraries Used:
- `PyPDF2`
- `requests`
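
Both libraries are third-party packages, typically installed with `pip install PyPDF2 requests`. As a quick sanity check before running the cells below, this small sketch reports which of them cannot be found in the current environment. The helper name `missing_packages` is made up for this illustration.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Top-level module names used in this notebook.
required = ["PyPDF2", "requests"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed")
```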

## Code <a name="code"></a>

### Local PDF <a name="localPDF"></a>
After downloading the PDF you want onto your computer, read it in with `PyPDF2`'s `PdfReader`, then print the first page to the console.

Steps:
1. Download the Desired PDF
2. Read in the PDF with `PyPDF2`'s `PdfReader`
3. Print out the first page

In [35]:
from PyPDF2 import PdfReader

myPDF = "BeeMovieScript.pdf" # filepath

reader = PdfReader(myPDF)
page = reader.pages[0]
print(page.extract_text())

  Bee Movie Script
 
 According to all known laws of aviation, there is no way a bee should be able to fly.
  
 Its wings are too small to get its fat little body off the ground. The bee, of
  
 course, flies anyway because bees don't care what humans think is impossible.
  
 Yellow, black. Yellow, black. Yellow, black. Yellow, black. Ooh, black and yellow! Let's
  
 shake it up a little. Barry! Breakfast is ready! Ooming! Hang on a second. Hello? -
  
 Barry? - Adam? - Oan you believe this is happening? - I can't. I'll pick you up.
  
 Looking sharp. Use the stairs. Your father paid good money for those. Sorry. I'm
  
 excited. Here's the graduate. We're very proud of you, son. A perfect report card,
  
 all B's. Very proud. Ma! I got a thing going here. - You got lint on your fuzz. - Ow!
  
 That's me! - Wave to us! We'll be in row 118,000. - Bye! Barry, I told you, stop
  
 flying in the house! - Hey, Adam. - Hey, Barry. - Is that fuzz gel? - A little. Special
  
 day, graduation. N

### PDF from a website

After finding the URL of the PDF you want to download, request it with `requests` and save the response to a local file.

Steps:
1. Find the URL of the desired PDF
2. Request the URL and get a `response`
3. Save the `response` into a new PDF on your local computer
4. Close the PDF

In [36]:
import requests

url = "https://www3.nd.edu/~instres/CDS/2021-2022/CDS_2021-2022.pdf"

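print("Downloading file: ")
response = requests.get(url)  # fetch the PDF; the raw bytes are in response.content

pdf = open("CDS.pdf", 'wb')  # 'wb' = write binary, since a PDF is not plain text
pdf.write(response.content)
pdf.close()
print("File downloaded")

print("All PDF files downloaded")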

Downloading file: 
File downloaded
All PDF files downloaded
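
The download cell above writes whatever the server returns, even if it is an error page. A slightly more defensive variant, sketched here, checks the HTTP status and confirms that the body starts with the `%PDF-` marker that PDF files begin with before saving anything. The function names `looks_like_pdf` and `download_pdf` and the 30-second timeout are assumptions made for this illustration, not part of the original notebook.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the standard PDF header marker."""
    return data.startswith(b"%PDF-")


def download_pdf(url: str, path: str, timeout: float = 30.0) -> None:
    """Download `url` and save it to `path`, refusing non-PDF responses."""
    import requests  # imported here so looks_like_pdf can be used without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses

    if not looks_like_pdf(response.content):
        raise ValueError(f"Response from {url} does not look like a PDF")

    # The with block closes the file even if the write fails.
    with open(path, "wb") as pdf:
        pdf.write(response.content)
```

Calling `download_pdf(url, "CDS.pdf")` with the `url` defined above would stand in for the open/write/close lines. The header check is only a heuristic: some PDFs in the wild carry a few stray bytes before the marker and would be rejected by it.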


### Follow [Local PDF](#localPDF) Steps

In [37]:
reader = PdfReader("CDS.pdf")  # the file saved in the previous cell
page = reader.pages[0]  # select the first page of the PDF
print(page.extract_text())

Common Data Set 2021-2022
CDS-AP
age 1

A0
Respondent Information (Not for Publication)

Name:
Title:
Office:


Office of Strategic Planning & Institutional 
Research

Mailing Address:
401 Grace Hall

City/State/Zip/Country:

 Notre Dame, IN 46556
Phone:

 (574) 631-2848

Fax:

 (574) 631-9235

E-mail Address:

instres@nd.edu
X
Yes

No
If yes, please provide the URL of the corresponding Web page:
A0A
A1
Address Information


Name of College/University:
 University of Notre Dame
Mailing Address:

City/State/Zip/Country:

 Notre Dame, IN  46556
Street Address (if different):

City/State/Zip/Country:
Main Phone Number:
 (574) 631-5000
WWW Home Page Address:

www.nd.edu
Admissions Phone Number:
(574) 631-7505
Admissions Toll-Free Phone Number:
Admissions Office Mailing Address:

McKenna Hall

City/State/Zip/Country:

 Notre Dame, IN 46556

Admissions Fax Number:
 (574) 631-8865
Admissions E-mail Address:

admissions@nd.edu


A2

Public


X
Private (nonprofit)
Proprietary


A3
Classify your

## Next Steps <a name="nextSteps"></a>
- Instead of printing out the first page of the PDF, print ALL the pages
- Conduct analysis and build a report
    - Work on "College UnConfidential" Project
- Text mine the extracted content
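
The first and last items above can be sketched together: loop over every page instead of indexing `pages[0]`, then count words as a first text-mining step. The helper names `extract_all_text`, `top_words`, and `print_all_pages` are hypothetical, and the word-splitting rule (lowercase runs of letters and apostrophes) is just one simple assumption.

```python
import re
from collections import Counter


def extract_all_text(reader):
    """Join the text of every page in `reader.pages`, separated by newlines."""
    # Guard against a page yielding no text (for example, a scanned image).
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def top_words(text, n=10):
    """Return the `n` most common lowercase words in `text` with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


def print_all_pages(path):
    """Print the full text of the PDF at `path`, then its most common words."""
    from PyPDF2 import PdfReader  # imported here so the helpers above need no PDF library

    full_text = extract_all_text(PdfReader(path))
    print(full_text)
    print(top_words(full_text))
```

For example, `print_all_pages("BeeMovieScript.pdf")` would print the whole script rather than only its first page.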

## Sources <a name="sources"></a>
- [PyPDF2 Documentation](https://pypdf2.readthedocs.io/en/latest/)
- [Converting PDF to Text](https://www.askpython.com/python/examples/convert-pdf-to-txt)
- [Bee Movie Script Copypasta](https://www.reddit.com/r/copypasta/comments/aair93/bee_movie_script/)
- [Notre Dame Common Data Set](https://www3.nd.edu/~instres/CDS/2021-2022/CDS_2021-2022.pdf)
- [Schema Errors on Stack Overflow](https://stackoverflow.com/questions/30770213/no-schema-supplied-and-other-errors-with-using-requests-get)
- [Downloading PDFs with Python using Requests and Beautiful Soup](https://www.geeksforgeeks.org/downloading-pdfs-with-python-using-requests-and-beautifulsoup/)