### Web Scraping PDF Files
<br>

**The following code is a simple program for accessing the web and extracting links to PDF files from a given web page (provided the page contains such links). I have included three versions of the code: the first simply searches for and extracts any PDF links it finds, the second applies filters to ensure these links are valid and active, and the third makes the code more generally applicable, allowing the user to enter any web address they like (instead of using a single sample web page, as in versions 1 and 2).**
<br>
<br>

Note that the first two cells must be executed before running any part of the program. The first cell installs the necessary Python libraries, whilst the second imports them, making them ready for use. You can run each version of the program individually by selecting its cell and clicking on the 'Run' icon. As mentioned, the third version of the program will prompt you to enter a web address of your choice, so you can run it repeatedly to try scraping different web pages, provided they contain links to PDF files. Feel free to test the program in whatever way you like.

<br>
<br>


In [None]:
#Installing the Python modules to be used 
!pip install requests 
!pip install bs4


In [2]:
#Importing the Python modules to use 
import requests
from bs4 import BeautifulSoup
import re


### Version 1:
<br> 
For this version, I will simply search for and extract PDF links from the Wikipedia page on 'Memory'. To do so, I will use the requests module to access the web page and the Beautiful Soup library to parse the page and identify the links. 
<br>
<br>

In [3]:
#First, specifying the url (web address) to the wikipedia page
url = "https://en.wikipedia.org/wiki/Memory"

#making a get request to access web page
page = requests.get(url)

#extracting HTML document
html = page.text

#creating soup object to parse the HTML document
soup = BeautifulSoup(html, 'html.parser')

#to retrieve the pdf links (using a regular expression to match with)
pdf_links = soup.find_all(href=re.compile(r'\.pdf$'))        #only matches values that end in '.pdf'

#to display the links
for pdf in pdf_links:
    print('link:', pdf.get('href'))


link: https://scholar.harvard.edu/files/schacterlab/files/grafschacter1985.pdf
link: http://www.saylor.org/site/wp-content/uploads/2011/01/TLBrink_PSYCH07.pdf
link: http://bernard.pitzer.edu/~dmoore/psych199s03articles/R-Collier_memory.pdf
link: http://www.crossingdialogues.com/Ms-A14-03.pdf
link: https://web.archive.org/web/20070719053600/http://www.ilcusa.org/_lib/pdf/ISOA.pdf
link: http://www.ilcusa.org/_lib/pdf/isoa.pdf
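
A note on the regular expression: Beautiful Soup applies a compiled pattern with its `search()` method, so a pattern without an end anchor (such as `.+\.pdf`) matches `.pdf` anywhere in the `href`, not only at the end. The sketch below uses only the `re` module to compare an unanchored pattern with an anchored one. The `href` values and the `matching` helper are made up for illustration and are not taken from the Wikipedia page.

```python
import re

# Hypothetical href values, made up for illustration
hrefs = [
    "https://example.org/papers/memory.pdf",
    "https://example.org/papers/memory.pdf?download=1",
    "https://example.org/pdf-archive/index.html",
    "https://example.org/files/report.pdf.html",
]

unanchored = re.compile(r'.+\.pdf')   # '.pdf' anywhere after at least one character
anchored = re.compile(r'\.pdf$')      # '.pdf' only at the very end of the value

def matching(pattern, values):
    """Return the values in which the pattern finds a match, using search()."""
    return [v for v in values if pattern.search(v)]

unanchored_hits = matching(unanchored, hrefs)
anchored_hits = matching(anchored, hrefs)

print('Unanchored pattern matches:', unanchored_hits)
print('Anchored pattern matches:', anchored_hits)
```

Which pattern is preferable depends on the pages being scraped: the anchored form skips addresses such as `report.pdf.html`, but it also skips PDF links that carry a query string.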


### Version 2:
<br> 
This version is similar to the previous one, scraping PDF links from a web page, except that this time I save only the active links into a set and discard the rest. To do so, I have included two filters. The first makes sure that the PDF link works at all, i.e., that the referenced web page in fact exists and a request to it can be made. The second makes sure that the link can actually be accessed, since a link sometimes references a web page that exists and yet is inaccessible (for example, because access is unauthorized or the requested page is no longer available). Finally, this time I'll access another Wikipedia page, the page on 'Attention'. 
<br>
<br>

In [4]:
#Specifying the url (web address) to the wikipedia page
url = "https://en.wikipedia.org/wiki/Attention"

#making a get request to access web page
page = requests.get(url)

#extracting HTML document
html = page.text

#creating soup object to parse the HTML document
soup = BeautifulSoup(html, 'html.parser')

#creating a set object to save active links into
active_links_set = set()


#Searching for and retrieving the pdf links
pdf_links = soup.find_all(href=re.compile(r'\.pdf$'))       # only matches values that end in '.pdf'


#Save only those that are active
for element in pdf_links:
    pdf = element.get('href')     #extract pdf link 
    
    #Filter 1: checking if the referenced web page exists at all 
    try: 
        req = requests.get(pdf)     #single request; the response is reused by Filter 2
    except requests.exceptions.RequestException:     #e.g. malformed address or connection error
        print('Page does not exist:', pdf)
        continue 
        
    #Filter 2: checking the status code to make sure the link is accessible    
    pdf_status_code = req.status_code
    if pdf_status_code != 200:            # status code != 200 would mean that the web page cannot be accessed (even if the link is active)
        print('Page changed or is not found:', pdf)
        continue
    
    #adding active link to the set
    active_links_set.add(pdf)

print('')

#displaying the active links 
for pdf in active_links_set:
    print('Active link:', pdf)
print('')

#Comparing the number of retrieved pdf links vs. only active links 
print("Total number of all links retrieved:", len(pdf_links))
print('Total number of active links:', len(active_links_set))


Page changed or is not found: https://www.aaafoundation.org/sites/default/files/MeasuringCognitiveDistractions.pdf
Page changed or is not found: http://www.princeton.edu/~kahneman/docs/attention_and_effort/Attention_lo_quality.pdf
Page does not exist: http://www.psych.utoronto.ca/users/ferber/teaching/visualattention/readings/Oct6/1998_Friesen_Kingstone_PBR.pdf
Page does not exist: http://cns-web.bu.edu/Profiles/Mingolla.html/cnsftp/cn730-2007-pdf/posner_petersen90.pdf
Page changed or is not found: http://www.icn.ucl.ac.uk/lavielab/reprints/Lavie-etal-04.pdf
Page does not exist: http://www.klab.caltech.edu/~xhou/papers/cvpr07.pdf
Page changed or is not found: http://www.cim.mcgill.ca/~lijian/06243147.pdf

Active link: http://people.ucsc.edu/~brogoff/Psych247articles/Morelli%20et%20al%20Cultural%20Var%20in%20Young%20Children%27s%20Access.pdf
Active link: https://web.archive.org/web/20130626052615/http://www.icn.ucl.ac.uk/lavielab/reprints/lavie-etal-04.pdf
Active link: https://www.msu.e
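
One reason a link can fail Filter 1 is that its `href` is relative rather than a full address: `requests` raises an error for an address without a scheme such as `https://`. The following is a minimal sketch, assuming relative links should be resolved against the page they were scraped from, using `urljoin` from the standard library. The `href` values are hypothetical and are not taken from the Wikipedia page.

```python
from urllib.parse import urljoin

# Address of the page the links were scraped from
page_url = "https://en.wikipedia.org/wiki/Attention"

# Hypothetical href values, made up for illustration
hrefs = [
    "https://example.org/papers/attention.pdf",   # already absolute
    "//example.org/papers/attention.pdf",         # protocol-relative
    "/files/attention.pdf",                       # relative to the site root
    "docs/attention.pdf",                         # relative to the current page
]

# urljoin leaves absolute addresses unchanged and completes the others
absolute_links = [urljoin(page_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```

Resolving the address before calling `requests.get` means a relative link is tested on its merits instead of being reported as non-existent.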

### Version 3:
<br> 
This version is more general purpose. It consists of the same code as before, except that this time it allows the user to enter any web page they like. More code was also added to check that the entered web address is valid and that the page does contain PDF links. As before, it filters the PDF links and saves only the active ones. 
<br> 
To run this cell, click on the 'Run' icon above. Feel free to try it several times on different web pages to scrape pdf links from.
<br>
<br>

In [None]:
while True:
    #Prompting the user for a url (web address)
    url = input('Enter web address: ')
    
    #Checking if the web page entered exists 
    try:
        #making a get request to access web page
        page = requests.get(url)
        #extracting the HTML document
        html = page.text
    except requests.exceptions.RequestException:     #e.g. malformed address or connection error
        print('Web Address is incorrect or does not exist. Please try again.')
        continue 


    #Creating soup object to parse the HTML document
    soup = BeautifulSoup(html, 'html.parser')

    #creating a set object to save active links into
    active_links_set = set()


    #Searching for & retrieving pdf links (if any are found)
    pdf_links = soup.find_all(href=re.compile(r'\.pdf$'))       #only matches values that end in '.pdf'

    #checking if any pdf links were found
    if len(pdf_links) > 0:
        break           #if yes, breaks from the loop
    else:
        print('Page does not contain any PDF links. Try a different web page.')
        continue 


#Saving only those that are active
for element in pdf_links:
    pdf = element.get('href')     #extract pdf link 
    
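    #Filter 1: checking if the referenced web page exists at all 
    try: 
        req = requests.get(pdf)     #single request; the response is reused by Filter 2
    except requests.exceptions.RequestException:     #e.g. malformed address or connection error
        continue 
        
    #Filter 2: checking the status code to make sure the link is accessible    
    pdf_status_code = req.status_code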
    if pdf_status_code != 200:            # status code != 200 would mean that the web page cannot be accessed (even if the link is active)
        continue
    
    #adding active link to the set
    active_links_set.add(pdf)


#Displaying the retrieved pdf links 
for pdf in active_links_set:
    print('Link:', pdf)

#displaying the number of pdf links retrieved 
print('\nTotal number of active pdf links retrieved:', len(active_links_set))
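
The `try`/`except` around `requests.get` in Version 3 only finds out that an address is unusable by attempting the request. As a possible complement, the entered text could be checked before any request is made. The sketch below assumes that a usable web address needs an `http` or `https` scheme and a host name. The helper `looks_like_web_address` is a hypothetical name introduced here, not part of the program above.

```python
from urllib.parse import urlparse

def looks_like_web_address(text):
    """Return True if text has an http(s) scheme and a host name (hypothetical helper)."""
    try:
        parts = urlparse(text.strip())
    except ValueError:
        # urlparse rejects a few malformed inputs outright
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(looks_like_web_address("https://en.wikipedia.org/wiki/Memory"))
print(looks_like_web_address("en.wikipedia.org/wiki/Memory"))
print(looks_like_web_address("not a url"))
```

A check like this only inspects the shape of the address. Whether the page actually exists still has to be established by the request itself.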
