## Week 5 | Web Scraping using Urllib and Beautiful Soup
### Learning Objectives
At the end of this lesson, you will be able to:

- Understand basic HTML
- Retrieve the `html` string of a website
- Parse the `html` string using `Beautiful Soup` to create a `Soup` data structure
- Apply basic CSS selectors to find elements of a `Soup` data structure
- Find children of an element using a `Soup` data structure
- Find properties of an element using a `Soup` data structure
- Download a pdf using the link found from a `Soup` data structure

## Sending an HTTP Request and Getting a Response
- Use `urllib.request.Request` to specify the URL and headers. Here the `User-Agent` header identifies the client as Chrome, since some servers reject requests that carry the default Python user agent.
- Use `urllib.request.urlopen(req)` to send the request to the server and read the response.
- `html` holds the page as one long string of HTML tags. Parse it into a `BeautifulSoup` object so that individual tags can be extracted.
- Use `prettify()` from BeautifulSoup to get the HTML as an indented string, which is easier to read when printed.

In [None]:
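import requests
from bs4 import BeautifulSoup # HTML data structure
import urllib.request

def getSoup(URL):
  # Send the request with a browser-like User-Agent header
  req = urllib.request.Request(URL, headers={'User-Agent': 'Chrome'})
  html = urllib.request.urlopen(req).read().decode("utf8")
  soup = BeautifulSoup(html, "html.parser")
  # prettify() returns an indented string and leaves the soup unchanged;
  # use print(soup.prettify()) to inspect the HTML
  return soup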

In [None]:
base = 'https://www.testpapersfree.com/'
soup = getSoup(base)
# soup
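
To experiment with parsing without sending a request, a soup can also be built from a string. The sketch below uses a small hypothetical page written inline (it is not the real content of the site) and shows that `prettify()` returns a new string rather than modifying the soup.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, written inline so no network request is needed
html = """
<html>
  <head><title>Test Papers</title></head>
  <body>
    <div id="main">
      <a href="show.php?testpaperid=1">Paper One</a>
    </div>
  </body>
</html>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# prettify() returns an indented string; the soup itself is unchanged
pretty = sample_soup.prettify()
print(pretty)

# Tags can be reached by name; .text gives the text inside the tag
title_text = sample_soup.title.text
print(title_text)
```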

## Element Selection using `find()`

- Use `soup.find()` to get the tags you are interested in. Details about `find()` can be found [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find).
- Specify the `id`, `class`, etc. in the `attrs` dictionary.
- `soup.find()` only returns the first element found, or `None` if nothing matches.

In [None]:
# Finds the first div element that has id as main
main_div = soup.find('div', attrs={'id' : 'main'})

In [None]:
# Finds the first div element that has class as posts.
class_div = soup.find('div', attrs={'class' : 'posts'})

In [None]:
# All attrs must match. No div has both class "posts" and id "main", so find() returns None.
no_div = soup.find('div', attrs={'class' : 'posts', 'id' : 'main'})
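
The three cases above can be tried on a small inline document. This is a minimal sketch: the HTML is made up for illustration and only mirrors the `id` and `class` names used in the cells above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mirrors the id/class names used above
html = """
<div id="main" class="content">Main area</div>
<div class="posts">First post list</div>
<div class="posts">Second post list</div>
"""
sample_soup = BeautifulSoup(html, "html.parser")

by_id = sample_soup.find("div", attrs={"id": "main"})
by_class = sample_soup.find("div", attrs={"class": "posts"})
both = sample_soup.find("div", attrs={"class": "posts", "id": "main"})

print(by_id.text)     # text of the div with id "main"
print(by_class.text)  # only the first div with class "posts"

# find() returns None when nothing matches, so check before using the result
if both is None:
    print("No div has both class 'posts' and id 'main'")
```

Checking for `None` before accessing `.text` or an attribute avoids an `AttributeError` or `TypeError` on pages where the element is missing.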

## Multiple Element Selection using `find_all()`

- There are multiple `a` tags on the page. To find all `a` tags, use the `find_all()` function. Take note that this will return a `ResultSet`, which is a collection of `Tag`s. More information about `find_all()` can be found [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all).
- The returned object can be treated as a list and iterated through using a for loop.

In [None]:
# Finds ALL `a` elements
a_elements = soup.find_all('a')

In [None]:
# Iterate through this ResultSet like a list
for a in a_elements:
    # print(a.text)
    pass
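
As an offline illustration of `find_all()`, the sketch below runs on a hypothetical list of links. It also shows the `limit` argument and what comes back when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical list of links
html = """
<ul>
  <li><a href="show.php?testpaperid=1">Paper One</a></li>
  <li><a href="show.php?testpaperid=2">Paper Two</a></li>
  <li><a href="about.php">About</a></li>
</ul>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet, which supports len(), indexing and iteration
a_tags = sample_soup.find_all("a")
link_texts = [a.text for a in a_tags]
print(link_texts)

# limit stops the search after the given number of matches
first_two = sample_soup.find_all("a", limit=2)

# With no match, the result is an empty ResultSet rather than None
tables = sample_soup.find_all("table")
print(len(tables))
```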

## Handling an element

- `<a href="show.php?testpaperid=87490">P6 Chinese 2020 Prelims - Anglo Chinese</a>`
- Select the text between the tags (`<a>My Text</a>`) using `testpaper_element.text`.
- Select the `href` attribute value using `[]`, just like a dictionary: `testpaper_element["href"]`.

In [None]:
# Select the first `a` element whose href contains "testpaper"
testpaper_element = soup.select('a[href*="testpaper"]')[0]
print(testpaper_element.text)
print(testpaper_element["href"])

P6 Chinese 2020 Prelims - Anglo Chinese
show.php?testpaperid=87490
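
The sketch below, on hypothetical HTML, shows the same text and attribute access, plus two related operations: `.get()` for attributes that may be missing, and `.children` for walking the direct children of an element.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical HTML: the second link has no href attribute
html = (
    '<ul id="papers">'
    '<li><a href="show.php?testpaperid=1">Paper One</a></li>'
    '<li><a>No link</a></li>'
    '</ul>'
)
sample_soup = BeautifulSoup(html, "html.parser")
first_a, second_a = sample_soup.find_all("a")

print(first_a.text)     # text between the opening and closing tags
print(first_a["href"])  # attribute access, like a dictionary

# [] raises KeyError for a missing attribute; .get() returns None instead
missing = second_a.get("href")
try:
    second_a["href"]
    raised = False
except KeyError:
    raised = True

# .children yields direct children; keep only Tag objects to skip plain text
papers = sample_soup.find("ul")
child_names = [child.name for child in papers.children if isinstance(child, Tag)]
print(child_names)
```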


## Download a single PDF

- Create a new `Soup` data structure by navigating to the link found earlier
- Find the download button for the school
- Get the .pdf link from the download button
- Download the .pdf file using the requests module

In [None]:
# Create a new soup based on the new URL
newSoup = getSoup(base + 'show.php?testpaperid=87490')

In [None]:
# Select all the `a` elements whose href ends with ".pdf" (the download button)
downloadButton = newSoup.select('a[href$=".pdf"]')
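
The selectors used in this lesson are CSS attribute selectors: `*=` matches a substring, `$=` a suffix, and `^=` a prefix. A minimal sketch on hypothetical links:

```python
from bs4 import BeautifulSoup

# Hypothetical links covering the different cases
html = """
<a href="show.php?testpaperid=1">Paper One</a>
<a href="pdfs/paper_one.pdf">Download</a>
<a href="pdfs/readme.txt">Notes</a>
<a href="https://example.com/">Elsewhere</a>
"""
sample_soup = BeautifulSoup(html, "html.parser")

contains = sample_soup.select('a[href*="testpaper"]')  # href contains "testpaper"
ends_with = sample_soup.select('a[href$=".pdf"]')      # href ends with ".pdf"
starts_with = sample_soup.select('a[href^="pdfs/"]')   # href starts with "pdfs/"

# select_one() returns the first match, or None when there is no match
first_pdf = sample_soup.select_one('a[href$=".pdf"]')
no_zip = sample_soup.select_one('a[href$=".zip"]')

print(ends_with[0]["href"])
```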

In [None]:
# Get the download link as string
pdfLink = base + downloadButton[0]['href']
pdfLink

'https://www.testpapersfree.com/pdfs/P6_Chinese_SA2_2020_ACS_Exam_Papers.pdf'
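
Concatenating `base` and the `href` works here because the `href` is relative to the site root. As a more general alternative, `urllib.parse.urljoin` resolves relative links (including `../`) against the page they were found on. The paths below are hypothetical and only illustrate the resolution rules.

```python
from urllib.parse import urljoin

base = "https://www.testpapersfree.com/"

# A plain relative path is appended to the base
pdf_url = urljoin(base, "pdfs/example.pdf")

# Hypothetical listing page one directory below the site root
page_url = urljoin(base, "2018/index.php?page=2")

# "../" is resolved relative to the page the link was found on
resolved = urljoin(page_url, "../show.php?testpaperid=1")

# An absolute link is returned unchanged
absolute = urljoin(base, "https://example.com/a.pdf")

print(pdf_url)
print(resolved)
```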

In [None]:
# Making a directory for pdfs
import os
if not os.path.exists("pdfs"):
  os.mkdir("pdfs")

In [None]:
# Download the pdf to the pdfs directory
r = requests.get(pdfLink, stream=True)
# The href already starts with "pdfs/", so the file is saved inside the pdfs directory
file_name = downloadButton[0]['href']

with open(file_name, 'wb') as f:
  f.write(r.content)
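
The cell above passes `stream=True` but then reads `r.content`, which still loads the whole file into memory. For large files, one option is to write the response in chunks. The sketch below is an assumption about how that could look; `save_chunks` and `download_file` are hypothetical helper names, not part of `requests`.

```python
import os
import requests


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path, creating parent folders."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    return path


def download_file(url, path, timeout=30):
    """Stream url to path; raises requests.HTTPError on a 4xx/5xx response."""
    with requests.get(url, stream=True, timeout=timeout) as r:
        r.raise_for_status()
        return save_chunks(r.iter_content(chunk_size=8192), path)
```

With these helpers, the download step would become `download_file(pdfLink, file_name)`. Calling `raise_for_status()` avoids saving an error page under a `.pdf` name.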

## Download multiple PDFs
- Now, let us combine the code above to download multiple PDFs at once!

In [44]:
def download_pdfs(base):
  soup = getSoup(base)
  # Visit every test paper page linked from the main page
  for link_element in soup.select('a[href*="testpaper"]'):
    link = link_element["href"]
    newSoup = getSoup(base + link)
    downloadButton = newSoup.select('a[href$=".pdf"]')
    print(downloadButton)
    if not downloadButton:
      continue  # Skip pages without a .pdf link
    pdfLink = base + downloadButton[0]['href']
    r = requests.get(pdfLink, stream=True)
    file_name = downloadButton[0]['href']

    with open(file_name, 'wb') as f:
      f.write(r.content)
  return


## Activity Time!
- Let us put what you have learnt into practice!
- You will now be split into breakout rooms of 4-5 people! This discussion session will last for about 20 minutes. 
- These questions are based on https://www.testpapersfree.com/
- The files should be downloaded through web scraping (code) only.
- Your code may only start from the link above, not from any sublink.
- Each group will send a representative to answer these questions:
1. Download P2-Chinese-2014-SA2-Tao-Nan.pdf (Difficulty: Easy, Hint: It is one of the pdfs available on the main website)
2. Download P6_Maths_SA2_2018_Raffles_Exam_Paper.pdf (Difficulty: Challenging)




In [None]:
# Question 1 solution
def download_pdf_qn1(base):
  soup = getSoup(base)
  for link_element in soup.select('a[href*="testpaper"]'):
    link = link_element["href"]
    newSoup = getSoup(base + link)
    downloadButton = newSoup.select('a[href$=".pdf"]')
    pdfLink = base + downloadButton[0]['href']
    if ("P2-Chinese-2014-SA2-Tao-Nan" in pdfLink):
      r = requests.get(pdfLink, stream=True)
      file_name = downloadButton[0]['href']
      with open(file_name, 'wb') as f:
        f.write(r.content)
      break
  return

download_pdf_qn1(base)

In [27]:
# Question 2 solution
soup = getSoup(base)
link = soup.select('a[href*="2018"]')[0]
soup = getSoup(base + link["href"])

In [28]:
soup.text

'\n\n\nPrimary Test Papers Singapore in year 2018\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n  (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\n  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n  })(window,document,\'script\',\'https://www.google-analytics.com/analytics.js\',\'ga\');\n\n  ga(\'create\', \'UA-41298864-2\', \'auto\');\n  ga(\'send\', \'pageview\');\n\n\n\n\n     (adsbygoogle = window.adsbygoogle || []).push({\n          google_ad_client: "ca-pub-0422232599241478",\n          enable_page_level_ads: true\n     });\n \n\n        (function () {\n            if (typeof _bsa !== \'undefined\' && _bsa) {\n                _bsa.init(\'flexbar\', \'CKYD627N\', \'placement: testpapersfreecom\');\n            }\n        })();\n    \n\n/* Style the tab buttons */\n.tablink {\n  background-color: #555;\n  color: white;\n  float: left;\n  border: none

In [29]:
links = soup.select('a[href*="/2018/index.php?"]')

In [54]:
def download_pdf_qn2(base, mainLink):
  soup = getSoup(mainLink)
  for link_element in soup.select('a[href*="testpaper"]'):
    link = link_element["href"].replace("../", "")
    newSoup = getSoup(base + link)
    downloadButton = newSoup.select('a[href$=".pdf"]')
    pdfLink = base + downloadButton[0]['href']
    r = requests.get(pdfLink, stream=True)
    file_name = downloadButton[0]['href']

    print(file_name)
    # with open(file_name, 'wb') as f:
    #   f.write(r.content)
  return 

lastPage = links[-1]["href"].split("=")[1]
for page in range(1, int(lastPage) + 1):
  link = base + links[0]["href"].split("=")[0] + "=" + str(page)
  download_pdf_qn2(base, link)

pdfs/P1_Chinese_2018_ACS_test1_Papers.pdf
pdfs/P1_Chinese_2018_Catholic_High_test1_Papers.pdf
pdfs/P1_Chinese_2018_Catholic_High_test2_Papers.pdf
pdfs/P1_Chinese_2018_Maris_Stella_test1_Papers.pdf
pdfs/P1_Chinese_2018_Maris_Stella_test2_Papers.pdf
pdfs/P1_Chinese_2018_MGS_test1_Papers.pdf
pdfs/P1_Chinese_2018_MGS_test2_Papers.pdf
pdfs/P1_Chinese_2018_MGS_test3_Papers.pdf
pdfs/P1_Chinese_2018_Raffles_test1_Papers.pdf
pdfs/P1_Chinese_2018_Raffles_test2_Papers.pdf
pdfs/P1_Chinese_2018_Raffles_test3_Papers.pdf
pdfs/P1_Chinese_2018_Raffles_test4_Papers.pdf
pdfs/P1_Chinese_2018_Raffles_test5_Papers.pdf
pdfs/P1_Chinese_2018_Raffles_test6_Papers.pdf
pdfs/P1_Chinese_2018_Raffles_test7_Papers.pdf
pdfs/P1_Chinese_2018_Raffles_test8_Papers.pdf
pdfs/P1_Chinese_2018_River_Valley_test1_Papers.pdf
pdfs/P1_Chinese_2018_SCGS_test1_Papers.pdf
pdfs/P1_Chinese_2018_SCGS_test2_Papers.pdf
pdfs/P1_Chinese_2018_SCGS_test3_Papers.pdf
pdfs/P1_Chinese_2018_SCGS_test4_Papers.pdf
pdfs/P1_HChinese_2018_Ai_Tong_test1
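
The page loop above builds one URL per listing page from the `href` of the last pagination link. The same idea can be sketched in isolation. The `href` format used here (`2018/index.php?page=3`) is an assumption for illustration, and `page_urls` is a hypothetical helper.

```python
from urllib.parse import urljoin


def page_urls(base, last_href):
    """Build the URL of every listing page from the href of the last page link."""
    # Split once from the right: everything before "=" is the prefix,
    # everything after it is the last page number
    prefix, last_page = last_href.rsplit("=", 1)
    return [urljoin(base, f"{prefix}={page}") for page in range(1, int(last_page) + 1)]


urls = page_urls("https://www.testpapersfree.com/", "2018/index.php?page=3")
for url in urls:
    print(url)
```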

In [None]:
# pdfs/P6_Maths_SA2_2018_Raffles_Exam_Papers.pdf is found!!