# Web Scraping Example
# Fall 2021
### Drake Wagner

#### This example can be found at: https://github.com/DrakeWagner/ds-5100-web-scraping-examples

Beautiful Soup Docs: https://beautiful-soup-4.readthedocs.io/en/latest/

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [None]:
headers = {'user-agent': 'UVA Class Example (dbw2tn@virginia.edu) (Language=Python 3.8.2; Platform=Linux(MX-19.4 / 31))'}
# This User-Agent header is sent along with the request to the website. It is optional, but a good habit: it lets the owner of the site see who is visiting.
# more info on headers: https://docs.developer.amazonservices.com/en_US/dev_guide/DG_UserAgentHeader.html

URL = 'http://books.toscrape.com/'

page = requests.get(URL, headers=headers) # request html content from the site
page # displays the response and its status code, e.g. <Response [200]>

Status codes for web requests: https://www.w3.org/Protocols/HTTP/HTRESP.html
For example, if the cell above returned `<Response [404]>`, the server was reached but could not find the requested page ("Not Found"). `<Response [200]>` means the request succeeded and we have access to the html content.
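
As a rough illustration of how to read these codes, here is a minimal sketch using only Python's standard library. The helper name `describe_status` is made up for this example; with `requests`, the integer you would pass in is `page.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code with its standard reason phrase (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError for codes the standard library does not know
    kind = 'success' if 200 <= code < 300 else 'not a success'
    return f'{code} {status.phrase} ({kind})'

for code in (200, 404, 500):
    print(describe_status(code))
```

In a real scraper, calling `page.raise_for_status()` right after `requests.get()` is another option: it raises an exception for 4xx and 5xx responses instead of letting the script carry on with an error page.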

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
# This line parses the HTML content that we downloaded with requests.get() into a BeautifulSoup object we can search
# 'html.parser' selects Python's built-in HTML parser
# Other parsers can be used for other needs, such as 'xml' for XML documents (requires the lxml package)

In [None]:
# page.text     # shows us the source html, if run

Now that we have our html data scraped and assigned to a variable, it's time to start digging through and looking for the information we want to extract.

An html element typically contains:
1. A tag
2. An attribute
3. A value (for that attribute)
4. A string/text

This link gives a good overview of the parts of html, as well as further tips for working with Beautiful Soup: https://www.pluralsight.com/guides/web-scraping-with-beautiful-soup
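
To make the four parts concrete, here is a small sketch that parses a single hand-written element and reads each part back with Beautiful Soup. The snippet, its `book-link` class, and its `href` are invented for illustration and are not taken from the site.

```python
from bs4 import BeautifulSoup

# A made-up element: tag 'a', attributes 'class' and 'href', text 'First book'
snippet = '<a class="book-link" href="catalogue/page-1.html">First book</a>'
link = BeautifulSoup(snippet, 'html.parser').a

print(link.name)        # the tag
print(link.attrs)       # all attributes and their values, as a dict
print(link['href'])     # the value of one attribute
print(link.get_text())  # the string/text between the opening and closing tags
```

Note that `class` comes back as a list, because an element can carry several classes at once.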

Opening the website and pressing `F12` brings up the browser's developer tools (devtools), which let us see the html of the page in its hierarchical structure.

Let's make a list of all the books on the first page. Start by inspecting the html code. We see that the title of each book sits inside an "h3" tag.

![alt text](book_ss1.png "webpage")
![alt text](book_ss2.png "html code")

In [None]:
book_list = []

books = soup.find_all('h3') # searches the html for every 'h3' tag
                            # inspecting the webpage shows us that this specific tag holds each book title
for book in books:
    book_list.append(book.get_text()) # get_text() drops the html tags and keeps just the text
                                      # here, we loop over each 'h3' and extract just its text

book_list
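
One thing to watch for: `get_text()` returns only the text that is visible on the page, and long titles may be shortened there with "...". The sketch below assumes, as in the hand-written fragment it parses, that each `h3` wraps a link whose `title` attribute holds the full name. The fragment and the names `demo_soup` and `full_titles` are illustrative, so check the real markup in devtools before relying on this.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modeled on the book list (an assumption about the markup)
html = """
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

full_titles = []
for heading in demo_soup.find_all('h3'):
    link = heading.find('a')
    if link is None:
        continue  # skip any 'h3' that does not wrap a link
    # prefer the 'title' attribute; fall back to the visible text if it is missing
    full_titles.append(link.get('title', link.get_text(strip=True)))

print(full_titles)
```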

In [None]:
# Let's do another example where we find the genres from the list on the left of the website

genres = soup.find('ul', {'class': 'nav'})  # the webpage shows the 'ul' tag holding the genre list has the class 'nav'
                                            # this can also be written as soup.find('ul', class_='nav')
# each genre is a link inside that list; reading one link at a time keeps multi-word genres
# such as 'Historical Fiction' together, and strip=True gets rid of whitespace/new lines
genres = [a.get_text(strip=True) for a in genres.find_all('a')]

genres[0:10]
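
The lookup by class can be written in a few interchangeable ways. The sketch below tries three of them on a hand-written fragment; the fragment's structure (a `ul` with the classes `nav nav-list` and nested links) is an assumption made for illustration, and `demo_soup` is a name invented here. Beautiful Soup treats `class` as a multi-valued attribute, so searching for `'nav'` still matches an element that carries more than one class.

```python
from bs4 import BeautifulSoup

# Hand-written fragment standing in for the sidebar (an assumption about the markup)
html = """
<ul class="nav nav-list">
  <li><a href="#">Books</a>
    <ul>
      <li><a href="#">Travel</a></li>
      <li><a href="#">Historical Fiction</a></li>
    </ul>
  </li>
</ul>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

by_dict = demo_soup.find('ul', {'class': 'nav'})  # attribute dictionary
by_keyword = demo_soup.find('ul', class_='nav')   # class_ keyword argument
by_selector = demo_soup.select_one('ul.nav')      # CSS selector

# All three return the same outer list; collect the text of each link inside it
demo_genres = [a.get_text(strip=True) for a in by_selector.find_all('a')]
print(demo_genres)
```

Which form to use is mostly a matter of taste; CSS selectors tend to be handy when the path to an element is several levels deep.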