# Web Scraping Exercises - Solutions

## Complete the Tasks Below

**TASK: Import any libraries you think you'll need to scrape a website.**

In [None]:
# CODE HERE

In [None]:
import requests
import bs4

**TASK: Use the requests library and BeautifulSoup to connect to http://quotes.toscrape.com/ and get the HTML text from the homepage.**

In [None]:
# CODE HERE

In [None]:
res = requests.get("http://quotes.toscrape.com/")

In [None]:
res.text

**TASK: Get the names of all the authors on the first page.**

In [None]:
# CODE HERE

In [None]:
soup = bs4.BeautifulSoup(res.text,'lxml')

In [None]:
soup

In [None]:
soup.select(".author")

In [None]:
# A set discards duplicates automatically, so repeat authors are not a concern.
authors = set()
for name in soup.select(".author"):
    authors.add(name.text)

In [None]:
authors

In [None]:
# Same idea, storing the selected tags in a variable first
author_tags = soup.select(".author")
authors = set()
for name in author_tags:
    authors.add(name.text)

In [None]:
authors

**TASK: Create a list of all the quotes on the first page.**

In [None]:
# CODE HERE

In [None]:
quotes = []
for quote in soup.select(".text"):
    quotes.append(quote.text)

In [None]:
quotes

**TASK: Inspect the site and use Beautiful Soup to extract, from the requests text, the top ten tags shown on the top right of the homepage (e.g. Love, Inspirational, Life, etc.). HINT: Keep in mind that there are also tags underneath each quote, so try to find a class that is only present on the top-right tags; perhaps check the span.**

In [None]:
# CODE HERE

In [None]:
soup = bs4.BeautifulSoup(res.text,'lxml')

In [None]:
soup.select('.tag-item')

In [None]:
for item in soup.select(".tag-item"):
    print(item.text)

In [None]:
item_list = []
for item in soup.select(".tag-item"):
    item_list.append(item.text)
item_list

In [None]:
item_list = []
for item in soup.select(".tag-item"):
    item_list.append(item.text.strip())
item_list
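The hint can also be checked offline. The sketch below parses a small hand-written HTML fragment with Python's built-in "html.parser". The fragment only mimics the site's markup and is an assumption, not a copy of the real page. It suggests why .tag-item is the safer selector: .tag also matches the tags listed under each quote.

```python
import bs4

# Hypothetical fragment: one tag under a quote, two tags in the "top tags" box
demo_html = """
<div class="quote">
  <div class="tags"><a class="tag" href="/tag/change/">change</a></div>
</div>
<div class="tags-box">
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
  <span class="tag-item"><a class="tag" href="/tag/life/">life</a></span>
</div>
"""

demo_soup = bs4.BeautifulSoup(demo_html, "html.parser")

# .tag matches every tag link, including the one under the quote
all_tags = [a.text for a in demo_soup.select(".tag")]

# .tag-item matches only the spans in the top-right box
top_tags = [item.text.strip() for item in demo_soup.select(".tag-item")]

print(all_tags)
print(top_tags)
```

The names demo_html and demo_soup are chosen here so that the sketch does not overwrite the soup variable used in the solution cells.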

**TASK: Notice that there is more than one page, and subsequent pages look like this: http://quotes.toscrape.com/page/2/. Use what you know about for loops and string concatenation to loop through all the pages and get all the unique authors on the website. There are many ways to achieve this, but you will need to figure out how to detect that your loop has reached the last page with quotes. For debugging purposes, there are only 10 pages, so the last page is http://quotes.toscrape.com/page/10/. Try to create a loop robust enough that it does not need to know the number of pages beforehand; perhaps use try/except for this. It's up to you!**

In [None]:
# CODE HERE

### Possible Solution #1 (Assuming You Know the Number of Pages)

In [None]:
url = 'http://quotes.toscrape.com/page/'

In [None]:
authors = set()

for page in range(1,11):

    # Concatenate to get new page URL
    page_url = url+str(page)
    # Obtain Request
    res = requests.get(page_url)
    # Turn into Soup
    soup = bs4.BeautifulSoup(res.text,'lxml')
    # Add Authors to our set
    for name in soup.select(".author"):
        authors.add(name.text)


In [None]:
for page in range(1,11):
    page_url = url+str(page)
    print(page_url)

### Possible Solution #2 (Unknown Number of Pages, but Knowledge of What Follows the Last Page)

Let's check what an invalid page (one past the last page with quotes) looks like:

In [None]:
# Choose some huge page number we know doesn't exist
page_url = url+str(9999999)

# Obtain Request
res = requests.get(page_url)

# Turn into Soup
soup = bs4.BeautifulSoup(res.text,'lxml')

In [None]:
soup

In [None]:
# This solution requires that the string "No quotes found!" only occurs on pages past the last page with quotes.
# If for some reason this string also appeared on valid pages, we would need a more specific check.
"No quotes found!" in res.text

In [None]:
authors = set()
page = 1

# Loop until the break below fires on the first page without quotes
while True:

    # Concatenate to get new page URL
    page_url = url+str(page)

    # Obtain Request
    res = requests.get(page_url)

    # Stop once we have gone past the last page with quotes
    if "No quotes found!" in res.text:
        break

    # Turn into Soup
    soup = bs4.BeautifulSoup(res.text,'lxml')

    # Add Authors to our set
    for name in soup.select(".author"):
        authors.add(name.text)

    # Go to Next Page
    page += 1

In [None]:
authors
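A third variation, sketched here as an assumption rather than as the course's solution, stops as soon as a page yields no authors at all, so it does not depend on the exact "No quotes found!" wording. To keep the sketch runnable offline, the page lookup is passed in as a function. Both fetch_authors and fake_site are hypothetical names invented for this illustration. In the notebook, fetch_authors would wrap the requests.get and soup.select(".author") steps shown above.

```python
def collect_unique_authors(fetch_authors, max_pages=1000):
    """Collect unique authors page by page until a page returns no authors.

    fetch_authors(page) is a hypothetical callable returning a list of
    author names for that page number. max_pages is a safety cap so the
    loop cannot run forever.
    """
    authors = set()
    for page in range(1, max_pages + 1):
        names = fetch_authors(page)
        if not names:
            # An empty page means we have gone past the last page with quotes
            break
        authors.update(names)
    return authors


# Offline stand-in for the website: two pages of made-up results
fake_site = {
    1: ["Albert Einstein", "Jane Austen"],
    2: ["Jane Austen", "Mark Twain"],
}

unique_authors = collect_unique_authors(lambda page: fake_site.get(page, []))
print(sorted(unique_authors))
```

The max_pages cap is a design choice for safety: if a site ever returned the same non-empty page for every page number, the loop would still terminate.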

# Project - Working with Multiple Pages and Items



Let's show a more realistic example of scraping a full site. The website http://books.toscrape.com/index.html is specifically designed for people to scrape. Let's try to get the title of every book that has a 2-star rating, ending up with a Python list of all their titles.

We will do the following:

1. Figure out the URL structure to go through every page
2. Scrape every page in the catalogue
3. Figure out what tag/class represents the star rating
4. Filter by that star rating using an if statement
5. Store the results to a list

We can see that the URL structure is the following:

    http://books.toscrape.com/catalogue/page-1.html

In [None]:
# Quick reminder: .format() fills the {} placeholder with its argument
s = 'sarra{}hanen'
s.format('&')

In [None]:
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

We can then fill in the page number with .format()

In [None]:
res = requests.get(base_url.format('1'))
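Before sending many requests, it can help to print a few generated URLs and confirm the template behaves as expected. This is a minimal offline sketch; the range of three pages is arbitrary, and page_urls is a name introduced only for this illustration. Note that .format() accepts an integer as readily as a string.

```python
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

# Build the first three catalogue URLs without requesting them
page_urls = [base_url.format(n) for n in range(1, 4)]

for page_url in page_urls:
    print(page_url)
```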

Now let's grab the products (books) from the get request result:

In [None]:
soup = bs4.BeautifulSoup(res.text,"lxml")

In [None]:
soup.select(".product_pod")

Now we can see that each book has the product_pod class. We can select any tag with this class, and then filter the results further by rating.

In [None]:
products = soup.select(".product_pod")

In [None]:
example = products[0]

In [None]:
type(example)

In [None]:
example

In [None]:
example.text

In [None]:
example.attrs

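Now, by inspecting the site, we can see that the class we want is class='star-rating Two'. The space means the element carries two separate classes, star-rating and Two. If you click on this in your browser, you'll notice it displays the space as a dot, and a CSS selector chains classes the same way, so we want to search for ".star-rating.Two".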
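The effect of the dot can be tried on a hand-written snippet. The HTML below is an assumption that imitates a single book entry, and it is parsed with the built-in "html.parser" so that no request is needed. The sketch also shows that putting a space inside the selector changes its meaning: it then looks for an element with class Two nested inside a star-rating element.

```python
import bs4

# Hypothetical one-book fragment imitating the site's markup
demo_html = '<article class="product_pod"><p class="star-rating Two"></p></article>'
demo_book = bs4.BeautifulSoup(demo_html, "html.parser").select_one(".product_pod")

# Beautiful Soup stores a multi-class attribute as a list of class names
rating_classes = demo_book.select_one(".star-rating")["class"]

# Chained classes: the same element must carry both classes
matches_two = demo_book.select(".star-rating.Two")
matches_three = demo_book.select(".star-rating.Three")

# With a space this becomes a descendant selector, which finds nothing here
matches_descendant = demo_book.select(".star-rating .Two")

print(rating_classes)
print(len(matches_two), len(matches_three), len(matches_descendant))
```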

In [None]:
example

In [None]:
products = soup.select(".product_pod")
example = products[0]
list(example.children)

In [None]:
example.select('.star-rating.Three')

But we are looking for 2 stars, so we can simply check whether the selection returned anything: an empty list means the book does not have that rating.

In [None]:
example.select('.star-rating.Two')

Alternatively, we can quickly check whether "star-rating Two" appears in the string form of the tag (that is, str(example) rather than example.text, which holds only the visible text). Either approach is fine (there are also many other alternative approaches!)
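The two approaches can be compared side by side on a made-up fragment (the HTML and the book title below are assumptions for illustration only). The sketch also shows why the check has to run against the tag's markup: the .text attribute drops all attributes, including the class.

```python
import bs4

# Hypothetical fragment imitating one book entry
demo_html = (
    '<article class="product_pod">'
    '<p class="star-rating Two"></p>'
    '<h3><a href="#" title="An Example Book">An Example ...</a></h3>'
    '</article>'
)
demo_book = bs4.BeautifulSoup(demo_html, "html.parser").select_one(".product_pod")

# Approach 1: CSS selector, non-empty result means a 2-star book
found_by_selector = len(demo_book.select(".star-rating.Two")) != 0

# Approach 2: substring check against the tag's markup
found_in_markup = "star-rating Two" in str(demo_book)

# The same check against .text fails, because .text has no attributes in it
found_in_text = "star-rating Two" in demo_book.text

print(found_by_selector, found_in_markup, found_in_text)
```

The substring check is quick but brittle: it depends on the classes appearing in exactly that order with a single space, whereas the selector does not.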

Now let's see how we can get the title if we have a 2-star match:

In [None]:
example

In [None]:
example.select('a')

In [None]:
example.select('a')[1]

In [None]:
example.select('a')[1]['title']

Okay, let's give it a shot by combining all the ideas we've talked about! (This should take about 20-60 seconds to run. Be aware that a firewall may prevent this script from running. Also, if you are getting a no-response error, try adding a sleep step with time.sleep(1).)
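One possible way to apply the sleep suggestion is to pause only after a failed attempt and then try again. The helper below is a sketch under assumptions, not part of the original solution: fetch_with_retries and flaky_fetch are hypothetical names, and the fetch function is passed in so the example runs offline. With requests, fetch could be requests.get and errors would likely be requests.exceptions.RequestException.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0, errors=(Exception,)):
    """Call fetch(url), sleeping `delay` seconds between failed attempts.

    Assumes retries >= 1. The last failure is re-raised to the caller.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except errors:
            if attempt == retries:
                raise
            time.sleep(delay)


# Offline stand-in that fails twice before succeeding
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("no response")
    return "<html>ok</html>"


page_html = fetch_with_retries(
    flaky_fetch,
    "http://books.toscrape.com/catalogue/page-1.html",
    delay=0,  # no waiting in this demonstration
)
print(len(calls), page_html)
```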

In [None]:
import time


In [None]:
time.sleep(10)

In [None]:
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'
two_star_titles = []

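# range(1,51) covers pages 1 through 50
for n in range(1,51):

    # Build the page URL and request it
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)

    # Parse the page and grab every book on it
    soup = bs4.BeautifulSoup(res.text,"lxml")
    books = soup.select(".product_pod")

    for book in books:
        # A non-empty selection means the book has a 2-star rating
        if len(book.select('.star-rating.Two')) != 0:
            # The second <a> tag holds the title attribute
            two_star_titles.append(book.select('a')[1]['title'])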

In [None]:
two_star_titles