# Web Scraping 101 (in-class)

*After finishing this tutorial, you can extract data from multiple pages on the web, and export such data to CSV files so that you can use it in an analysis. Plan a few hours to work through this notebook. Taking a few breaks in between keeps you sharp! Enjoy!*

--- 

## Learning Objectives

* Understand the difference between headless requests and browser emulation, and apply both methods (using `requests` and `selenium`)
* Generate seeds (“sampling”)
* Navigate a website using URLs and clicking
* Implement timers and modularise extraction code
* Write loops to execute data collections in bulk using functions
* Store data in CSV or JSON files, and enrich with relevant metadata

--- 

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>


## 1. Making different types of website requests

In previous tutorials, you have used the `requests` library to retrieve web data. For example, re-run the following code.



In [12]:
import requests
from bs4 import BeautifulSoup

user_agent = {'User-agent': 'Mozilla/5.0'}
request = requests.get('https://books.toscrape.com/catalogue/sharp-objects_997/index.html', headers = user_agent)
source_code = request.text

# save website
with open('simple_website.html', 'w', encoding='utf-8') as f:
    f.write(source_code)

# parse some information (naming the parser explicitly avoids a warning)
soup = BeautifulSoup(source_code, 'html.parser')
soup.find('h1')

<h1>Sharp Objects</h1>

This works well for relatively simple websites, but... try the same for the homepage of Twitch!

In [11]:
request = requests.get('https://www.twitch.tv/', headers = user_agent)
source_code = request.text
soup = BeautifulSoup(source_code, 'html.parser')

# save website
with open('advanced_website.html', 'w', encoding='utf-8') as f:
    f.write(source_code)

When you open `advanced_website.html` in your browser, you quickly realize there is a problem: you don't see the content that appears when you visit the URL manually. This mainly has to do with how advanced a website is: Twitch is a highly dynamic site with a video player, previews, real-time updates on the number of streams, etc., most of which is loaded by JavaScript after the initial page has been delivered. The `requests` library only downloads that initial source code and simply can't handle the rest.
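
A quick way to judge whether `requests` is enough for a site is to check whether a piece of text you can see in your browser also appears in the downloaded source code. The sketch below illustrates the idea with two made-up HTML snippets (they are assumptions for illustration, not real downloads), and `appears_in_source` is a hypothetical helper name.

```python
def appears_in_source(html, visible_text):
    """Return True if text seen in the browser is present in the raw HTML."""
    return visible_text.lower() in html.lower()

# Made-up example of a simple page: the content is part of the source code
static_html = "<html><body><h1>Sharp Objects</h1></body></html>"

# Made-up example of a dynamic page: the source only holds an empty container
# and a script that fills it in later
dynamic_html = "<html><body><div id='root'></div><script src='app.js'></script></body></html>"

print(appears_in_source(static_html, "Sharp Objects"))   # True
print(appears_in_source(dynamic_html, "Live channels"))  # False
```

If the text is missing from the source code, the page most likely builds its content in the browser, and a browser-based tool is the safer choice.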

So, we're resorting to an alternative way to retrieve data, using `selenium`.

<div class="alert alert-block alert-warning"><b>Installing Selenium and Chromedriver</b> 

To install Selenium and Chromedriver locally, please follow the <a href="https://tilburgsciencehub.com/configure/python-for-scraping/?utm_campaign=referral-short">Tutorial on Tilburg Science Hub</a>.
    
You can also use the code snippet below to automate the installation. Running this snippet takes a little longer each time, but the benefit is that it almost always works!
</div>


In [62]:
# Installing and starting up Chrome using Webdriver Manager
!pip install webdriver_manager
!pip install selenium

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Opening the Twitch site
# (Selenium 4 expects the driver path to be wrapped in a Service object)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

url = "https://twitch.tv/"
driver.get(url)

Current google-chrome version is 103.0.5060
Get LATEST driver version for 103.0.5060
Driver [/Users/hannesdatta/.wdm/drivers/chromedriver/mac64/103.0.5060.134/chromedriver] found in cache


If everything went smoothly, your computer opened a new Chrome window and loaded `twitch.tv`. 

<div class="alert alert-block alert-info"><b>Using Google Colab</b> 

If you're using Google Colab, you won't see a browser window open up.
    
Whenever you switch pages, just manually open that page in your own browser. Although this feels a little less interactive, you will still be able to work through this tutorial!

</div>



From now onwards, you can use `driver.get('https://google.com')` to point to different websites (i.e., you don't need to install and start the driver over and over again, unless you open up a new instance of Jupyter Notebook).

We can now also try to extract information. Note that we're converting the source code of the site to a `BeautifulSoup` object (because you may have learnt how to use `BeautifulSoup` earlier).

In [18]:
# we also need the time package to wait a few seconds until the page is loaded
import time
url = "https://twitch.tv/"
driver.get(url)
time.sleep(3)


In [25]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

streams = soup.find_all('a', {'data-test-selector':"TitleAndChannel"})

# print a list of stream names
counter = 0
for stream in streams:
    counter = counter + 1
    print('Stream ' + str(counter) + ': ' + stream.get_text())


Stream 1: DROPS - BANDITS WITH DAD-  Answering New Player Questions -Lewpac
Stream 2: ☑️OMG ☑️CLICK NOW TO HEAR ABOUT IT ☑️CLICK NOW ☑️THIS IS CRAZY ☑️INSANE ROLEPLAYER GOD ☑️GAMEPLAY GOLEM ☑️REACTOR! ☑️xQc
Stream 3: 🔴7/24 DROPS ON🔴 !drops | discord.gg/kJDzdXdBKCMultiversusTR
Stream 4: Berlin döner = best döner | !socials $7sr !vpn !maprobcdee
Stream 5: Sipping coffee whilst providing an update on editing. Also featuring my pet ant.SovietWomble
Stream 6: Tandem Cycling to Norway? - Day 36 - Lauwersoog, NL | !trip !bike !map !merchHitch
Stream 7: IM BACK!! Lets see how top players perform on the new patch! !sellout to request a killer build! #ad !AWxNvidiaTrU3Ta1ent
Stream 8: Wampus 4 sub hour grind | No Drama Zone !hellofreshMudda_tm
Stream 9: NNO Gamer gamet um 21 Uhr mit anderen NNO Gamern gegen Alle für einen Gamer und hofft gut zu gamen um in den playoffs auch noch zu gamenTolkinLoL
Stream 10: Lets Have Some Fun - !youtubeFarshadSilent
Stream 11: I don't love StrayLimmy
Stream 12: 

Wow - this is cool. You've just learnt a second way to open websites, using `selenium`. The benefit of `selenium` is that you can work with highly dynamic websites (which also helps you avoid getting blocked). The drawback is that `selenium` is slower than just using the `requests` library, and it may sometimes be buggy on computers without a screen (which matters when you scale up your data collection).

<div class="alert alert-block alert-info"><b>Awesome stuff with Selenium</b> 

Selenium is your best shot at navigating a dynamic website. It can do amazing things, such as 
    
<ul>
    <li>"clicking" on buttons</li>
    <li>scrolling through a site</li>
    <li>hovering over items and capturing information from popups,</li>
    <li>starting to play a stream,</li>
    <li>typing text and submitting it in the chat, and</li>
    <li>so much more...!</li>
</ul>
    
Note though that we won't cover the advanced functionality of Selenium in this tutorial, but the optional "Web data advanced" tutorial holds the necessary information.
   
</div>



__Exercise 1.1__

Please write code snippets to extract the following pieces of information. Do you choose `requests` or `selenium`?

1. The titles of all `<h2>` tags from `https://odcm.hannesdatta.com/docs/course/`
2. The titles of all available TV series from `https://www.bol.com/nl/nl/l/series/3133/30291/` (about 24)

```
soup.find('a', class_='product-title')
```



In [None]:
# write your solution here

## 2. Seed Generation


__Importance__

So far, we've parsed some information (e.g., titles, product names, prices) from websites. What we haven't really done yet is decide for which products to obtain that information. Ideally, we would like to capture information for a sample of users, books, movies, series, etc. 

In web scraping, we typically refer to a "seed" as a starting point for a data collection. Without a seed, there's no data to collect.

For example, before we can crawl through all books available on [this site](https://books.toscrape.com/catalogue/category/books_1/index.html), we first need to generate a *list of all books on the page*.

One way to get there would be to:

1. first scrape all book links (“seeds”) from the overview page, and 
2. then iterate over all links to scrape the product description (or anything else on that page). 

Note that the overview page allows us to "navigate" to the individual book pages, either by clicking on the book cover or the book title (see red boxes in the figure below). 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/books_links.png" align="left" width=80%/>

### 2.1 Collecting Links to use as seeds

Let's check out how the links from the book covers or book titles are encoded in the website's source code.

Open the [book catalogue](https://books.toscrape.com/catalogue/category/books_1/index.html), and inspect the underlying HTML code with the Chrome Inspector (right click --> inspect element). 

The book covers (`<img>`) are surrounded by `<a>` tags, which contain a link (`href`) to the book. 

Also, the book titles (`<h3>`) are surrounded by `<a>` tags with the relevant links to the book pages.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/inspector_links.png" align="left" width=80%/>

How could we tell a computer to capture the links to the various books on the site?

One simple way is to select *elements by their tags*. For example, to extract all links (`<a>` tags). 

__Exercise 2.1__

Please run the code cell below, which extracts all links (the `a` tag!), and prints the URL (`href`) to the screen. Don't worry, you don't need to understand the code yet, we'll go over it line by line shortly!

If you look at these links more closely, you'll notice that we're not interested in many of these links... 

Make a list of all links we're *not* interested in (i.e., those *not* pointing to a book page). Which ones are those? Can you find out why they are there?

In [27]:
# Run this code now
import requests
from bs4 import BeautifulSoup

# make a get request to the books overview page (see Webdata for Dummies tutorial)
user_agent = {'User-agent': 'Mozilla/5.0'}
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
res = requests.get(url, headers = user_agent)
soup = BeautifulSoup(res.text, "html.parser")

# print the href attribute of every <a> tag on the page
for link in soup.find_all("a"): 
    print(link.attrs["href"])

../../../index.html
../../../index.html
index.html
../books/travel_2/index.html
../books/mystery_3/index.html
../books/historical-fiction_4/index.html
../books/sequential-art_5/index.html
../books/classics_6/index.html
../books/philosophy_7/index.html
../books/romance_8/index.html
../books/womens-fiction_9/index.html
../books/fiction_10/index.html
../books/childrens_11/index.html
../books/religion_12/index.html
../books/nonfiction_13/index.html
../books/music_14/index.html
../books/default_15/index.html
../books/science-fiction_16/index.html
../books/sports-and-games_17/index.html
../books/add-a-comment_18/index.html
../books/fantasy_19/index.html
../books/new-adult_20/index.html
../books/young-adult_21/index.html
../books/science_22/index.html
../books/poetry_23/index.html
../books/paranormal_24/index.html
../books/art_25/index.html
../books/psychology_26/index.html
../books/autobiography_27/index.html
../books/parenting_28/index.html
../books/adult-fiction_29/index.html
../books/humo

**Your answer**

...

__Solution__

The links we want to ignore are...

* "Books to Scrape" link at the top
* "Home" breadcrumb link 
* Left sidebar with all book genres (e.g., Travel)
* The next button at the bottom

These links are present on the page because users need them to navigate the site. This can also be seen in the animation below:

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/books_overview.gif" align="left" width=50%/>

### 2.2 Collecting *More Specific* Links

__Importance__

We've just discovered that selecting elements by their tags gives us many irrelevant links. But, how can we narrow down these links, or, in other words, __how can we scrape only the book links we're interested in?__

To answer this question, we need to briefly revisit the notion of __HTML classes__. 

A __class__ is often used as a reference in the code, for example, to make all text elements with a given class blue or to increase their font size. In the Chrome Inspector screenshot shown earlier, you find an `<article>` tag with class `product_pod`, in which a `<div>` is nested that contains the image and the link we're after. 

Every link to a book is *nested within this class* (nested = "part of"). The "wrong links" extracted above (i.e., the ones in the page's header and sidebar) are *not*. 

Thus, if we can tell our scraper that we're only interested in the `<a>` tags *within the `product_pod` class*, we end up with our desired selection of links.

__Let's try it out__

Like before, we'll use `.find_all()` to capture all matching elements on the page. The difference, however, is that we specify __a class (`class_=`)__, rather than an HTML tag. From the inspector, we know the class name (`product_pod`). 

The result is a list with __all 20 `product_pod` elements__ on the page (i.e., one for each book). 

Run the code below, in which we pick the __first book__ from the list (A Light in the Attic, element `[0]`), and extract the `<a>` tag nested within the `product_pod` class. 

Finally, we pull out the `href` attribute from the `<a>` tag which gives us the book link. Unlike the example above, we have selected only a single element (`[0]`) and therefore don't need to loop over all links with a `for`-loop.

In [28]:
import requests
from bs4 import BeautifulSoup

# make a get request to the books overview page (see Webdata for Dummies tutorial)
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
user_agent = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers=user_agent)
soup = BeautifulSoup(res.text, "html.parser")

# return the href attribute in the <a> tag nested within the first product class element
soup.find_all(class_="product_pod")[0].find("a").attrs["href"]

'../../a-light-in-the-attic_1000/index.html'

Note the `../../` in front of the link: it tells the browser to go up two directories from the directory of the current page:
* Current URL: https://books.toscrape.com/catalogue/category/books_1/index.html
* Current directory: https://books.toscrape.com/catalogue/category/books_1/
* 1 step up: https://books.toscrape.com/catalogue/category/
* 2 steps up: https://books.toscrape.com/catalogue/

Thereafter, it appends `a-light-in-the-attic_1000/index.html` to the URL which forms the full link to the [A Light in the Attic](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html) book. 

Pretty cool, right?
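
Python's standard library can resolve such relative links in the same way a browser does. The snippet below is a minimal sketch using `urljoin` from `urllib.parse`; the variable names are chosen for illustration only.

```python
from urllib.parse import urljoin

# The page on which the link was found
page_url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'

# A relative link that goes up two directories before pointing to the book
relative_link = '../../a-light-in-the-attic_1000/index.html'
book_url = urljoin(page_url, relative_link)
print(book_url)
# https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

# A relative link without ../ stays in the current directory
next_page_url = urljoin(page_url, 'page-2.html')
print(next_page_url)
# https://books.toscrape.com/catalogue/category/books_1/page-2.html
```

The advantage over manual string replacement is that `urljoin` works for any number of `../` steps, as long as you pass the URL of the page on which the link was found.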

#### Exercise 2.2
1. Modify the script to extract the link from the *second book* (Tipping the Velvet), using BeautifulSoup.
2. Create a new variable `book_url` that combines the base URL (`https://books.toscrape.com/catalogue/`) and the string you extracted in the previous question (`../../tipping-the-velvet_999/index.html`). You can remove the `../../` by calling the `.replace('../../', '')` method on the string. The final URL needs to be: `https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html` 
3. Write a function to collect all links (seeds) from this page.

In [None]:
# your answer goes here!

#### Solutions

In [29]:
# Question 1
url_book = soup.find_all(class_="product_pod")[1].find("a").attrs["href"]
print(url_book)

../../tipping-the-velvet_999/index.html


In [30]:
# Question 2 
base_url = "https://books.toscrape.com/catalogue/" # gives a 403 error if you run the URL separately but works as expected once combined with the book url
book_url = base_url + url_book.replace('../../', '')
print(book_url)

https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html


In [33]:
# Question 3
def get_all_links(url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'):
    user_agent = {'User-agent': 'Mozilla/5.0'}
    res = requests.get(url, headers = user_agent)
    soup = BeautifulSoup(res.text, "html.parser")

    # collect all elements with the product_pod class (one per book)
    books = soup.find_all(class_="product_pod")
    
    base_url = "https://books.toscrape.com/catalogue/"
    book_urls = []
    for book in books:
        # the first <a> tag in each element holds the relative book link
        url_book = book.find("a").attrs["href"]
        book_url = base_url + url_book.replace('../', '')
        book_urls.append(book_url)
    
    return book_urls

get_all_links()

['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'https://books.toscrape.com/catalogue/soumission_998/index.html',
 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'https://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'https://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-tr

## 3. Page Navigation

### 3.1. Using URLs

__Importance__

Alright - what have we learnt up to this point?

We've learnt two ways to extract data (`requests` vs. `selenium`), and how to extract seeds from a page.

So... what's missing?

Exactly! The [`books.toscrape.com`](https://books.toscrape.com/catalogue/category/books_1/index.html) site contains __1000 books__, spread across __50 pages__. 

So, the goal of this section is to navigate through the __entire book assortment__, not only the first 20 books!

__Let's try it out__

Open [the website](https://books.toscrape.com/catalogue/category/books_1/index.html), and click on the "next" button at the bottom of the page.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/books.png" align="left" width=60%/>


Repeat this a couple of times, and observe how the URL in your navigation bar is changing...

- `https://books.toscrape.com/catalogue/category/books_1/page-1.html`
- `https://books.toscrape.com/catalogue/category/books_1/page-2.html`
- `https://books.toscrape.com/catalogue/category/books_1/page-3.html`

Can you guess the next one...?

Indeed! The URL can be divided into a __fixed base part__ (`https://books.toscrape.com/catalogue/category/books_1/`), and a __counter__ that is dependent on the page you're visiting (e.g., `page-1.html`). 

__Now let's create a list of all 50 URLs!__ 

First, we create a counter variable, which we now set to 1 (but it can take on any value later on). Then, we concatenate the `base_url` with the counter (note that we have to convert the integer counter to a string before we can do that, using the `str` function).

In [34]:
base_url = "https://books.toscrape.com/catalogue/category/books_1/"
counter = 1
full_url = base_url + "page-" + str(counter) + ".html" 
print(full_url)

https://books.toscrape.com/catalogue/category/books_1/page-1.html


In a similar fashion, we generate a list of 50 `page_urls` with a for loop. Note that `range(1, 51)` starts at 1 and stops *before* 51, so the last counter value is 50. 

In [35]:
base_url = "https://books.toscrape.com/catalogue/category/books_1/"
page_urls = []

for counter in range(1, 51):
    full_url = base_url + "page-" + str(counter) + ".html" 
    page_urls.append(full_url)

As expected, this gives a list of all page URLs that contain books. 

In [36]:
# print the number of page urls (btw, run print(page_urls) for yourself to see all page URLs!)
print("The number of page urls in the list is: " + str(len(page_urls)))

The number of page urls in the list is: 50
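
The same list can be built more compactly by wrapping the logic in a small function with a list comprehension. This is only a sketch: `build_page_urls` is a hypothetical helper name, and it assumes that the pages follow the `page-<number>.html` pattern observed above.

```python
def build_page_urls(base_url, n_pages):
    """Return the URLs of pages 1 to n_pages (inclusive) for a given base URL."""
    return [f"{base_url}page-{counter}.html" for counter in range(1, n_pages + 1)]

page_urls = build_page_urls("https://books.toscrape.com/catalogue/category/books_1/", 50)

print(len(page_urls))   # 50
print(page_urls[0])     # https://books.toscrape.com/catalogue/category/books_1/page-1.html
print(page_urls[-1])    # https://books.toscrape.com/catalogue/category/books_1/page-50.html
```

Passing the number of pages as an argument makes it easy to reuse the function for other categories or websites with the same URL pattern.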


#### Exercise 3.1
In this exercise, we practice generating a seed for another website, [`quotes.toscrape.com`](https://quotes.toscrape.com/), which displays 100 famous quotes from GoodReads, categorized by tag. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/quotes.png" align="left" width=60% style="border: 1px solid black" />

1. Make yourself comfortable with how the [site](https://quotes.toscrape.com) works and ask yourself questions such as: how does the navigation work, how many pages are there, what is the base URL, and how does it change if I move to the next page?
2. Generate a list `quote_page_urls` that contains the page URLs we need if we'd like to scrape all 100 quotes.

In [None]:
# your answer goes here!

#### Solutions
1. The 100 quotes are evenly spread across 10 pages. The base URL is `https://quotes.toscrape.com/page/` followed by a page number between 1 and 10.

In [37]:
# Question 2
base_url = "https://quotes.toscrape.com/page/"
quote_page_urls = []

for counter in range(1, 11):
    full_url = base_url + str(counter)
    quote_page_urls.append(full_url)

print(quote_page_urls)

['https://quotes.toscrape.com/page/1', 'https://quotes.toscrape.com/page/2', 'https://quotes.toscrape.com/page/3', 'https://quotes.toscrape.com/page/4', 'https://quotes.toscrape.com/page/5', 'https://quotes.toscrape.com/page/6', 'https://quotes.toscrape.com/page/7', 'https://quotes.toscrape.com/page/8', 'https://quotes.toscrape.com/page/9', 'https://quotes.toscrape.com/page/10']


### 3.2 Using links contained in elements (e.g., buttons)

__Importance__

So far, the book link extraction has worked without problems. Yet, there's still one little improvement that we can make. *If the number of pages changes*, we need to manually update the number of pages for which we would like to retrieve seeds.

A general solution is therefore to look up whether there is a `next` button on the page (see HTML code below). We can then either "grab" the URL and visit it (so, in essence, we're still using URLs to navigate), or - instead - "click" on it.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/next_page.png" align="left" width=60% style="border: 1px solid black" />

__Let's try it out__

So, let's write a snippet that "captures" the link of the next page button.

We always proceed in small steps.

In [44]:
# Step 1: Load the website's source code and convert to BeautifulSoup object
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
user_agent = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = user_agent)
soup = BeautifulSoup(res.text, "html.parser")

In [45]:
# Step 2: Trying to locate the "next" class.
soup.find(class_='next')

<li class="next"><a href="page-2.html">next</a></li>

In [47]:
# Step 3: Trying to locate the <a> tag within the "next" class
soup.find(class_='next').find('a')

<a href="page-2.html">next</a>

In [48]:
# Step 4: Trying to extract the link ('href' attribute)
soup.find(class_='next').find('a')['href']

'page-2.html'

At each iteration, we can observe how we're getting closer to the information we need.

Now, we only need to combine the base URL with the relative link to the next page.

In [52]:
base_url = 'https://books.toscrape.com/catalogue/category/books_1/'
next_page = soup.find(class_='next').find('a')['href']
base_url + next_page

'https://books.toscrape.com/catalogue/category/books_1/page-2.html'

__Exercise 3.2__

Please first load the snippet below, which has wrapped the "next page" capturing in a function. Observe the use of `try` and `except`, which accounts for the last page NOT having a next page button.

In [None]:
base_url = 'https://books.toscrape.com/catalogue/category/books_1/'

def next_page(url):
    user_agent = {'User-agent': 'Mozilla/5.0'}
    res = requests.get(url, headers = user_agent)
    soup = BeautifulSoup(res.text, "html.parser")
    try:
        next_page = soup.find(class_='next').find('a')['href']
    except (AttributeError, TypeError):
        # soup.find() returned None: there is no "next" button on this page
        next_page = 'no next page'
    return(base_url + next_page)



1. Pass `https://books.toscrape.com/catalogue/page-49.html` to `next_page()` and observe the output. Then, use  `https://books.toscrape.com/catalogue/page-50.html`. Is that what you expected? 

2. Write a while loop that assembles a list of all product pages for the book category (`'https://books.toscrape.com/catalogue/category/books_1/'`), by extracting next page URLs from each page and appending them to an array/list called `urls`.


In [58]:
# write your code here

__Solution__

In [60]:
urls = []
url = base_url

while True:
    print('Trying to get next page URL from ' + url)
    next_url = next_page(url)
    if 'no next page' in next_url: break
    url = next_url
    urls.append(url)
    
urls

Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-2.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-3.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-4.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-5.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-6.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-7.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-8.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-9.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-10.html
Trying to get next p

['https://books.toscrape.com/catalogue/category/books_1/page-2.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-3.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-4.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-5.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-6.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-7.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-8.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-9.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-10.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-11.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-12.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-13.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-14.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-15.html',
 'https://book
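
The logic of such a loop can be tried out without sending any requests. The sketch below is a variation on the solution, not a replacement for it: `collect_pages` and `fake_site` are hypothetical names, the "website" is just a dictionary that maps each page to its next page, and a missing next page is signalled with `None` rather than with a text string. A `max_pages` cap is added as a safety net against endless loops.

```python
def collect_pages(start_url, get_next_url, max_pages=100):
    """Follow 'next' links from start_url until none is left (or the cap is hit)."""
    visited = []
    url = start_url
    while url is not None and len(visited) < max_pages:
        visited.append(url)
        url = get_next_url(url)  # returns None when there is no next page
    return visited

# Hypothetical stand-in for a website with three pages
fake_site = {'page-1.html': 'page-2.html',
             'page-2.html': 'page-3.html',
             'page-3.html': None}

pages = collect_pages('page-1.html', fake_site.get)
print(pages)  # ['page-1.html', 'page-2.html', 'page-3.html']
```

Unlike the solution above, this version also stores the starting page. To use it on a real site, `get_next_url` would be a function that downloads a page and returns the full URL of its next page.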

### 3.3 Using interactive elements (e.g., by clicking buttons)

__Importance__

For more dynamic websites, we may have to click on certain elements (rather than extracting some URL).

__Try it out__

If you haven't done so, rerun the installation code for `selenium` from above. Then, proceed by running the following cell and observe what happens in your browser.


In [63]:
driver.get('https://books.toscrape.com/catalogue/category/books_1/')

After a few seconds, your browser will have loaded the website in Chrome. Now, run the next cells.

In [66]:
# Step 1: Let's try locating the element
from selenium.webdriver.common.by import By
driver.find_element(By.CLASS_NAME, 'next')

<selenium.webdriver.remote.webelement.WebElement (session="b91f5618425843d7f2a25da8f50a3651", element="92256b27-cc69-4967-8aa5-158572f78c15")>

In [73]:
# Step 2: Finding the link within the `next` class
driver.find_element(By.CLASS_NAME, 'next').find_element(By.TAG_NAME, 'a')

<selenium.webdriver.remote.webelement.WebElement (session="b91f5618425843d7f2a25da8f50a3651", element="b044c417-c0b2-4de2-9249-011a4eaa9475")>

In [76]:
# Step 3: Clicking the link!
driver.find_element(By.CLASS_NAME, 'next').find_element(By.TAG_NAME, 'a').click()

Boom! In step 3, we finally clicked on the link. Just try rerunning this cell with step 3 over and over again. Does iterating through the pages work?!

__Exercise 3.3__

Iterate through the entire set of pages, until there are no new pages left. This time, use `selenium` and click on the next page button. You can start on page 47 (`https://books.toscrape.com/catalogue/category/books_1/page-47.html`) to speed up this exercise a bit.

Make use of the `time.sleep(2)` function to make the code wait a bit after each page load.


__Solution__

In [78]:
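import time
from selenium.common.exceptions import NoSuchElementException

driver.get('https://books.toscrape.com/catalogue/category/books_1/page-47.html')
time.sleep(2)

while True:
    try:
        driver.find_element(By.CLASS_NAME, 'next').find_element(By.TAG_NAME, 'a').click()
        time.sleep(2)
    except NoSuchElementException:
        # no "next" button left: we have reached the last page
        break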

### 3.4 Collecting all seeds

Up to this moment, we have defined what seeds are (crucially important for sampling!), and introduced several ways through which you can navigate on a site. The only thing that's missing is combining these two things: navigating through all of the available pages, and collecting seeds for which we can later extract data.

__Exercise 3.4__

Using the solution from exercise 3.2, write code that navigates through all pages of the book category and stores product URLs in a list of dictionaries, containing the following data points:
- product URL
- URL from which page the product URL was captured
- current time stamp


__Solution__

In [92]:
import time
seeds = []
base_url = 'https://books.toscrape.com/catalogue/category/books_1/'
url = base_url #initialize for first page
counter = 0 #initialize counter so that you can break earlier from this loop when needed

while True:
    counter+=1
    
    #if (counter>4): break # uncomment this line if you want to break after x iterations for prototyping
    
    print('Trying to get next page URL from ' + url)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    
    # extract information
    urls = soup.find_all(class_="product_pod")
    for book in urls:
        url_book = book.find("a").attrs["href"]
        book_url = "https://books.toscrape.com/catalogue/" + url_book
        book_url = book_url.replace('../', '')
        seeds.append({'product_url': book_url,
                      'page_url': url,
                      'timestamp': int(time.time())})
    # next page available?
    try:
        url = base_url + soup.find(class_='next').find('a')['href']
    except (AttributeError, TypeError):
        break # no next page present
    
seeds

Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-2.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-3.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-4.html


[{'product_url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1658753809},
 {'product_url': 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1658753809},
 {'product_url': 'https://books.toscrape.com/catalogue/soumission_998/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1658753809},
 {'product_url': 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1658753809},
 {'product_url': 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1658753809},
 {'product_url': 'https://books.toscr

## 4. Data Extraction


### 4.1 Timers

__Importance__

You may have noticed the `time.sleep` function in some of the cells above. Sending many requests at the same time can overload a server. Therefore, it's highly recommended to pause between requests rather than sending them all simultaneously. Pausing also lowers the risk that your IP address (i.e., the numerical label assigned to each device connected to the internet) gets blocked, after which you could no longer visit (and scrape) the website. 

__Let's try it out__

In Python, you can import the `time` module. Its `sleep` function pauses the execution of subsequent commands for a given number of seconds. For example, the print statement after `time.sleep(3)` will only be executed after 3 seconds:

In [79]:
# run this cell again to see the timer in action yourself!
import time
time.sleep(3)
print("I'll be printed to the console after 3 seconds!")

I'll be printed to the console after 3 seconds!
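
A fixed pause works, but in a loop that sends many requests it can be convenient to wrap the waiting in a small helper. The sketch below is one possible way to do this and is not part of the original tutorial: `polite_pause` is a hypothetical helper name, and drawing a random waiting time between `min_seconds` and `max_seconds` is an assumption about what a reasonable delay could look like.

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=2.0):
    """Sleep for a random duration between min_seconds and max_seconds.

    Hypothetical helper (not from the tutorial); returns the number of
    seconds waited so that the caller can log it.
    """
    wait = random.uniform(min_seconds, max_seconds)
    time.sleep(wait)
    return wait

# example: pause between three (pretend) requests
for page in ['page-1', 'page-2', 'page-3']:
    print('Requesting ' + page)  # a real scraper would call requests.get() here
    waited = polite_pause(0.1, 0.2)  # short values so that the demo finishes quickly
    print('Waited ' + str(round(waited, 2)) + ' seconds')
```

In a real data collection, you would call the helper once per request, with longer waiting times than in this demo.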


__Exercise 4.1__

Modify the code above to sleep for 2 minutes. Go grab a coffee in between. Did it take you longer than 2 minutes?

(if you want to abort the running code, just select the cell and push the "stop" button)

In [80]:
# your answer goes here!

**Solution**  

In [81]:
time.sleep(2*60)
print("Done!")

Done!


### 4.2 Modularization

**Importance**  

In scraping, many things have to be executed *multiple times*. For example, whenever we open a new page on books.toscrape.com, we would like to extract all the available book links.

To help us execute things over and over again, we will "modularize" our code into functions. We can then call these functions whenever we need them. Another benefit of using functions is that they improve the readability and reusability of our code. If you need a quick refresher on functions, please revisit section 4 of the [Python Bootcamp](https://odcm.hannesdatta.com/docs/tutorials/pythonbootcamp/).
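
As a small illustration of modularization, the sketch below wraps the construction of full URLs in a function. It uses `urljoin` from Python's standard library as an alternative to prefixing the domain and removing `'../'` by hand, as done in the seed-collection loop earlier in this tutorial. The function name `make_absolute` and the example `href` values are assumptions for illustration; check the actual links on the page you scrape.

```python
from urllib.parse import urljoin

def make_absolute(page_url, href):
    """Turn a (possibly relative) link found on page_url into a full URL."""
    return urljoin(page_url, href)

page_url = 'https://books.toscrape.com/catalogue/category/books_1/'

# assumed example of a relative product link containing '../' segments
print(make_absolute(page_url, '../../sharp-objects_997/index.html'))

# assumed example of a relative link to the next page
print(make_absolute(page_url, 'page-2.html'))
```

Because the logic lives in one function, a change in how links are resolved only has to be made in one place.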

**Let's try it out**

Let's finish up our book URL scraper by putting together everything we have learned thus far.

1. We need a function that extracts all seeds, given a category URL. We would like to store these seeds in a JSON file.
2. We need a function that opens this JSON file, and captures all of the relevant product information (for now, let's use the title and price).

__Exercise 4.2__

Write a function to accomplish (1) above (capturing the seeds and storing them in a JSON file). Start with the solution in 3.4.

__Solution__

In [98]:
import json
import time

import requests
from bs4 import BeautifulSoup

def get_seeds(base_url = 'https://books.toscrape.com/catalogue/category/books_1/'):
    seeds = []
    url = base_url # initialize for first page
    counter = 0 # initialize counter so that you can break earlier from this loop when needed
    user_agent = {'User-agent': 'Mozilla/5.0'}

    while True:
        counter += 1

        if counter > 4: break # prototyping: stop after 4 pages; comment out this line to collect all pages

        print('Trying to get next page URL from ' + url)
        res = requests.get(url, headers = user_agent)
        soup = BeautifulSoup(res.text, "html.parser")

        # extract information
        books = soup.find_all(class_="product_pod")
        for book in books:
            url_book = book.find("a").attrs["href"]
            book_url = "https://books.toscrape.com/catalogue/" + url_book
            book_url = book_url.replace('../', '')
            seeds.append({'product_url': book_url,
                          'page_url': url,
                          'timestamp': int(time.time())})
        # next page available?
        try:
            url = base_url + soup.find(class_='next').find('a')['href']
        except (AttributeError, TypeError):
            break # no next page present (find() returned None)

    return seeds

data = get_seeds('https://books.toscrape.com/catalogue/category/books_1/')

# store the seeds as one JSON object per line
with open('seeds.json', 'w', encoding = 'utf-8') as f:
    for item in data:
        f.write(json.dumps(item))
        f.write('\n')

Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-2.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-3.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-4.html
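
The solution stores one JSON object per line, a layout often called JSON Lines. The following sketch shows how writing and reading such a file could be wrapped in two functions. The names `write_json_lines` and `read_json_lines` and the file name `example_seeds.json` are hypothetical; the example record is copied from the seed output shown earlier.

```python
import json

def write_json_lines(records, filename):
    """Write a list of dictionaries to filename, one JSON object per line."""
    with open(filename, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record))
            f.write('\n')

def read_json_lines(filename):
    """Read a file with one JSON object per line into a list of dictionaries."""
    with open(filename, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

example = [{'product_url': 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
            'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
            'timestamp': 1658753809}]

write_json_lines(example, 'example_seeds.json')
print(read_json_lines('example_seeds.json'))
```

Skipping empty lines when reading makes the function tolerant of a trailing newline at the end of the file.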


__Exercise 4.3__

Now, let's write some code that loads `seeds.json` and visits each of the product pages to extract the product title and price. Remember to build in a little timer (e.g., waiting for 1 second). The prototype/starting code below stops automatically after 5 iterations to minimize server load. When you think you're done, try disabling the prototyping condition by commenting it out with `#`!


In [105]:
# start from the code below
import json # we need the json package to parse the seed data
import time # we need the time package for implementing a bit of waiting time

with open('seeds.json', 'r', encoding = 'utf-8') as f:
    content = f.readlines() # let's read in the seed data

counter = 0 # initialize counter to 0

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break
        
    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)
    
    # show URL for which product information needs to be captured
    print(obj['product_url'])
    
    # eventually sleep for a second
    time.sleep(1)
    

https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
https://books.toscrape.com/catalogue/soumission_998/index.html
https://books.toscrape.com/catalogue/sharp-objects_997/index.html
https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html


__Solution__

In [110]:
import json # we need the json package to parse the seed data
import time # we need the time package for implementing a bit of waiting time
import requests
from bs4 import BeautifulSoup

with open('seeds.json', 'r', encoding = 'utf-8') as f:
    content = f.readlines() # let's read in the seed data

counter = 0 # initialize counter to 0
user_agent = {'User-agent': 'Mozilla/5.0'}

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break

    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)

    # show URL for which product information needs to be captured
    print(obj['product_url'])

    req = requests.get(obj['product_url'], headers = user_agent)
    soup = BeautifulSoup(req.text, "html.parser")

    retrieved_data = {'title': soup.find('h1').get_text(),
                      'price': soup.find(class_='price_color').get_text(),
                      'timestamp_retrieval': int(time.time())}

    # append the record to the output file (one JSON object per line);
    # note that re-running this cell adds the same books to the file again
    with open('book_data.json', 'a', encoding = 'utf-8') as f:
        f.write(json.dumps(retrieved_data))
        f.write('\n')

    # sleep for a second before sending the next request
    time.sleep(1)
 

https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
https://books.toscrape.com/catalogue/soumission_998/index.html
https://books.toscrape.com/catalogue/sharp-objects_997/index.html
https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html


In [115]:
# inspect data in pandas

import pandas as pd
pd.read_json('book_data.json', lines=True)

|   | title | timestamp_retrieval | price |
|---|---|---|---|
| 0 | Product Description | 2022-07-25 13:44:09 | |
| 1 | Product Description | 2022-07-25 13:44:11 | |
| 2 | Product Description | 2022-07-25 13:44:12 | |
| 3 | Product Description | 2022-07-25 13:44:14 | |
| 4 | Product Description | 2022-07-25 13:44:15 | |
| 5 | Product Description | 2022-07-25 13:44:42 | |
| 6 | Product Description | 2022-07-25 13:44:44 | |
| 7 | Product Description | 2022-07-25 13:44:45 | |
| 8 | Product Description | 2022-07-25 13:44:46 | |
| 9 | Product Description | 2022-07-25 13:44:48 | |
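
Because the solution opens `book_data.json` in append mode (`'a'`), every run of the cell adds its records to the ones already in the file. The ten rows above, in two groups of five with timestamps about half a minute apart, appear to come from two runs. One way to handle this, sketched below under the assumption that the title identifies a book, is to keep only the most recent record per title after loading the data. The function name `latest_per_key` and the sample records are hypothetical.

```python
def latest_per_key(records, key='title', time_field='timestamp_retrieval'):
    """Keep, for every value of key, the record with the largest time_field."""
    latest = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record[time_field] >= current[time_field]:
            latest[record[key]] = record
    return list(latest.values())

# made-up records: the same book retrieved in two separate runs
records = [
    {'title': 'Book A', 'price': '10.00', 'timestamp_retrieval': 100},
    {'title': 'Book B', 'price': '12.50', 'timestamp_retrieval': 101},
    {'title': 'Book A', 'price': '11.00', 'timestamp_retrieval': 200},
]

for record in latest_per_key(records):
    print(record)
```

Alternatively, delete `book_data.json` before re-running the cell so that each run starts with an empty file.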


### 4.3 Wrap-up

At the beginning of this tutorial, we promised that you would learn to write multi-page scrapers from start to finish. Although the examples we have studied are relatively simple, the same principles (seed definition, data extraction plan, page-level data collection) apply to any other website you'd like to scrape. 


## After-class exercises

### Exercise 1

Extending the code written for exercise 4.2 of this tutorial, please collect seeds from ten self-chosen product categories and store them in a file called `all_seeds.json`.

### Exercise 2

Please use the code written in exercise 4.3 of this tutorial and extend it to capture more information (e.g., not only title and price, but also other attributes/data points you are interested in). In particular, try getting the product description!

Try running your code and store the product data in a JSON file called `all_books.json`.

### Exercise 3

Please complete an entire data collection project in a `.py` file, capturing data for 10 product categories and all products contained on all of the pages. You can proceed in two steps: first collect the seeds, then obtain all data. In addition, convert all retrieved data to a CSV file (with rows and columns), using `pd.read_json(filename, lines = True)` to read in the JSON data, and the resulting DataFrame's `to_csv(filename)` method to save the data in tabular format.

Run your data collection from the terminal.

The final deliverables are
- `all_seeds.json`
- `all_books.json`
- `all_books.csv`




## Backup: Executing Python Files

### Jupyter Notebooks versus Spyder

Jupyter Notebooks are ideal for combining programming and markdown (e.g., text, plots, equations), making them the default choice for sharing and presenting reproducible data analyses. Since we can execute code blocks one by one, they are suitable for developing and debugging code on the fly. 

That said, Jupyter Notebooks also have some severe limitations when using them in production environments. That's where an "Integrated Development Environment" (IDE) comes in, such as Spyder or PyCharm. A fancy word, we know. So, let's revisit the most important differences.

First, the order in which you run cells within a notebook may affect the results. While prototyping, you may lose sight of the top-down order, which can cause problems once you restart the kernel (e.g., a library is imported after it is used). Second, there is no easy way to browse through directories and files within a Jupyter Notebook. Third, notebooks handle neither large codebases nor big data particularly well. 

That's why we recommend starting in Jupyter Notebooks, moving code into functions along the way, and, once everything seems to run well, copy-pasting all necessary code into Spyder. From there, you can save it as a Python file (`.py`) rather than a notebook (`.ipynb`) and execute the file from the command line. In this section, we introduce you to the Spyder IDE and show you how to run Python files from the command line. We chose Spyder over PyCharm, for example, because Spyder comes with Anaconda. In the future, you can always use PyCharm or another text editor to write your Python scripts if you prefer! 

### Introduction to Spyder
The first time, you need to click on the green "Install" button in Anaconda Navigator; afterwards, you can start Spyder by clicking on the blue "Launch" button (alternatively, type `spyder` in the terminal). 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/anaconda_navigator.png" width=90% align="left" style="border: 1px solid black" />


The main interface consists of three panels: 
1. **Code editor** = where you write Python code (i.e., the content of code cells in a notebook)
2. **Variable / files** = depending on which tab you choose, either an overview of all declared variables (e.g., to look up their type or change their values) or a file explorer (e.g., to open other Python files)
3. **Console** = the output of running the Python script from the code editor (what normally appears below each cell in a notebook)

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/spyder.png" width=90% align="left" style="border: 1px solid black" />

**Let's try it out!**     
Copy the solution from exercise 4.3 to a new file, called `webscraping_101.py`. To run the script you can

- click on the green play button to run all code, or
- highlight the parts of the script you want to execute and then click the run selection button.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/toolbar.png" width=40% align="left" style="border: 1px solid black" />

Once the script is running, you may need to interrupt the execution because it is simply taking too long or you spotted a bug somewhere. Click on the red rectangular button in the console to stop the execution. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/interrupt.gif" width=80% align="left" style="border: 1px solid black" />

### Run Python Files 

__For Mac and Linux users__

1. Open the terminal and navigate to the folder in which the `.py` file has been saved (use `cd` to change directories and `ls` to list all files).
2. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/running_python.gif" width=60% align="left" style="border: 1px solid black" />

__For Windows users__

1. Open Windows Explorer and navigate to the folder in which the `.py` file has been saved. Type `cmd` in the address bar and press Enter to open the command prompt in that folder. Alternatively, open the command prompt from the start menu (and use `cd` to change directories and `dir` to list files).
2. Activate Anaconda by typing `conda activate`.
3. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).
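
When code moves from a notebook into a `.py` file, a common Python convention is to put the top-level steps in a `main()` function and call it under an `if __name__ == '__main__':` guard, so that the file can be run from the terminal and also imported without starting the data collection. The sketch below is a hypothetical skeleton for `webscraping_101.py`; the optional command-line argument for the prototyping limit is an assumption and not part of the original exercise.

```python
import sys

def parse_limit(argv):
    """Return the first command-line argument as an integer limit, or None.

    argv is expected to look like sys.argv, e.g. ['webscraping_101.py', '5'].
    """
    if len(argv) > 1 and argv[1].isdigit():
        return int(argv[1])
    return None

def main(argv):
    limit = parse_limit(argv)
    if limit is None:
        print('No limit given: collecting everything')
    else:
        print('Prototyping: stopping after ' + str(limit) + ' items')
    # the calls to your own scraping functions would go here
    return limit

if __name__ == '__main__':
    main(sys.argv)
```

With this skeleton, `python webscraping_101.py 5` would pass `5` as the limit, while `python webscraping_101.py` runs without a limit.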