# Web Scraping 101 (oDCM)

*After finishing this tutorial, you can extract data from multiple pages on the web, and export such data to CSV files so that you can use it in an analysis. Plan a few hours to work through this notebook. Taking a few breaks in between keeps you sharp! Enjoy!*

--- 

## Learning Objectives

* Generate lists of entities to scrape data from
* Map navigation path on a website using URLs, and understand how to use parameters to modify results
* Select data for extraction on a website using tags, class names and attributes
* Write data to CSV file, and enrich with relevant metadata
* Bundle data capture in Python functions and modularize extraction code
* Loop through a list of URLs to capture data in bulk, using functions
* Understand the difference between Jupyter Notebooks and “raw” Python files, and run collection via the command line/terminal

--- 

## Acknowledgements
This tutorial has been inspired by various open-access online resources, which we list for further reference at the [course website](https://odcm.hannesdatta.com/docs/about). 

--- 

## Support Needed?
For technical issues outside of scheduled classes, please check the [support section](https://odcm.hannesdatta.com/docs/course/support) on the course website. 

---

## 1. Seed Generation


### 1.1 Collecting Links


__Importance__

In web scraping, we typically refer to a "seed" as a starting point for a data collection. Without a seed, there's no data to collect.

For example, before we can crawl through all books available on [this site](https://books.toscrape.com/catalogue/category/books_1/index.html), we first need to generate a *list of all books on the page*.

One way to get there would be to:

1. first scrape all book links (“seeds”) from the overview page, and 
2. then iterate over all links to scrape the product description (or anything else on that page). 

Note that the overview page allows us to "navigate" to the individual book pages, either by clicking on the book cover or the book title (see red boxes in the figure below). 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/books_links.png" align="left" width=80%/>

__Let's try it out__

Let's now check out how the links from the book covers or book titles are encoded in the website's source code.

Open the [book catalogue](https://books.toscrape.com/catalogue/category/books_1/index.html), and inspect the underlying HTML code with the Chrome Inspector (right click --> inspect element). 

The book covers (`<img>`) are surrounded by `<a>` tags, which contain a link (`href`) to the book. 

Also, the book titles (`<h3>`) are surrounded by `<a>` tags with the relevant links to the book pages.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/inspector_links.png" align="left" width=80%/>

How could we tell a computer to capture the links to the various books on the site?

One simple way is to select *elements by their tags*. For example, we can extract all links (`<a>` tags).

__Exercise 1__

Please run the code cell below, which extracts all links (the `a` tag!), and prints the URL (`href`) to the screen. Don't worry, you don't need to understand the code yet, we'll go over it line by line shortly!

If you look at these links more closely, you'll notice that we're not interested in many of these links... 

Make a list of all links we're *not* interested in (i.e., those *not* pointing to a book page). Which ones are those? Can you find out why they are there?

In [None]:
# Run this code now
import requests
from bs4 import BeautifulSoup

# make a get request to the books overview page (see Webdata for Dummies tutorial)
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

# loop over every <a> tag on the page and print its href attribute
for link in soup.find_all("a"): 
    print(link.attrs["href"])

**Your answer**

...

__Solution__

The links we want to ignore are...

* "Books to Scrape" link at the top
* "Home" breadcrumb link 
* Left sidebar with all book genres (e.g., Travel)
* The next button at the bottom

These links are present on the page because users need them to navigate the site. This can also be seen in the animation:

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/books_overview.gif" align="left" width=50%/>

### 1.2 Collecting *More Specific* Links

__Importance__

We've just discovered that selecting elements by their tags gives us many irrelevant links. But how can we narrow down these links, or, in other words, __how can we scrape only the book links we're interested in?__

To answer this question, we need to briefly revisit the notion of __HTML classes__. 

A __class__ is often used as a reference in the code, for example, to make all text elements with a given class blue or to increase their font size. In the Chrome Inspector screenshot shown earlier, you find an `<article>` tag with class `product_pod`, in which a `<div>` is nested that contains the image and the link (`href` attribute) we're after. 

Every link to a book is *nested within this class* (nested = "part of"). The "wrong links" extracted above (i.e., the ones in the page's header and sidebar) are *not*. 

Thus, if we can tell our scraper that we're only interested in the `<a>` tags *within the `product_pod` class*, we end up with our desired selection of links.

__Let's try it out__

Like before, we'll use `.find_all()` to capture all matching elements on the page. The difference, however, is that we specify __a class (`class_=`)__, rather than an HTML tag. From the inspector, we know the class name (`product_pod`). 

The result is a list with __all 20 elements of class `product_pod`__ on the page (i.e., one for each book). 

Run the code below, in which we pick the __first book__ from the list (A Light in the Attic, element `[0]`), and extract the `<a>` tag nested within the `product_pod` class. 

Finally, we pull out the `href` attribute from the `<a>` tag which gives us the book link. Unlike the example above, we have selected only a single element (`[0]`) and therefore don't need to loop over all links with a `for`-loop.

In [None]:
import requests
from bs4 import BeautifulSoup

# make a get request to the books overview page (see Webdata for Dummies tutorial)
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

# return the href attribute in the <a> tag nested within the first product class element
soup.find_all(class_="product_pod")[0].find("a").attrs["href"]

Note the `../../` in front of the link: this tells the browser to go up two directories, starting from the directory of the current page:
* Current URL: https://books.toscrape.com/catalogue/category/books_1/index.html (the page sits in the directory `books_1/`)
* 1 step back: https://books.toscrape.com/catalogue/category/
* 2 steps back: https://books.toscrape.com/catalogue/

Thereafter, it appends `a-light-in-the-attic_1000/index.html` to the URL which forms the full link to the [A Light in the Attic](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html) book. 

Pretty cool, right?
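
Python can do this resolution for you. The sketch below uses `urljoin` from the standard library's `urllib.parse` module, which is not used elsewhere in this tutorial; it combines the URL of the current page with a relative link in the same way a browser would. The two input strings are taken from the example above.

```python
from urllib.parse import urljoin

# URL of the overview page we are currently on
page_url = "https://books.toscrape.com/catalogue/category/books_1/index.html"

# relative link as it appears in the href attribute
relative_link = "../../a-light-in-the-attic_1000/index.html"

# resolve the relative link against the current page
full_link = urljoin(page_url, relative_link)
print(full_link)
```

The exercises below ask you to build the same URL by hand with slicing and `replace`, which is good practice for working with strings. In your own projects, `urljoin` is often the more robust choice because it does not depend on how many `../` parts a link contains.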

#### Exercise 2
1. Modify the script to extract the link from the *second book* (Tipping the Velvet), using BeautifulSoup.
2. Create a new variable `book_url` that concatenates the base URL (`https://books.toscrape.com/catalogue/`) and the string you extracted in question 1 (`../../tipping-the-velvet_999/index.html`). Use *slicing* to remove the `../../` part in between. The final output should be: `https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html` 
3. The `replace` method offers a more convenient way to "search and replace" in a string. The syntax is: `my_string = old_string.replace('text-to-replace', 'replace-by-text')`. Use `replace` to solve question 2 again.

In [None]:
# your answer goes here!

#### Solutions

In [None]:
# Question 1
url_book = soup.find_all(class_="product_pod")[1].find("a").attrs["href"]
print(url_book)

In [None]:
# Question 2 
base_url = "https://books.toscrape.com/catalogue/" # gives a 403 error if you run the URL separately but works as expected once combined with the book url
book_url = base_url + url_book[6:] # so we skip characters with index 0, 1, 2, 3, 4, 5: "../../"
print(book_url)

In [None]:
# Question 3
base_url = "https://books.toscrape.com/catalogue/"
book_url = base_url + url_book
book_url = book_url.replace('../', '')
print(book_url)

---

### 1.3 Iterating over items

__Importance__

Ideally, we'd like our code to extract the URL from *every* book on the page, not just *one* product.

In other words, we need a way to *iterate*/*loop* through the entire page to assemble a list of links (product pages) to scrape.

__Let's try it out__

Let's set up this exercise.

1. We have a BeautifulSoup object, holding all of the book previews (`soup.find_all(class_="product_pod")`)
2. We have an empty list, `book_urls`, that we would like to fill
3. We write a loop, which iterates through 1. and fills in 2.

Run the code below!

In [None]:
# list of all books on the overview page
books = soup.find_all(class_="product_pod")
book_urls = []

for book in books: 
    book_url = book.find("a").attrs["href"]
    book_urls.append(book_url)
    
# print the first five urls
print(book_urls[0:5])

In practice, it may be more convenient to store each book as a *dictionary* with the keys `title` and `url`, and to collect these dictionaries in a list. This way, it is more intuitive to look up the URL of a given book: you don't have to remember its exact position in the list but can simply search for its title. 

In the Chrome Inspector screenshot at the beginning of this section, you can see that the book title is stored in the `alt` attribute of the `<img>` tag (as well as in the `title` attribute of the second `<a>` tag). Using a similar approach as above, we collect the `book_title` and `book_url` of each book, and append these records to `book_list`.

In [None]:
book_list = []

for book in books: 
    book_title = book.find("img").attrs["alt"] 
    book_url = book.find("a").attrs["href"]
    book_list.append({'title': book_title,
                      'url': book_url})

As a result, we can simply pass the book title (mind the capitals!) to the following code snippet to obtain the corresponding URL.

In [None]:
next((book for book in book_list if book["title"] == "A Light in the Attic"), None)
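
If you look up titles often, a dictionary keyed by title is a compact alternative to searching the list with `next()`. The sketch below is an illustration only: `sample_book_list` is a hard-coded stand-in for `book_list` with two records, and `url_by_title` is a name introduced here.

```python
# two hard-coded records standing in for the scraped book_list
sample_book_list = [
    {"title": "A Light in the Attic",
     "url": "../../a-light-in-the-attic_1000/index.html"},
    {"title": "Tipping the Velvet",
     "url": "../../tipping-the-velvet_999/index.html"},
]

# build a dictionary with the title as key and the url as value
url_by_title = {book["title"]: book["url"] for book in sample_book_list}

# direct lookup by title
print(url_by_title["A Light in the Attic"])

# .get() returns None instead of raising an error for unknown titles
print(url_by_title.get("Black Dust"))
```

One design caveat: if two books share the same title, the later one overwrites the earlier one in the dictionary, whereas the list of dictionaries keeps both.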

#### Exercise 3
1. Like exercise 2.2, write code that transforms the relative URLs (`../..`) in `book_list` into full URLs, stored in `full_url`. Tip: you can use `for id, book in enumerate(book_list):` to iterate over the dictionaries and update URLs accordingly. 
2. One of the books on `books.toscrape.com` is [Black Dust](https://books.toscrape.com/catalogue/black-dust_976/index.html). What happens once you search for it using the code snippet above? Why is that? 

In [None]:
# your answer goes here!

#### Solutions

In [None]:
# Question 1
for id, book in enumerate(book_list):
    book["full_url"] = (base_url + book["url"]).replace('../','')

# show the first five elements
book_list[0:5]

In [None]:
# Question 2 
next((book for book in book_list if book["title"] == "Black Dust"), None)

# it returns None because the book is not in book_list (this book is shown on the 2nd page and we only scraped the first one!)

---

### 1.4 Page Navigation

__Importance__

Alright - what have we learnt up to this point?

- Section 1.1 taught us how to extract links from a page, 
- Section 1.2 taught us how to extract *more specific links* from a page, and finally
- Section 1.3 taught us how to assemble a list of *links* to *all* books listed on a specific page.

So... what's missing?

Exactly! The website [`books.toscrape.com`](https://books.toscrape.com/catalogue/category/books_1/index.html) contains __1000 books__, spread across __50 pages__. 

So, the goal of this section is to navigate through the __entire book assortment__, not only the first 20 books.



__Let's try it out__

Open [the website](https://books.toscrape.com/catalogue/category/books_1/index.html), and click on the "next" button at the bottom of the page.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/books.png" align="left" width=60%/>


Repeat this a couple of times, and observe how the URL in your navigation bar is changing...

- `https://books.toscrape.com/catalogue/category/books_1/page-1.html`
- `https://books.toscrape.com/catalogue/category/books_1/page-2.html`
- `https://books.toscrape.com/catalogue/category/books_1/page-3.html`

Can you guess the next one...?

Indeed! The URL can be divided into a __fixed base part__ (`https://books.toscrape.com/catalogue/category/books_1/`), and a __counter__ that is dependent on the page you're visiting (e.g., `page-1.html`). 

__Now let's create a list of all 50 URLs!__ 

First, we create a counter variable, which we now set to 1 (but it can take on any value later on). Then, we concatenate the `base_url` with the counter (note that we have to convert the integer counter to a string before we can do that, using the `str` function).

In [None]:
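base_url = "https://books.toscrape.com/catalogue/category/books_1/"  # the fixed base part identified above
counter = 1
full_url = base_url + "page-" + str(counter) + ".html" 
print(full_url)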

In a similar fashion, we generate a list of 50 `page_urls` with a for loop that starts at 1 and ends at 50. Note that `range(1, 51)` excludes its end value, so the last counter is 50 (not 51!). 

In [None]:
base_url = "https://books.toscrape.com/catalogue/category/books_1/"
page_urls = []

for counter in range(1, 51):
    full_url = base_url + "page-" + str(counter) + ".html" 
    page_urls.append(full_url)

As expected, this gives a list of all page URLs that contain books. 

In [None]:
# print the number of page urls (btw, run print(page_urls) for yourself to see all page URLs!)
print("The number of page urls in the list is: " + str(len(page_urls)))
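
The same list can be written more compactly. This sketch combines a list comprehension with an f-string, which inserts the counter into the text without an explicit `str()` conversion; the base URL and the page range are the ones used above.

```python
base_url = "https://books.toscrape.com/catalogue/category/books_1/"

# one URL per page number from 1 up to and including 50
page_urls = [f"{base_url}page-{counter}.html" for counter in range(1, 51)]

print(len(page_urls))   # number of URLs in the list
print(page_urls[-5:])   # the last five URLs
```

Both versions produce the same result, so pick the one you find easier to read.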

#### Exercise 4
In this exercise, we practice generating a seed for another website, [`quotes.toscrape.com`](https://quotes.toscrape.com/), which displays 100 famous quotes from GoodReads, categorized by tag. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/quotes.png" align="left" width=60% style="border: 1px solid black" />

1. Make yourself comfortable with how the [site](https://quotes.toscrape.com) works and ask yourself questions such as: how does the navigation work, how many pages are there, what is the base URL, and how does it change if I move to the next page?
2. Generate a list `quote_page_urls` that contains the page URLs we need if we'd like to scrape all 100 quotes.

In [None]:
# your answer goes here!

#### Solutions
1. The 100 quotes are evenly spread across 10 pages. The base URL is `https://quotes.toscrape.com/page/` followed by a page number between 1 and 10.

In [None]:
# Question 2
base_url = "https://quotes.toscrape.com/page/"
quote_page_urls = []

for counter in range(1, 11):
    full_url = base_url + str(counter)
    quote_page_urls.append(full_url)

print(quote_page_urls)

### 1.5 Wrap-Up
In summary, we have defined our seed and thought about a data extraction strategy to obtain the book links on a page. Since there are multiple pages, we needed to generate a list of URLs as an input for our scraper, which we'll further refine in the next chapter.  

--- 

## 2. Data Extraction


### 2.1 Timers

__Importance__

Before we start running the scraper, we need to realize that sending many requests in a short time can overload a server. Therefore, it's highly recommended to pause between requests rather than sending them all in quick succession. This reduces the risk that your IP address (i.e., the numerical label assigned to each device connected to the internet) gets blocked, after which you could no longer visit (and scrape) the website. 

__Let's try it out__

In Python, you can import the `sleep` function from the `time` module, which pauses the execution of subsequent commands for a given amount of time. For example, the print statement after `sleep(5)` will only be executed after 5 seconds:


In [None]:
# run this cell again to see the timer in action yourself!
from time import sleep
sleep(5)
print("I'll be printed to the console after 5 seconds!")
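
Instead of always waiting for exactly the same time, you could also vary the pause a little. The sketch below is one possible way to do so and is not part of the scraper built in this tutorial: `polite_pause` is a hypothetical helper name, and the default bounds of 1 and 3 seconds are arbitrary choices.

```python
import random
from time import sleep

def polite_pause(min_seconds=1.0, max_seconds=3.0):
    '''sleep for a random duration between min_seconds and max_seconds and return that duration'''
    duration = random.uniform(min_seconds, max_seconds)
    sleep(duration)
    return duration

# short bounds here so that the example finishes quickly
waited = polite_pause(0.1, 0.3)
print("Waited " + str(round(waited, 2)) + " seconds")
```

In a scraper, you would call `polite_pause()` wherever the tutorial calls `sleep(1)`.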

__Exercise 5__

Modify the code above to sleep for 2 minutes. Go grab a coffee in between. Did it take you longer than 2 minutes?

(if you want to abort the running code, just select the cell and push the "stop" button)

In [None]:
# your answer goes here!

**Solution**  

In [None]:
sleep(2*60)
print("Done!")

---

### 2.2 Modularization

**Importance**  

In scraping, many things have to be executed *multiple times*. For example, whenever we open a new page with books, we would like to extract all the available book links.

To help us execute things over and over again, we will "modularize" our code into functions. We can then call these functions whenever we need them. Another benefit from using functions is that we can improve the readability and reusability of our code. If you need a quick refresher on functions, please revisit section 4 of the [Python Bootcamp](https://odcm.hannesdatta.com/docs/tutorials/pythonbootcamp/) tutorial.

**Let's try it out**

Let's finish up our book URL scraper by putting together everything we have learned thus far.

First, we define a function `generate_page_urls()` that takes a base URL and an upper limit of the number of pages (`num_pages`) as input parameters. This way, we can easily update our scraper if more books are added or if the base URL changes (e.g., change `num_pages` from `5` to `6` if we also want to include the 6th page). 

In [None]:
def generate_page_urls(base_url, num_pages):
    '''generate a list of full page urls from a base url and a counter that takes on the values between 1 and num_pages'''
    page_urls = []
    
    for counter in range(1, num_pages + 1):
        full_url = base_url + "page-" + str(counter) + ".html"
        page_urls.append(full_url)
        
    return page_urls

Try running the function and modifying its parameters (e.g., set the number of book pages to `10` rather than `5`).

In [None]:
generate_page_urls("https://books.toscrape.com/catalogue/category/books_1/", 5)


Second, let's define an `extract_book_urls()` function, which takes a list of page URLs (`page_urls`; like the one above!) as input and returns a list of dictionaries with book titles and URLs. Note the two-step structure of the for-loops: on every page (`page_url`), we create a `books` object which we subsequently loop over by extracting the `book_title` (e.g., `A Light in the Attic`) and `book_url` (e.g., `https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html`) from each book. These records are added to the list `book_list` which is eventually returned by the function. Make sure to fully understand this function line by line before moving on!

In [None]:
def extract_book_urls(page_urls):
    '''collect the book title and url for every book on all page urls'''
    book_list = []
    
    # collect all books on page_url
    for page_url in page_urls: 
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, "html.parser")
        books = soup.find_all(class_="product_pod")
        
        # for each book on that page look up the title and url and store it in a list
        for book in books: 
            book_title = book.find("img").attrs["alt"] 
            book_url = "https://books.toscrape.com/catalogue/" + book.find("a").attrs["href"].replace('../','')
            book_list.append({"title": book_title,
                             "url": book_url}) 
            
        sleep(1)  # pause 1 second after each request
            
    return book_list

So, let's try out this function. Be aware that running it takes some time.

In [None]:
# this cell references functions in other cells, therefore make sure you have loaded all cells above first! (Cell > Run All Above)
base_url = "https://books.toscrape.com/catalogue/category/books_1/"
page_urls = generate_page_urls(base_url, 2) # to save time and resources we only scrape the first 2 pages
book_list = extract_book_urls(page_urls)

In [None]:
# Preview the results
book_list[0:5]

__Exercise 6__

1. Please obtain a list of URLs for products stored on the first *five* pages. 
2. Please extend the `extract_book_urls` function to also obtain information on whether the book is in stock. Make use of this code snippet to search for the particular class: `book.find("p", class_="class-name-to-search-for")`
3. Please clean the text snippet obtained in 2 by removing the unnecessary line breaks (`\n`) and spaces (`" "`), using Python's `replace` function. Finally, test your function!

In [None]:
# Your answer goes here

**Solutions**  

In [None]:
# Question 1

base_url = "https://books.toscrape.com/catalogue/category/books_1/"
page_urls = generate_page_urls(base_url, 5) 
book_list = extract_book_urls(page_urls)
book_list

In [None]:
# Question 2
def extract_book_urls(page_urls):
    '''collect the book title and url for every book on all page urls'''
    book_list = []
    
    # this part is the same as above
    for page_url in page_urls: 
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, "html.parser")
        books = soup.find_all(class_="product_pod")

        for book in books: 
            book_title = book.find("img").attrs["alt"] 
            book_url = ("https://books.toscrape.com/catalogue/" + book.find("a").attrs["href"]).replace('../','')
            book_instock = book.find("p", class_="instock availability").text # only this changed!
            book_list.append({"title": book_title,
                             "url": book_url,
                             "instock": book_instock}) # and this line!
            
        sleep(1)  
            
    return book_list

In [None]:
# Question 3
def extract_book_urls(page_urls):
    '''collect the book title and url for every book on all page urls'''
    book_list = []
    
    for page_url in page_urls: 
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, "html.parser")
        books = soup.find_all(class_="product_pod")

        for book in books: 
            book_title = book.find("img").attrs["alt"] 
            book_url = "https://books.toscrape.com/catalogue/" + book.find("a").attrs["href"].replace('../','')
            book_instock = book.find("p", class_="instock availability").text
            
            # addition to clean up the text (the rest remains the same!)
            book_instock = book_instock.replace('\n','').replace(' ','') # first replace each line break (`\n`) by an empty string, then replace each space (' ') by an empty string
            
            book_list.append({"title": book_title,
                             "url": book_url,
                             "instock": book_instock})
            
        sleep(1) 
            
    return book_list

# test function!
base_url = "https://books.toscrape.com/catalogue/category/books_1/"
page_urls = generate_page_urls(base_url, 2) 
book_list = extract_book_urls(page_urls)
book_list
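
Removing *all* spaces also removes the space inside the text itself, which is why the cleaned value reads `Instock`. The sketch below compares the `replace` approach with the string method `strip()`, which only removes whitespace at the beginning and end. The value of `raw_text` is an assumed example of what the uncleaned text may look like, not output copied from the site.

```python
# assumed example of the uncleaned text, including line breaks and indentation
raw_text = "\n\n    \n        In stock\n    \n"

# approach from the solution: remove every line break and every space
no_whitespace = raw_text.replace("\n", "").replace(" ", "")
print(repr(no_whitespace))

# alternative: only trim whitespace at the beginning and the end
trimmed = raw_text.strip()
print(repr(trimmed))
```

Which one you choose matters later on: a filter such as `book["instock"] == "Instock"` only matches values that were cleaned with the `replace` approach.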

---

### 2.3 Next Page Button

__Importance__

So far, the book link extraction has worked without problems. Yet, there's still one little improvement that we can make. *If the number of pages changes*, we need to manually update the `num_pages` parameter. For example, we would miss new books that are added on page 51 and beyond. 

A general solution is therefore to look up whether there is a `next` button on the page (HTML code below). If so, it means a next page exists, and we keep on incrementing the page counter by 1. If not, it means we have reached the last page. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/next_page.png" align="left" width=60% style="border: 1px solid black" />

__Let's try it out__

So, let's write a function (`check_next_page()`), which takes a URL as an input and returns the outgoing link of the next button (if present):

In [None]:
def check_next_page(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    next_btn = soup.find(class_= "next") # observe the similarity with the code snippet used above
    return next_btn.find("a").attrs["href"] if next_btn else None

page_1 = "https://books.toscrape.com/catalogue/page-1.html"
print("The next page is: " + check_next_page(page_1))

#### Exercise 7
1. Pass `https://books.toscrape.com/catalogue/page-50.html` to `check_next_page()` and observe the output. Is that what you expected? 
2. Write a function `next_page_url()` that checks whether the output of `check_next_page()` is not equal to `None` (i.e., anything but `None`). If so, it should return a new variable `page_url` that concatenates the base URL and the relative path to the next page. If not, it should print the statement `This is already the last page!`. Tip: make use of `if`/`else` statements.

In [None]:
# your answer goes here!

#### Solutions

In [None]:
# Question 1 
output = check_next_page("https://books.toscrape.com/catalogue/page-50.html")
print(output) # the output is None because page 50 is the last one

In [None]:
# Question 2 
def next_page_url(url):
    base_url = "https://books.toscrape.com/catalogue/"
    if url is not None: 
        page_url = base_url + url 
        return page_url 
    else: 
        print("This is already the last page!")
        
next_page_url(check_next_page("https://books.toscrape.com/catalogue/page-50.html"))
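
The solution above hard-codes the base URL, so it only works for pages that live directly under `/catalogue/`. As a possible alternative, the sketch below resolves the link of the next button against the URL of the page it was found on, using `urljoin` from the standard library. The helper name `resolve_next_page` is hypothetical and not part of the tutorial's scraper.

```python
from urllib.parse import urljoin

def resolve_next_page(current_url, next_href):
    '''return the full URL of the next page, or None if there is no next page'''
    if next_href is None:
        return None
    # resolve the relative link against the page it was found on
    return urljoin(current_url, next_href)

# the relative link "page-2.html" is resolved within the directory of the current page
print(resolve_next_page("https://books.toscrape.com/catalogue/page-1.html", "page-2.html"))
print(resolve_next_page("https://books.toscrape.com/catalogue/category/books_1/index.html", "page-2.html"))

# on the last page there is no next button, so the href is None
print(resolve_next_page("https://books.toscrape.com/catalogue/page-50.html", None))
```

Because the current page URL is passed in, the same function works for the overview pages under `/catalogue/` and for those under `/catalogue/category/books_1/`.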

---
### 2.4 Combining everything in one function

__Importance__

Our scraper so far consists of a function that extracts books from a page (`extract_book_urls()`), a function that checks whether a next page is available (`check_next_page()`), and a function that builds the URL of the next page (`next_page_url()`). 

As a last step, we can now integrate these functions into one *overarching* function. Instead of generating the list of page URLs up front, we use a `while` loop that keeps running as long as there is another page. At the end of each iteration, we update the `page_url` according to the link of the next button (using `check_next_page()`). On the last page, there is no new page URL and thus we break out of the while loop. We've added a print statement at the beginning of the `while` loop, so that you can observe the progress of the scraper while it is running.

All in all, we have modularized our code into functions, made it future-proof (e.g. if new books are added), and reduced the number of lines of code to get the job done! 

In [None]:
def extract_all_books(page_url):
    books = []
    while page_url:
        print(page_url)
        for book in extract_book_urls([page_url]):
            books.append(book)
        
        next_page = check_next_page(page_url)  # look up the next button only once and reuse the result
        if next_page is not None: 
            page_url = "https://books.toscrape.com/catalogue/category/books_1/" + next_page
        else: 
            break
        
        # if "page-4" in page_url: break # (activate this if you don't want to run the entire loop)
    return books

__Let's try it out__

Run the cell below to see the scraper in action! You may need to wait for a bit as the scraper loops through all 50 pages. If you don't feel like taking a coffee break, remove the `#` sign in front of the `if` statement to abort the function before it starts with page 4.

In [None]:
book_list = extract_all_books("https://books.toscrape.com/catalogue/page-1.html")

In [None]:
book_list
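
To see how such a `while` loop ends without sending any requests, you can simulate the "next" buttons with a dictionary. Everything in the sketch below is made up for illustration: `FAKE_NEXT_PAGE` stands in for a small website with three pages on `example.com`, and `walk_pages` is a hypothetical function name.

```python
# made-up website: each page URL maps to the URL behind its "next" button,
# or to None on the last page
FAKE_NEXT_PAGE = {
    "https://example.com/page-1.html": "https://example.com/page-2.html",
    "https://example.com/page-2.html": "https://example.com/page-3.html",
    "https://example.com/page-3.html": None,
}

def walk_pages(start_url):
    '''follow the "next" links from start_url and return all visited page urls'''
    visited = []
    page_url = start_url
    while page_url:  # the loop stops as soon as page_url becomes None
        visited.append(page_url)
        page_url = FAKE_NEXT_PAGE[page_url]
    return visited

pages = walk_pages("https://example.com/page-1.html")
print(pages)
```

The loop does not need to know the number of pages in advance; it simply stops when a page has no next link.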

#### Exercise 8

After having run the scraper above, inspect the output yourself, and then answer the following questions.

1. How many books are there in `book_list`? Does this align with your initial expectations?
2. A friend recommended the book `The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics`. After looking up the reviews on [GoodReads](https://www.goodreads.com/book/show/25986790-the-activist-s-tao-te-ching?ac=1&from_search=true&qid=jpcvOsxKfP&rank=1), you decide to look for a copy of the book online. Does [books.toscrape.com](https://books.toscrape.com) offer a copy in their store? If so, is it currently in stock?
3. How many books are in stock currently?
4. How many books are there that have the word "boat" (lower or upper case) in their title?

In [None]:
# your answer goes here!

#### Solutions


In [None]:
# Question 1
# There are 1000 books 

len(book_list)

# That's 50 pages with 20 products each, which matches our expectations.

In [None]:
book_list[0]

In [None]:
# Question 2
# we use one of the code snippets from above to search for the title

next((book for book in book_list if book["title"] == 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics'), None)

# we can view the URL and open it in the browser.

In [None]:
# Question 3
books_instock = [book for book in book_list if book["instock"] == "Instock"]
len(books_instock)

# All books are in stock!

In [None]:
# Question 4
len([book for book in book_list if "boat" in book["title"].lower()])

# here, we're checking for the appearance of the word "boat" in the lowercased title, so that "Boat" and "boat" both match.

In case you haven't done so, it's time to take a break now! Enjoy!

---

### 2.5 Page-Level Data Collection

**Importance**   
Do you remember trying to obtain the URL of the [Black Dust](https://books.toscrape.com/catalogue/black-dust_976/index.html) book in exercise 3? Let's see whether it works this time... (you have to run the entire code above for all 50 pages!)

In [None]:
[book for book in book_list if book["title"] == "Black Dust"]

Excellent, it works flawlessly! But why did we need the book URLs in the first place? They form the seed for other web scraping efforts. For example, the product descriptions can only be obtained from the book pages themselves, which means we need to loop over all book URLs to extract the right information. 

**Let's try it out!**  

In the follow-up exercise, we'll look at how to do this. So... open the [website](https://books.toscrape.com/catalogue/black-dust_976/index.html) in your browser, and run the code cell below to extract the star rating of that particular book.

In [None]:
res = requests.get('https://books.toscrape.com/catalogue/black-dust_976/index.html')
soup = BeautifulSoup(res.text, "html.parser")
len(soup.find(id="content_inner").find("p", class_ = "star-rating").find_all(class_ = "icon-star"))

After running the cell, inspect the website's source code in Chrome, and try to understand the extraction code above. A good way to do so is to break down your extraction code into small chunks, and run them one after another.

In [None]:
soup

This gave you the entire source code of the website - not so useful as a starting point, so let's zoom in on what is labeled in the source as the "Start of product page".

In [None]:
soup.find(id="content_inner")

This one already looks better, scrolling down just a little bit gives us already the title and price of the product. Can you find this information also in Chrome's Inspector Tool?

Let's proceed by zooming in even more...

In [None]:
soup.find(id="content_inner").find_all("p")

We've now filtered for all content items with tag `p`, and can spot the target class: "star-rating"! So let's go there...

In [None]:
soup.find(id="content_inner").find_all("p", class_ = "star-rating")

Wow - so many star ratings! The list contains the star rating of the product (overall), and the reviewer's individual star rankings. Let's just extract the first star rating for now.

In [None]:
soup.find(id="content_inner").find("p", class_ = "star-rating")

Much better. But... where can we see the number of stars? It's in the class name ("star-rating Five"), but we can also just count the number of "icon-star" elements nested within it.

In [None]:
soup.find(id="content_inner").find("p", class_ = "star-rating").find_all(class_ = "icon-star")

The last thing to do is to count how many items are in that list, by using the `len` function.

In [None]:
len(soup.find(id="content_inner").find("p", class_ = "star-rating").find_all(class_ = "icon-star"))
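
As noted above, the rating is also encoded in the class name. The sketch below is one way to turn that class name into a number. It assumes that BeautifulSoup hands you the `class` attribute as a list of names (e.g., `['star-rating', 'Five']`) and that ratings are spelled `One` through `Five`. The helper name `rating_from_classes` and the word-to-number mapping are our own illustration, not part of the website or the library. This can serve as a cross-check, because counting `icon-star` elements only reflects the rating if the page renders exactly one icon per awarded star.

```python
# Hypothetical helper: convert the class list of a star-rating tag to a number.
WORD_TO_NUMBER = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_from_classes(classes):
    """Return the rating encoded in a class list such as ['star-rating', 'Five'].

    Returns None if none of the class names is a known rating word.
    """
    for name in classes:
        if name in WORD_TO_NUMBER:
            return WORD_TO_NUMBER[name]
    return None

# In the notebook, the class list would come from something like:
# classes = soup.find(id="content_inner").find("p", class_="star-rating")["class"]
print(rating_from_classes(["star-rating", "Five"]))
```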

What a journey! We hope you enjoyed this little exploration activity, and are ready for the next exercises.

#### Exercise 9
1. Please write a function `get_book_description` to extract the product description for the first five books in `book_list`.  

2. Run the function and inspect the output. If you look carefully, you may spot `tÃ©gÃ©` symbols throughout the product description. Look up the original text on the book pages and compare it side-by-side with the output below.

In [None]:
# your answer goes here!

#### Solutions

In [None]:
# Question 1
def get_book_description(books):
    book_descriptions = []
    
    for book in books: 
        page_url = book["url"]

        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, "html.parser")

        # tip: look at the Chrome Inspector screenshot below 
        description = soup.find(id="content_inner").find_all("p")[3].get_text()
        title = soup.find(id="content_inner").find('img')['alt']
        book_descriptions.append({'url': page_url,
                                  'title': title,
                                  'description': description})
    return book_descriptions

book_descriptions = get_book_description(book_list[0:5])
book_descriptions

# Question 2
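# Strings such as tÃ©gÃ© are accented characters (e.g., é) that were decoded with the wrong character encoding.
# The page is most likely sent as UTF-8, whereas `res.text` decoded it as ISO-8859-1 (the fallback `requests` uses when the server does not declare an encoding).
# Setting `res.encoding = "utf-8"` right after `requests.get()` is one way to fix this.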

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/black_dust.png" align="left" width=70% style="border: 1px solid black" />
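
To see where symbols like `Ã©` can come from, the following sketch reproduces the effect without any web request. It assumes the text was sent as UTF-8 but decoded as ISO-8859-1 (Latin-1). The example word is our own choice and is not copied from the book page.

```python
# Example word chosen for illustration (an assumption, not copied from the book page)
original = "protégé"

# Step 1: the server sends the text as UTF-8 bytes
raw_bytes = original.encode("utf-8")

# Step 2: decoding those bytes with the wrong encoding garbles every accented character
garbled = raw_bytes.decode("latin-1")
print(garbled)

# Step 3: reversing the wrong decoding and using UTF-8 instead restores the text
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

With `requests`, setting `res.encoding = "utf-8"` before reading `res.text` applies the same idea at the moment the page is fetched.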

---
### 2.6 Scraping to a CSV file

**Importance**  
Lastly, we convert the list of dictionaries into a Comma Separated Values (CSV) file, which you can open up in any spreadsheet program (e.g., Excel). 

More specifically, we'd like to have a file with three columns, containing:
- the book title, 
- the product description, and 
- the current date and time. 

The latter helps you to distinguish between data from scrapers you run repeatedly. For example, you may run the book scraper at the beginning of every month to keep track of price changes of any of the books. Although you could store the data of each extraction moment in a separate file (e.g., `2021_01_01_book_prices.csv` for January 2021, `2021_02_01_book_prices.csv` for February 2021), we recommend always including a timestamp column in your scraped datasets. After all, losing or overwriting data can be disastrous (especially for scrapers), as you may never be able to obtain historical data again (e.g., the price of a book 2 months ago).

In that light, we import the `datetime` class from the `datetime` module. Its `now()` method returns the current date and time, which we'll incorporate into our final dataset. Run the cell a few times, and observe how the values update to the current time: 

In [None]:
from datetime import datetime

now = datetime.now()
print(now)
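
If you prefer a fixed text format for the timestamp, or want to build date-stamped file names like the ones mentioned above, `isoformat()` and `strftime()` can help. The sketch below uses a fixed example date so that the output is reproducible; the file name pattern simply mirrors the example file names above.

```python
from datetime import datetime

# Fixed example moment (in a real scraper you would use datetime.now())
moment = datetime(2021, 1, 1, 9, 30, 0)

# ISO 8601 text, convenient for a timestamp column
timestamp = moment.isoformat(sep=" ")
print(timestamp)   # 2021-01-01 09:30:00

# Date-stamped file name
file_name = moment.strftime("%Y_%m_%d") + "_book_prices.csv"
print(file_name)   # 2021_01_01_book_prices.csv
```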

In essence, CSV files are simply text files in which a symbol (the delimiter) marks the beginning of a new column. Below you find a screenshot of the `book_descriptions.csv` file opened in a basic text editor. Every `;` indicates the start of a new column, and every line break (enter) indicates the start of a new row.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/csv_files.png" align="left" width=50% style="border: 1px solid black" />

Excel then applies this logic - splitting on semicolons and line breaks - to assign the data points to their respective cells: 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/excel.png" align="left" width=50% style="border: 1px solid black" />

It gets more complicated once the delimiter also appears in the data itself. For example, a comma is the most common delimiter, but the product description also contains commas (e.g., `No matter how busy he keeps himself, successful Broadway...`). A program that naively splits each line on commas would regard the part after the comma (`successful Broadway...`) as a new column, whereas it actually still belongs to the product description. Python's `csv` module guards against this by wrapping such fields in quotes, but not every program reads those quotes correctly. For that reason, setting the delimiter to `;`, which is rarer in running text, is a safer choice here. In practice, tabs (`\t`) are also frequently used.
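
Here is a short sketch of how Python's `csv` module deals with a delimiter inside a field. It writes one made-up row (the title is a placeholder, the description is the snippet quoted above) to an in-memory buffer with a comma as delimiter, and then reads it back.

```python
import csv
import io

# Placeholder row for illustration; the description contains a comma
row = ["Example Title", "No matter how busy he keeps himself, successful Broadway..."]

buffer = io.StringIO()
csv.writer(buffer, delimiter=",").writerow(row)
line = buffer.getvalue()
print(line)

# A naive split on commas breaks the description into two pieces...
print(len(line.strip().split(",")))

# ...whereas csv.reader honors the quotes and returns the two original fields
parsed = next(csv.reader(io.StringIO(line), delimiter=","))
print(parsed == row)
```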

**Let's try it out!**   
We can write to a text file with the `csv` library. The first row is the header and contains the three column names (`"title", "description", "date_time"`). Thereafter, we iterate over the list and add the current date and time to each row. Importantly, the `w` flag in the `open()` call indicates that the file will be overwritten every time the cell is executed. If you, however, want to append data to an existing file and avoid losing historical data, you can swap `w` for `a`. 

In [None]:
import csv 
from datetime import datetime

# newline="" prevents empty rows between records on Windows
with open("book_descriptions.csv", "w", newline="", encoding="utf-8") as csv_file: # <<- this is the line with the "flag"; see exercises below
    writer = csv.writer(csv_file, delimiter=";")
    writer.writerow(["title", "description", "date_time"])  # header row
    now = datetime.now()  # one timestamp for the entire run
    for book in book_descriptions: # here we reference the book_descriptions list - make sure it's loaded otherwise you get an error! (Cell > Run All Above)
        writer.writerow([book['title'], book['description'], now])
print('done!')

#### Exercise 10
1. Run the cell above and look at the `book_descriptions.csv` file in Excel. Make sure it looks like the screenshot above (3 columns x 4 rows). Depending on the language settings on your machine, the data may not be correctly distributed over the columns. In that case, go to the "Data" tab in Excel, click the "Text to Columns" button in the ribbon, choose "Delimited", put a checkmark in front of "Semicolon", and choose "Finish".

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/text_to_column.gif" align="left" width=60% style="border: 1px solid black" />

2. Close Excel, change the flag to `a`, and run the cell again. Open the `book_descriptions.csv` file again (and repeat the Text to Columns procedure if necessary). How does the output differ from the previous step? Why is that? 

In [None]:
# your answer goes here!

#### Solutions  
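It shows the same data, including the header, twice (below one another). The `a` flag appends the new rows to the end of the existing file instead of overwriting it, and our code writes the header row on every run. It goes beyond the scope of this course to define better alternatives (e.g., save data to a database).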
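
That said, a small improvement that stays within plain CSV files is to write the header only when the file is new. The sketch below shows the idea with a hypothetical helper `append_rows` (the name, its parameters, and the example rows are ours). It checks whether the file already exists and has content before deciding to write the header, and it writes to a demo file in the temporary folder so that your own `book_descriptions.csv` stays untouched.

```python
import csv
import os
import tempfile
from datetime import datetime

def append_rows(file_name, header, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(file_name) or os.path.getsize(file_name) == 0
    with open(file_name, "a", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file, delimiter=";")
        if needs_header:
            writer.writerow(header)
        now = datetime.now()
        for row in rows:
            writer.writerow(list(row) + [now])

# Example with placeholder data, written to a separate demo file
demo_file = os.path.join(tempfile.gettempdir(), "append_demo.csv")
if os.path.exists(demo_file):
    os.remove(demo_file)  # start from a clean slate for this demo

header = ["title", "description", "date_time"]
append_rows(demo_file, header, [["Book A", "First description"]])   # first run
append_rows(demo_file, header, [["Book B", "Second description"]])  # second run

with open(demo_file, encoding="utf-8") as f:
    print(f.read())
```

After two runs, the header appears only once, followed by the rows of both runs, each with its own timestamp.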

---

### 2.7 Wrap-Up
At the beginning of this tutorial, we promised to show you how to write multi-page scrapers from start to finish. Although the examples we have studied are relatively simple, the same principles (seed definition, data extraction plan, page-level data collection) apply to any other website you'd like to scrape. 

Now that you have hopefully got the hang of using Jupyter Notebooks, we're going to introduce you to an alternative that goes hand in hand with what you have learned thus far, but overcomes some of its limitations.

## 3. Executing Python Files

### 3.1 Jupyter Notebooks versus Spyder

Jupyter Notebooks are ideal for combining programming and markdown (e.g., text, plots, equations), making them the default choice for sharing and presenting reproducible data analyses. Since we can execute code blocks one by one, they are also suitable for developing and debugging code on the fly. 

That said, Jupyter Notebooks also have some severe limitations when using them in production environments. That's where an "Integrated Development Environment" (IDE) comes in, such as Spyder or PyCharm. A fancy word, we know. So, let's revisit the most important differences.

First, the order in which you run cells within a notebook may affect the results. While prototyping, you may lose sight of the top-down hierarchy, which can cause problems once you restart the kernel (e.g., a library is imported after it is used). Second, there is no easy way to browse through directories and files within a Jupyter Notebook. Third, notebooks handle neither large codebases nor big data particularly well. 

That's why we recommend starting in Jupyter Notebooks, moving code into functions along the way, and, once all seems to be running well, copying all necessary code into Spyder. From there, you can save it as a Python file (`.py`) - rather than a notebook (`.ipynb`) - and execute the file from the command line. In this tutorial, we introduce you to the Spyder IDE and show you how to run Python files from the command line. We choose Spyder over, for example, PyCharm because Spyder comes with Anaconda. In the future, you can always use PyCharm or another text editor to write your Python scripts if you prefer! 
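
To illustrate what such a `.py` file might look like, here is a minimal skeleton: all imports sit at the top, the work is split into functions, and a `main()` function runs only when the file is executed directly. The function names and the placeholder data are our own assumptions and do not reproduce the course's own script. In a real scraper, `collect_books()` would contain the `requests` and `BeautifulSoup` code from this notebook.

```python
# All imports at the top, so nothing is used before it is imported
import csv
import os
import tempfile
from datetime import datetime


def collect_books():
    """Placeholder for the scraping step; returns hard-coded example data."""
    return [{"title": "Book A", "url": "https://example.com/a"},
            {"title": "Book B", "url": "https://example.com/b"}]


def save_books(books, file_name):
    """Write the collected books to a semicolon-separated file with a timestamp."""
    now = datetime.now()
    with open(file_name, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(["title", "url", "date_time"])
        for book in books:
            writer.writerow([book["title"], book["url"], now])
    return len(books)


def main():
    # Written to the temporary folder here; a real script would pick its own path
    file_name = os.path.join(tempfile.gettempdir(), "skeleton_demo.csv")
    n_saved = save_books(collect_books(), file_name)
    print(f"saved {n_saved} books to {file_name}")


if __name__ == "__main__":
    main()
```

Because every step lives in a function, you can test `collect_books()` and `save_books()` separately before running the whole file.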

### 3.2 Introduction to Spyder
The first time, you need to click on the green "Install" button in Anaconda Navigator. After that, you can start Spyder by clicking on the blue "Launch" button (alternatively, type `spyder` in the terminal). 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/anaconda_navigator.png" width=90% align="left" style="border: 1px solid black" />


The main interface consists of three panels: 
1. **Code editor** = where you write Python code (i.e., the content of code cells in a notebook)
2. **Variable / files** = depending on which tab you choose either an overview of all declared variables (e.g. look up their type or change their values) or a file explorer (e.g., to open other Python files)
3. **Console** = the output of running the Python script from the code editor (what normally appears below each cell in a notebook)

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/spyder.png" width=90% align="left" style="border: 1px solid black" />

**Let's try it out!**     
In the `webscraping_101.py` file shown above, we have put together all code snippets from this notebook needed to scrape and store the URLs of all books. To run the entire script (lines 1 to 46), click on the green play button. Alternatively, you can highlight the parts of the script you want to execute and then click the run selection button.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/toolbar.png" width=40% align="left" style="border: 1px solid black" />

Once the script is running, you may need to interrupt the execution because it is simply taking too long or you spotted a bug somewhere. Click on the red rectangle (the stop button) in the console to stop the execution. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/interrupt.gif" width=80% align="left" style="border: 1px solid black" />

#### Exercise 11
1. Download the Python [webscraping_101.py](https://odcm.hannesdatta.com/docs/tutorials/webscraping101/webscraping_101.py) script (right-click, download linked file as…) and store it in the same directory as the `.ipynb` notebook file (`.py` = Python script; `.ipynb` = Jupyter notebook). 
2. Start Spyder and open the `webscraping_101.py` script (`File` > `Open`), so not the notebook! Compare this notebook and the Python script in Spyder side-by-side: which do you find clearer? 
3. Run the script and then open the `book_urls.csv` file in Excel. Where is the file stored on your computer? How many records are there?

In [None]:
# your answer goes here!

#### Solutions
2. It remains a personal opinion, but we'd say the `.py` file looks neater because all the code is in the same view (e.g., all import statements below each other rather than spread throughout your notebook).
3. Exported files appear in the working directory, i.e., the folder from which the script is run (unless specified differently). The `book_urls.csv` file contains 1000 rows (999 records and 1 header row).
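
If you are unsure which folder that is, Python can tell you. Below is a small sketch using the standard `os` module; the file name is the one from the exercise, and the printed paths depend on your machine.

```python
import os

# Folder from which Python is currently running (the "working directory")
working_directory = os.getcwd()
print(working_directory)

# Full path that a relative file name such as "book_urls.csv" resolves to
full_path = os.path.abspath("book_urls.csv")
print(full_path)
```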

### 3.3 Run Python Files 
* *Mac*
    1. Open the terminal and navigate to the folder in which the `.py` file has been saved (use `cd` to change directories and `ls` to list all files).
    2. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webscraping101/images/running_python.gif" width=60% align="left" style="border: 1px solid black" />

* *Windows*
    1. Open Windows Explorer and navigate to the folder in which the `.py` file has been saved. Type `cmd` in the address bar and press Enter to open the command prompt in that folder. Alternatively, open the command prompt from the start menu (and use `cd` to change directories and `dir` to list files).
    2. Activate Anaconda by typing `conda activate`.
    3. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).

### 3.4 Wrap-up

Congrats! You've made it, and learnt so much. Take a step back now, let it sink in, and then get creative on how you could use the skills you've learnt. 

This is the end of this tutorial. Keep up the good work!