# Girls Who Code - The Python Series
## Web Scraping
## Mentor - Amir ElTabakh


**Welcome** to the third Python workshop of the Fall 21 semester! Today we're going to learn about Web Scraping.

Web scraping is the process of extracting data from websites. For our webscrape, we will use Chrome Developer Tools to identify HTML components. We will also use the Python BeautifulSoup and Splinter libraries to automate a web browser and gather the data we've identified.

Web scraping is a method of collecting data from different web sources quickly instead of manually visiting each one, which can be time consuming. This is our first taste of using Python to really automate a process. There are many steps to collecting data from the web, which we'll go over today.

In this workshop we will:
- Go over HTML code and how websites are structured
- Write a Python script that automates exploring the web with Splinter
- Collect data from a website using BeautifulSoup

Splinter is the tool that automates a web browser. It's pretty cool to see your computer navigate the web all on its own.
BeautifulSoup will extract the data needed for analysis.

## HTML

HTML is a markup language used for creating webpages. It’s built using specific tags arranged in a nested order, a bit like building blocks. For example, if we wanted a header and a paragraph in the same section of a webpage, we would nest `<h1 />` and `<p />` tags inside a `<div />` tag, with the `<div />` tag acting as a box to hold the other pieces.

```
<div>
   <h1>Hello, world!</h1>
   <p>This is a great beginning.</p>
</div>
```

Think of a webpage as a window into the internet. HTML is the glass, boards, and blinds on that window. Just like there are many sizes and shapes to windows, each webpage has been customized to present users with a view into a different topic. Consider a weather report delivered through a weather site. Think of a news source or social media platform. Each of these examples is built using custom HTML. Our first step will be to explore that design so that we can write a script that knows what it's looking at when it interacts with a webpage.

Open VS Code and create a file named index.html. This file can be saved to your desktop because it's just for practice.

In this blank HTML file put an exclamation point on the first line and press Enter. This should autofill the editor to contain everything we need for a basic HTML page.

```
<!DOCTYPE html>
<html lang="en">
<head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
 <meta http-equiv="X-UA-Compatible" content="ie=edge">
 <title>Document</title>
</head>
<body>
</body>
</html>
```

Most elements have opening and closing tags, which are identical except for the forward slash that begins the closing tag. The closing tags represent the end of that HTML element.

Let's define each HTML tag shown in the code above:

1. `<!DOCTYPE html>` is a declaration, not a tag. It tells web browsers in which HTML version the document is written. This should always be the first line in an HTML document.

2. `<head>` is the opening tag that serves as a container for the setup elements. Just as imports go in the top cell of a Jupyter notebook or at the top of a Python script, HTML imports (e.g., a stylesheet or a library) go within the `<head>`.

3. `<meta>` is short for "metadata" and tells the web browser basic information, such as page width.

4. `<title>` and `</title>` are the opening and closing tags that serve as a container for the page title displayed on the tab at the top of your web browser. In the example above, the title is "Document".

5. `</head>` is the closing tag for the `<head>` tag, much like the end of a code block in Python.

6. `<body>` and `</body>` are opening and closing tags. They also serve as a container, but for data we can see (navigation menus, lists, and paragraphs).

7. `<html lang="en">` and `</html>` are opening and closing tags that serve as a container for all elements within an HTML page.


An easy way to keep the tags in visual order is by using indentation. Containers nested within other containers are indented by two to four spaces. This helps to keep our code clean and easy to understand. Nesting is when HTML elements are contained within other elements. Picture a set of nesting dolls with each nested in proper order, by design, into the largest doll. It is the same for HTML tags—they must be in the correct order to not break the design of the webpage.
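To see nesting from Python's point of view, here is a minimal sketch that uses the standard library's `html.parser` module to print each opening tag indented by how deeply it is nested. The `OutlinePrinter` class and the `VOID_TAGS` set are names made up for this illustration. They are not part of Splinter or BeautifulSoup, and the set lists only a few common tags that have no closing tag.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
# (Illustrative subset only.)
VOID_TAGS = {"meta", "br", "img", "link", "input", "hr"}


class OutlinePrinter(HTMLParser):
    """Collect each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then step one level deeper.
        self.outline.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag means we step back out of that container.
        if tag not in VOID_TAGS:
            self.depth -= 1


page = """
<div>
  <h1>Hello, world!</h1>
  <p>This is a great beginning.</p>
</div>
"""

printer = OutlinePrinter()
printer.feed(page)
print("\n".join(printer.outline))
```

The `h1` and `p` lines come out indented under `div`, which mirrors the "box holding other pieces" idea described above.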

Let's take another look at this webpage, only with a few more elements added to it:

```
<!DOCTYPE html>
<html lang="en">
 <head>
   <meta charset="UTF-8" />
   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
   <meta http-equiv="X-UA-Compatible" content="ie=edge" />
   <title>Document</title>
 </head>
 <body>
   <h1>Hello, world!</h1>
   <p>
     Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin aliquet
     iaculis lorem non sollicitudin. Fusce elementum ac elit finibus auctor.
     Curabitur orci sem, accumsan a diam sit amet, efficitur tristique velit.
   </p>
   <ul>
     <li>First list item</li>
     <li>Second list item</li>
     <li>Third list item</li>
   </ul>
 </body>
</html>
```
There are several more tags within the `<body />` container. Add this new code to your index.html file and save it. Then, open the file by navigating to it and double-clicking it. Now you have a simple static webpage open in your browser, built from scratch. It's not super exciting yet, but that's okay. It's the innards of the page we're focusing on right now.

Let's review the new tags:

1. `<h1 />` is a first-level header. The text in this tag will be displayed bigger and bolder than the rest of the page's text. There are many different headers available to use, from h1 to h6, with h1 returning the largest text.
2. `<p />` is a paragraph tag, currently holding lorem ipsum sentences. (Lorem ipsum is dummy text used to stage websites.) More can be read about it on the Lorem Ipsum reference website.
3. `<ul />` is an unordered list.
4. `<li />` is a list item.

_Question_: What does it mean when the `<li />` tags are inside the `<ul />` tags?

This is only a small taste of how many tags exist out there. Remember, these tags are all part of website customization. Without the variety available to use, websites would look plain and uninspired. The sites we intend to scrape data from are far more sophisticated, using many more combinations of tags than what we've discussed here. Understanding the basic layout and how nesting and containers work is an important part of successful web scraping.

We know that when we scrape data from the web, we're simply pulling specific data from websites we've chosen. How do we specify the data? Let's say we want the latest news article from a Mars website. Before we can program our script to pull that data, we have to tell it where to look. Basically, our script would say, "look in this `<div />` tag, then look inside that for a `<p />` tag."

That's a simple way of putting it. For an extensive list of tags, visit W3Schools' developer site: [HTML tags](https://www.w3schools.com/tags/tag_comment.asp).
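The "look in this `<div />`, then look inside that for a `<p />`" idea can be sketched with BeautifulSoup on a small HTML string, with no browser needed. The HTML snippet and the variable names `page`, `container`, and `paragraph` are made up for this illustration. A real news page will have a different structure.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page standing in for a real website.
html = """
<div class="article">
  <h1>Mars News</h1>
  <p>Latest article summary goes here.</p>
</div>
"""

page = BeautifulSoup(html, "html.parser")

# Step 1: look in this <div> tag...
container = page.find("div")

# Step 2: ...then look inside that for a <p> tag.
paragraph = container.find("p")

print(paragraph.text)
```

Searching inside `container` rather than the whole page is what keeps the script from picking up a `<p />` tag that lives somewhere else on the page.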

## Splinter

One of the fun things about web scraping is the automation—watching your script at work.

1. Once you execute your completed scraping script, a new Chrome web browser will pop up with a banner across the top that says "Chrome is being controlled by automated test software."
2. This message lets you know that your Python script is directing the browser. The browser will visit websites and interact with them on its own.
3. Depending on how you've programmed your script, your browser will click buttons, use a search bar, or even log in to a website.

In [None]:
!pip install splinter
!pip install bs4
!pip install webdriver_manager

In [2]:
# Importing Dependencies
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

In [None]:
# Mac users use this block of code

# Set the executable path and initialize the chrome browser in splinter
executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
browser = Browser('chrome', **executable_path)

In [3]:
# Windows users use this block of code
# Setup splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

[WDM] - 

[WDM] - Current google-chrome version is 95.0.4638
[WDM] - Get LATEST driver version for 95.0.4638
[WDM] - Driver [C:\Users\amira\.wdm\drivers\chromedriver\win32\95.0.4638.54\chromedriver.exe] found in cache


An HTML page can get very confusing very quickly, so let's practice on a less sophisticated site first. There are several sites available specifically for newly minted web scrapers to practice and hone their skills with Splinter and BeautifulSoup. These practice sites contain several different components that we'll encounter out in the wild: buttons to navigate, search bars, and nested HTML tags. It's a great introduction to how the tools we'll use work together to gather the data we want.

Let's scrape data from a website specifically created for practicing web scraping! Head to [Quotes to Scrape](https://quotes.toscrape.com/).

In [4]:
# Setup splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

# Visit the Quotes to Scrape site
url = 'http://quotes.toscrape.com/'
browser.visit(url)

# Parse the HTML
html = browser.html
html_soup = soup(html, 'html.parser')

# Scrape the Title
title = html_soup.find('h2').text
print("Title: " + title)

[WDM] - 

[WDM] - Current google-chrome version is 95.0.4638
[WDM] - Get LATEST driver version for 95.0.4638
[WDM] - Driver [C:\Users\amira\.wdm\drivers\chromedriver\win32\95.0.4638.54\chromedriver.exe] found in cache


Title: Top Ten tags


Here's what the last two lines of code did:

- We used the `html_soup` object we created earlier and chained `find()` to it to search for the `<h2 />` tag.
- We extracted only the text within the HTML tags by adding `.text` to the end of the code.

We've completed our first actual scrape. Let's practice again, this time scraping the actual tags to go with the title we just pulled.

In [5]:
# Setup splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

# Visit the Quotes to Scrape site
url = 'http://quotes.toscrape.com/'
browser.visit(url)

# Scrape the top ten tags
tag_box = html_soup.find('div', class_='tags-box')
# tag_box
tags = tag_box.find_all('a', class_='tag')

for tag in tags:
    word = tag.text
    print(word)

[WDM] - 

[WDM] - Current google-chrome version is 95.0.4638
[WDM] - Get LATEST driver version for 95.0.4638
[WDM] - Driver [C:\Users\amira\.wdm\drivers\chromedriver\win32\95.0.4638.54\chromedriver.exe] found in cache


love
inspirational
life
humor
books
reading
friendship
friends
truth
simile


This code looks similar to our last cell, with the difficulty increased a bit by a for loop. Let's start at the beginning.

The first scraping line, `tag_box = html_soup.find('div', class_='tags-box')`, creates a new variable `tag_box`, which stores the result of a search. In this case, we're looking for the first `<div />` element with a class of `tags-box`, and we're searching for it in the HTML we parsed earlier and stored in the `html_soup` variable.

The second line, `tags = tag_box.find_all('a', class_='tag')`, is similar to the first but with a few tweaks to make the search more specific. The new "tags" variable will hold the results of a `find_all`, but this time we're searching through the parsed results stored in our `tag_box` variable to find `<a />` elements with a `tag` class.

We used `find_all` this time because we want to capture all results, instead of a single or specific one.

Next, we've added a for loop. This for loop cycles through each tag in the `tags` variable, strips the HTML code out of it, and then prints only the text of each tag.

### Scrape Across Pages
Now that we've practiced scraping items from a single page, we're going to up the ante by scraping items that span multiple pages. Our next section of code will scrape the quotes on the first page, click the "Next" button, then scrape more quotes and so on (five pages' worth of quotes).

The first two lines do two things: They assign an actual URL to the variable named "url" and then tell Splinter to visit that webpage. We'll create a for loop to collect each quote, "click" the next button, then collect the next set of quotes. We'll use `range(1, 6)` in our for loop to visit the first five pages of the website.

Go ahead and execute this cell. This will cause the automated browser to navigate there and run our script.

In [6]:
# Setup splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

url = 'http://quotes.toscrape.com/'
browser.visit(url)

for x in range(1, 6): # A for loop with five iterations
    
    html = browser.html # An HTML object assigned to the `html` variable
    
    quote_soup = soup(html, 'html.parser') # Use BeautifulSoup to parse the `html` object
    
    quotes = quote_soup.find_all('span', class_='text') # Use BeautifulSoup to find all `<span />` tags with a class of "text"
    
    for quote in quotes: # Print statements wrapped in another for loop
        print('page:', x, '----------')
        print(quote.text)
        
    browser.links.find_by_partial_text('Next').click() # Use Splinter to find the 'Next' link and click it

[WDM] - 

[WDM] - Current google-chrome version is 95.0.4638
[WDM] - Get LATEST driver version for 95.0.4638
[WDM] - Driver [C:\Users\amira\.wdm\drivers\chromedriver\win32\95.0.4638.54\chromedriver.exe] found in cache


page: 1 ----------
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
page: 1 ----------
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
page: 1 ----------
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
page: 1 ----------
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
page: 1 ----------
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
page: 1 ----------
“Try not to become a man of success. Rather become a man of value.”
page: 1 ----------
“It is better to be hated for what you are than to be loved for what you are not.”
page: 1 ----------
“I have not failed. I've just found 10,000 ways that won't work.”
page: 1 ----------
“A woman is like a tea bag; you never know how strong it is u

NASA has a very friendly Terms of Service (or ToS, also known as Terms of Use) when it comes to web scraping. So let's scrape some data from there! In the next cell of your Jupyter notebook, we'll assign the url and instruct the browser to visit it.

In [7]:
# Visit the mars nasa news site
url = 'https://mars.nasa.gov/news/'
browser.visit(url)
# Optional delay for loading the page
browser.is_element_present_by_css("ul.item_list li.slide", wait_time=1)

True

The line `browser.is_element_present_by_css("ul.item_list li.slide", wait_time=1)` accomplishes two things.

First, it searches for elements with a specific combination of tag (`ul` and `li`) and class (`item_list` and `slide`, respectively). For example, `ul.item_list` would be found in HTML as `<ul class="item_list">`.

Second, it tells the browser to wait up to one second for a matching element to appear. The optional delay is useful because sometimes dynamic pages take a little while to load, especially if they are image-heavy.

In the next cell, we'll set up the HTML parser:

In [8]:
html = browser.html
news_soup = soup(html, 'html.parser')
slide_elem = news_soup.select_one('ul.item_list li.slide')

Notice how we've assigned `slide_elem` as the variable that holds the result of looking for the `<ul />` tag and its descendant `<li />` tags (the tags nested within the `<ul />` element). This is our parent element: it holds all of the other elements we want within it, and we'll reference it when we want to filter search results even further.

The `.` is used for selecting classes, such as `item_list`, so the selector `'ul.item_list li.slide'` pinpoints an `<li />` tag with a class of `slide` nested inside a `<ul />` tag with a class of `item_list`. A CSS selector is read from right to left: the rightmost part names the element that is returned, and the parts to its left only describe where that element must sit. Because of this, `select_one` returns the first matching `<li />` element with a class of `slide`, along with all elements nested within it.
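Here is a small sketch of how the selector `'ul.item_list li.slide'` behaves, run against a made-up HTML string rather than the live NASA page. The snippet, the `other_list` class, and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two lists, only one of which has the class "item_list".
html = """
<ul class="item_list">
  <li class="slide">First story</li>
  <li class="slide">Second story</li>
</ul>
<ul class="other_list">
  <li class="slide">Unrelated item</li>
</ul>
"""

page = BeautifulSoup(html, "html.parser")

# select_one returns only the first element that matches the selector.
first_slide = page.select_one("ul.item_list li.slide")

# select returns every element that matches the selector.
all_slides = page.select("ul.item_list li.slide")

print(first_slide.name)   # the returned element is the <li>, not the <ul>
print(first_slide.text)
print(len(all_slides))    # the <li> inside "other_list" is not counted
```

The element that comes back is the `<li />`, and the `ul.item_list` part only narrows down where that `<li />` is allowed to live.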

After opening the page in a new browser, right-click to inspect and activate your DevTools. Then search for the HTML components you'll use to identify the title and paragraph you want.

_Question_: Which HTML attribute will we use to scrape the article’s title?


There are two methods used to find tags and attributes with BeautifulSoup:
- `.find()` is used when we want only the first element that matches the tag and attributes we've specified.
- `.find_all()` is used when we want to retrieve every element that matches the tag and attributes.

For example, if we were to use `.find_all()` instead of `.find()` when pulling the summary, we would retrieve all of the summaries on the page instead of just the first one.
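The difference is easy to see on a small, made-up HTML string. The three summaries below are placeholders, not real articles, and the variable names are chosen just for this sketch.

```python
from bs4 import BeautifulSoup

# Made-up HTML with three summaries that share the same class.
html = """
<div class="article_teaser_body">First summary</div>
<div class="article_teaser_body">Second summary</div>
<div class="article_teaser_body">Third summary</div>
"""

page = BeautifulSoup(html, "html.parser")

# .find() stops at the first match and returns a single element.
first = page.find("div", class_="article_teaser_body")
print(first.text)

# .find_all() returns a list-like collection of every match.
every = page.find_all("div", class_="article_teaser_body")
summaries = [tag.text for tag in every]
print(summaries)
```

Because `.find_all()` returns a collection, we loop over it (here with a list comprehension) to reach the text of each element.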

In [9]:
# Scrape the title

# Setup splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

# Visit the mars nasa news site
url = 'https://mars.nasa.gov/news/'
browser.visit(url)

# Optional delay for loading the page
browser.is_element_present_by_css("ul.item_list li.slide", wait_time=1)

html = browser.html
news_soup = soup(html, 'html.parser')
slide_elem = news_soup.select_one('ul.item_list li.slide')

slide_elem.find("div", class_='content_title')

[WDM] - 

[WDM] - Current google-chrome version is 95.0.4638
[WDM] - Get LATEST driver version for 95.0.4638
[WDM] - Driver [C:\Users\amira\.wdm\drivers\chromedriver\win32\95.0.4638.54\chromedriver.exe] found in cache


<div class="content_title"><a href="/news/9063/you-can-help-train-nasas-rovers-to-better-explore-mars/" target="_self">You Can Help Train NASA's Rovers to Better Explore Mars</a></div>

In this line of code, we chained `.find` onto our previously assigned variable, `slide_elem`. When we do this, we're saying, "This variable holds a ton of information, so look inside of that information to find this specific data." The data we're looking for is the content title, which we've specified by saying, "The specific data is in a `<div />` with a class of `'content_title'`."

The title is in that mix of HTML in our output—that's awesome! But we need to get just the text, and the extra HTML stuff isn't necessary. We'll add something new to our `.find()` method here: `.get_text()`. When this new method is chained onto `.find()`, only the text of the element is returned.

In [10]:
# Use the parent element to find the first `div` with the class `content_title` and save its text as `news_title`
news_title = slide_elem.find("div", class_='content_title').get_text()
news_title

"You Can Help Train NASA's Rovers to Better Explore Mars"

In [11]:
# Use the parent element to find the paragraph text
news_p = slide_elem.find('div', class_="article_teaser_body").get_text()
news_p

'Members of the public can now help teach an artificial intelligence algorithm to recognize scientific features in images taken by NASA’s Perseverance rover.'

Great job! Stretch your scraping skills by visiting [Books to Scrape](http://books.toscrape.com/) and scraping the book URL list on the first page.
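One hint for this exercise: links on a page are often written as relative paths, so you may need to join them to the site's address to get full URLs. A minimal sketch using the standard library's `urljoin` follows. The relative paths below are made up for illustration and are not scraped from the real site.

```python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"

# Hypothetical relative links, standing in for href values you might scrape.
relative_links = [
    "catalogue/book-one/index.html",
    "catalogue/book-two/index.html",
]

# Join each relative path to the base address to build a full URL.
book_urls = [urljoin(base_url, link) for link in relative_links]

for book_url in book_urls:
    print(book_url)
```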

Many websites don't want automated browsers visiting their sites and snagging data. If there are too many visits, the server hosting the site could get overloaded and shut down. Administrators can then ban the IP address of the person doing the scraping, making it more difficult to even manually visit the site to view data.

**IMPORTANT**\
Terms of Service and Terms of Use bring up an ethical issue when gathering data. Many websites don't allow automated browsing and scraping—some of the scraping scripts out there are designed to gather data quickly, and the constant traffic can overload web servers and disable a website.
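One simple way to keep traffic gentle is to pause between page visits. The sketch below is only an illustration: `visit_politely` is a hypothetical helper, not part of Splinter, and the one-second default is an arbitrary choice. In the workshop you could pass `browser.visit` as the `visit` argument. Here a plain list stands in for the browser so the sketch runs on its own.

```python
import time


def visit_politely(urls, visit, delay_seconds=1.0):
    """Call visit(url) for each URL, pausing between consecutive calls."""
    visited = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every visit after the first one.
            time.sleep(delay_seconds)
        visit(url)
        visited.append(url)
    return visited


# Stand-in for browser.visit so the sketch runs without opening a browser.
seen = []
visit_politely(
    ["http://quotes.toscrape.com/", "http://books.toscrape.com/"],
    seen.append,
    delay_seconds=0.1,
)
print(seen)
```

A pause like this does not replace reading a site's Terms of Service. It only reduces the load your script places on the server.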