# 10.3.1 Use Splinter

Set the executable path and initialize a browser:

# Set up Splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)
With these two lines of code, we create an instance of a Splinter browser, which preps our automated browser. The 'chrome' argument specifies that we'll use Chrome as our browser. **executable_path unpacks the dictionary we've stored the path in (think of it as unpacking a suitcase), so its executable_path key is passed to Browser as a keyword argument. headless=False means that all of the browser's actions will be displayed in a Chrome window so we can see them.
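The unpacking step can be tried on its own, without launching Chrome. The following is a minimal sketch: open_browser is a hypothetical stand-in for Browser, and the driver path is a made-up placeholder. It only shows that ** turns each key/value pair of a dictionary into a keyword argument.

```python
# Hypothetical stand-in for Browser(); it just echoes back the arguments it received.
def open_browser(driver_name, executable_path=None, headless=True):
    return {
        'driver_name': driver_name,
        'executable_path': executable_path,
        'headless': headless,
    }

# Placeholder path for illustration only, not a real driver location.
settings = {'executable_path': '/path/to/chromedriver'}

# ** turns each key/value pair in the dictionary into a keyword argument,
# so these two calls pass exactly the same arguments.
unpacked = open_browser('chrome', **settings, headless=False)
explicit = open_browser('chrome', executable_path='/path/to/chromedriver', headless=False)

print(unpacked == explicit)
```

Keeping the path in a dictionary means the same settings can be reused or extended with more keys without changing the call itself.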

The cell that initializes the Splinter browser may take a couple of seconds to finish, but an empty webpage should open automatically, ready for instructions. You'll know that it's an automated browser because it displays a special message stating so.

This browser now belongs to Splinter for the duration of our coding.

Splinter provides us with many ways to interact with webpages. It can input terms into a Google search bar for us and click the Search button, or even log us into our email accounts by inputting a username and password combination.

In [1]:
# Import our scraping tools: Browser from splinter, BeautifulSoup (aliased as soup), and ChromeDriverManager, which installs the driver for Chrome.
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

In [2]:
# set the executable path and initialize a browser
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

[WDM] - 

[WDM] - Current google-chrome version is 94.0.4606
[WDM] - Get LATEST driver version for 94.0.4606
[WDM] - There is no [win32] chromedriver for browser 94.0.4606 in cache
[WDM] - Get LATEST driver version for 94.0.4606
[WDM] - Trying to download new driver from https://chromedriver.storage.googleapis.com/94.0.4606.61/chromedriver_win32.zip
[WDM] - Driver has been saved in cache [C:\Users\lavin\.wdm\drivers\chromedriver\win32\94.0.4606.61]


# 10.3.2 Practice with Splinter and BeautifulSoup

# Scrape the Title
Now let's scrape the title.

The code below tells Splinter which site to visit by assigning the link to the url variable and passing it to browser.visit. After executing that cell, we use BeautifulSoup to parse the HTML and then read the text of the first <h2 /> element on the page.

In [3]:
# Visit the Quotes to Scrape site
url = 'http://quotes.toscrape.com/'
browser.visit(url)

In [4]:
# Parse the HTML
html = browser.html
html_soup = soup(html, 'html.parser')

In [5]:
# Scrape the Title
title = html_soup.find('h2').text
title

'Top Ten tags'

# Scrape All of the Tags
The <div /> container holding all of the tags has two classes: col-md-4 and tags-box. The col-md-4 class is a Bootstrap feature. Bootstrap is an HTML and CSS framework that simplifies adding functional components that look nice by default. In this case, col-md-4 means that this webpage is using a grid layout, and it's a common class that many webpages use.

Expand the tags-box div to take a look at its contents.

From here, we can see a list of <span /> elements, each with a class of tag-item. Open some of the <span /> elements to see what they contain; if you see <a /> elements with the names in the list that we're targeting, then we're in the right place. Search for tag-item and note the number of returned results. If there are 10, then we're ready to go.

The first line, tag_box = html_soup.find('div', class_='tags-box'), creates a new variable tag_box, which will be used to store the results of a search. In this case, we're looking for <div /> elements with a class of tags-box, and we're searching for it in the HTML we parsed earlier and stored in the html_soup variable.

The second line, tags = tag_box.find_all('a', class_='tag'), is similar to the first but with a few tweaks to make the search more specific. The new "tags" variable will hold the results of a find_all, but this time we're searching through the parsed results stored in our tag_box variable to find <a /> elements with a tag class.

We used find_all this time because we want to capture all results, instead of a single or specific one.

Next, we've added a for loop. This for loop cycles through each tag in the tags variable, strips the HTML code out of it, and then prints only the text of each tag.




In [11]:
# Scrape the top ten tags
tag_box = html_soup.find('div', class_='tags-box')
# tag_box
tags = tag_box.find_all('a', class_='tag')

for tag in tags:
    word = tag.text
    print(word)

love
inspirational
life
humor
books
reading
friendship
friends
truth
simile
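
The same find / find_all pattern can be tried without a browser. The following is a minimal sketch that assumes a small, made-up HTML snippet imitating the structure described above (it is not the live page). The sample_ names are illustrative only.

```python
from bs4 import BeautifulSoup as soup

# Made-up HTML that imitates the tags-box structure; not copied from the live site.
sample_html = """
<div class="col-md-4 tags-box">
  <h2>Top Ten tags</h2>
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
  <span class="tag-item"><a class="tag" href="/tag/life/">life</a></span>
  <span class="tag-item"><a class="tag" href="/tag/humor/">humor</a></span>
</div>
"""

sample_soup = soup(sample_html, 'html.parser')

# find returns only the first match, or None when nothing matches.
sample_box = sample_soup.find('div', class_='tags-box')

# find_all returns every match, here searched only inside sample_box.
sample_tags = sample_box.find_all('a', class_='tag')

# .text strips the HTML and keeps only the visible text of each element.
tag_words = [tag.text for tag in sample_tags]
print(tag_words)
```

Because find returns None when there is no match, checking the result before calling find_all on it avoids an AttributeError if the page layout changes.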


# Scrape Across Pages
Next, we'll scrape items that span multiple pages. Our next section of code will scrape the quotes on the first page, click the "Next" button, scrape the quotes on the following page, and so on until we have scraped the quotes on five pages.

In the next cell, we'll create a for loop that will do the following:

1. Create a BeautifulSoup object
2. Find all the quotes on the page
3. Print each quote from the page
4. Click the "Next" button at the bottom of the page

We'll use range(1, 6) in our for loop to visit the first five pages of the website.

for x in range(1, 6):
    # Parse the page that is currently open in the browser
    html = browser.html
    quote_soup = soup(html, 'html.parser')

    # Find every quote on the page
    quotes = quote_soup.find_all('span', class_='text')

    for quote in quotes:
        print('page:', x, '----------')
        print(quote.text)

    # Use Splinter to click the 'Next' button
    browser.links.find_by_partial_text('Next').click()
    
BeautifulSoup can search for text, but the syntax is typically the same: we look for a tag first, then an attribute. We can search for items using only a tag, such as a <span /> or <h1 />, but a class or id attribute makes the search that much more specific.
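
To make the difference between these searches concrete, here is a minimal sketch on a made-up HTML snippet (the markup and the demo_ names are assumptions for illustration, not taken from the site). It compares a tag-only search with searches narrowed by class, id, and text.

```python
from bs4 import BeautifulSoup as soup

# Made-up HTML for illustration only.
demo_html = """
<h1 id="site-title">Quotes</h1>
<span class="text">First quote</span>
<span class="author">Someone</span>
<span class="text">Second quote</span>
"""

demo_soup = soup(demo_html, 'html.parser')

# Tag only: matches every <span />, whatever its class.
all_spans = demo_soup.find_all('span')

# Tag plus class: narrows the search to the quote text.
text_spans = demo_soup.find_all('span', class_='text')

# Tag plus id: an id should be unique on a page, so find is enough.
demo_title = demo_soup.find('h1', id='site-title')

# Tag plus text: matches the <span /> whose text is exactly 'Someone'.
by_text = demo_soup.find('span', string='Someone')

print(len(all_spans), len(text_spans), demo_title.text, by_text['class'])
```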
    


In [12]:
# Use range(1, 6) in our for loop to visit the first five pages of the website.
for x in range(1, 6):
    html = browser.html
    quote_soup = soup(html, 'html.parser')
    quotes = quote_soup.find_all('span', class_='text')
    for quote in quotes:
        print('page:', x, '----------')
        print(quote.text)
    browser.links.find_by_partial_text('Next').click()

page: 1 ----------
“I love you without knowing how, or when, or from where. I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.”
page: 1 ----------
“For every minute you are angry you lose sixty seconds of happiness.”
page: 1 ----------
“If you judge people, you have no time to love them.”
page: 1 ----------
“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”
page: 1 ----------
“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”
page: 1 ----------
“Today you are You, that is truer than true. There is no one alive who is Youer than You.”
page: 1 ----------
“If you want your children to be intelligent, read them fairy tales. If y