# Web Scraping Basic

**Web scraping** is a popular way to gather data online because it offers an inexpensive alternative to the traditional means of extracting information. With the bountiful amounts of data online it is essential to learn how to complie your own data cheaply and efficiently.

### There are three basic steps of web scraping:
1. Sending a get query to the website
2. Website returns an HTML based document containing all of the website's information.
3. Parsing the HTML document making it more navigable and extracting the data we actually want.

### Libraries:
1. **Beautiful Soup**: Creates a parsed and navigable html document
2. **lxml**: Uses for processing the HTML
3. **requests**: Creates the get query to the website


## Part I - Creating a Request and Parsing

In [1]:
# Import the dependencies
from bs4 import BeautifulSoup
import requests

In [2]:
# Setting the URL for scraping
url = 'http://quotes.toscrape.com'

In [3]:
# Create a request for the HTML document
response = requests.get(url)

In [4]:
# Parsing responses text attribute using lxml parser
soup = BeautifulSoup(response.text, 'lxml')

In [5]:
# Print the HTML document
print(soup)

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="t

## Part II - Exploring HTML Structure

In the previous section, we have pulled up all the HTML from quotes to scrape. Now it's time to zone in on the specific data we are looking to capture.

### Understanding HTML Structure:

#### Quotes to Scrape Web Page:

Go to the Quotes to Scrape website: http://quotes.toscrape.com/

<div>
    <img src="images/Quotes_to_Scrape.png"/>
</div>

#### Inspect the Web Page:

Right click anywhere on the page and select "**Inspect**".

<div>
    <img src="images/Quotes_to_Scrape_inspect.png"/>
</div>

#### Check the HTML Document:

Right click on any item on the page and select "**Inspect**" for the specific item. When hover the HTML code, the item on the web page will be highlighted.

<div>
    <img src="images/Quotes_to_Scrape_source.png"/>
</div>

**HTML** stands for **hyertext markup language**, which works by categorizing different elements of the HTML document with specific tags. HTML has many different tags but a general skeleton layout involves three basic ones, an **HTML**, a **head**, and a **body**, which helps organize the HTML document. For web scraping, we will be mostly focused on the information within the **body** tag.

In our examples, most of the information is comprised within the **span** and **div** tags. These tags are usually used for sectioning pieces of information. Additionally, most modern websites partner these tags with a CSS document, which is defined by the class atrribute.

## Part III - Isolate Data

#### Example:

Suppose we are interested to extract all the quotes, author name, and tags from each box, we need to first find the tag in the HTML document that contains the information that we need. Then, we can extract them out from the document.

<div>
    <img src="images/Quotes_to_Scrape_1.png"/>
</div>

To find the **quotes** from the HTML document, we inspect the page and we can find the quote is with the **span** tag in the class text. We are able to obtain justthe quote information by using the **fina_all()** function. The return object is a list of all the elements on the HTML website with the tag, **span** and the class, **text**.

In [6]:
# Extracting the elements with tag, span and the class, text from the HTML documents
quotes = soup.find_all('span', class_='text')
print(quotes)

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>, <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>, <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>, <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>, <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>, <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>, <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</spa

As you can see from the result, for each element, there are some extra HTML information that we don't need. From here, we are going to remove the HTML by just printing the text property.

In [7]:
# Create a loop to extract just the quote from quotes
for quote in quotes:
    print(quote.text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”


Let's go ahead and grab the author's names using the find_all() function. The steps are pretty much the same. First of all, we inspect the authors and we can see that their names are the text property within the small tag an the class, author. Then we extract the information with the find_all() function.

In [8]:
# Extracting the elements with tag, small and the class, author from the HTML documents
authors = soup.find_all('small', class_='author')
print(authors)

[<small class="author" itemprop="author">Albert Einstein</small>, <small class="author" itemprop="author">J.K. Rowling</small>, <small class="author" itemprop="author">Albert Einstein</small>, <small class="author" itemprop="author">Jane Austen</small>, <small class="author" itemprop="author">Marilyn Monroe</small>, <small class="author" itemprop="author">Albert Einstein</small>, <small class="author" itemprop="author">André Gide</small>, <small class="author" itemprop="author">Thomas A. Edison</small>, <small class="author" itemprop="author">Eleanor Roosevelt</small>, <small class="author" itemprop="author">Steve Martin</small>]


In [9]:
# Create a loop to print the quote and author's name
for i in range(0, len(quotes)):
    print(quotes[i].text)
    print(authors[i].text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.”
Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.”
André Gide
“I have not failed. I've just found 10,000 ways that won't work.”
Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Eleanor Roosevelt
“A day witho

Finally, we are using the same methodology as before to extract the tags for each quote. Let's inspect the tags and see where they are in the HTML document. We find each tag is contained with an **a** tag and the **class**, "tag". Furthermore, each quote can have multiple tags, so maybe grabbing this element isn't the best option. 
    
Instead of grabbing these sections, we would grab the **div** tag and **class** tag section. Each quote only has one tag's class, making this fit nicely with the existing loop.

In [10]:
# Extracting the elements with tag, div and the class, tags from the HTML documents
tags = soup.find_all('div', class_='tags')
print(tags)

[<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>, <div class="tags">
            Tags:
            <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
<a class="tag" href="/tag/abilities/page/1/">abilities</a>
<a class="tag" href="/tag/choices/page/1/">choices</a>
</div>, <div class="tags">
            Tags:
            <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="tag" href="/tag/life/page/1/">life</a>
<a class="tag" href="/tag/live/page/1/">live</a>
<a class="tag" href="/tag/miracle/page/1/">miracle</a>
<a class="tag"

In [11]:
# Create a loop to print the quote, author's name, tags
for i in range(0, len(quotes)):
    print("Quote: ", quotes[i].text)
    print("Author: ", authors[i].text)
    quoteTags = tags[i].find_all('a', class_='tag')
    for quoteTag in quoteTags:
        print("Tags: ", quoteTag.text)

Quote:  “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author:  Albert Einstein
Tags:  change
Tags:  deep-thoughts
Tags:  thinking
Tags:  world
Quote:  “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author:  J.K. Rowling
Tags:  abilities
Tags:  choices
Quote:  “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author:  Albert Einstein
Tags:  inspirational
Tags:  life
Tags:  live
Tags:  miracle
Tags:  miracles
Quote:  “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author:  Jane Austen
Tags:  aliteracy
Tags:  books
Tags:  classic
Tags:  humor
Quote:  “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author:  Marilyn Monroe
Tags:  be-yourself
Tags:  inspirational
Quote:  “Try not to be

## Part IV - Preparing for Paginated Scraping

In practice, it's rarely the case to scrape data from a single page. Collecting data from multiple pages is an especially common task in web scraping. Online retailers, for instance, typically like to use umlti-page setups, as they can offer more products than the traditional single-page websites. To practice this technique, we are utilizing this amazing resource created by Michael Yen, the [Scraping Club](https://scrapingclub.com/).
<br>
<br>
<div>
    <img src="images/Scraping_Club.png"/>
</div>
<br>
<br>
We use **Exercise 3** to presents a sample e-commerce store and we can see by the pagination at the bottom of the site that all items cover multiple pages. Before we diveinto moving between pages with Python, we, first, set up a single-page web scraper and build from there.

In [12]:
# Import the dependencies
from bs4 import BeautifulSoup
import requests

In [13]:
# Assigne the URL path
url = 'https://scrapingclub.com/exercise/list_basic/?page=1'

In [14]:
# Create a request to the web page
response = requests.get(url)

In [15]:
# Parsing the HTML document
soup = BeautifulSoup(response.text, 'lxml')

Now we have HTML document from the website, we use the inspector tool to explore the page and find the item's name and price. The information is housed within the **div** tag in the **class**, "col-lg-4 col-md-6 mb-4", so we can start by looking for the information from this tags.
<br>
<br>
<div>
    <img src="images/Scraping_Club_1.png"/>
</div>
<br>
<br>
<br>
<br>
<div>
    <img src="images/Scraping_Club_2.png"/>
</div>
<br>
<br>

In [16]:
# Find the item from the HTML document
items = soup.find_all('div', class_='col-lg-4 col-md-6 mb-4')

In [17]:
# Iterate each item and extract the item's name and price
count = 1
for i in items:
    itemName = i.find('h4', class_='card-title').text.strip('\n')
    itemPrice = i.find('h5').text
    print('%s) Price: %s, Item Name: %s' % (count, itemPrice, itemName))
    count = count + 1

1) Price: $24.99, Item Name: Short Dress
2) Price: $29.99, Item Name: Patterned Slacks
3) Price: $49.99, Item Name: Short Chiffon Dress
4) Price: $59.99, Item Name: Off-the-shoulder Dress
5) Price: $24.99, Item Name: V-neck Top
6) Price: $49.99, Item Name: Short Chiffon Dress
7) Price: $24.99, Item Name: V-neck Top
8) Price: $24.99, Item Name: V-neck Top
9) Price: $59.99, Item Name: Short Lace Dress


## Part V - Scraping Paginated Content

We successfully scraped a single page on the exercise web page. It's time to take this to the next level and explore scraping data from multiple pages. We need to first go back to the website and find the page button.
<br>
<br>
<div>
    <img src="images/Scraping_Club_3.png"/>
</div>
<br>
<br>


In [18]:
# Find the page buttons
pages = soup.find('ul', class_='pagination')

In [19]:
# Create an empty list for storing the page urls
urls = []

In [20]:
# Find all the links to different pages
links = pages.find_all('a', class_='page-link')

In [21]:
# Iterate through all the link element
for link in links:
    # Find the page number through each link
    pageNum = int(link.text)if link.text.isdigit() else None
    # Append the url to the list if it exists
    if pageNum != None:
        x = link.get('href')
        urls.append(x)

In [22]:
# Check the urls
print(urls)

['?page=2', '?page=3', '?page=4', '?page=5', '?page=6', '?page=7']


In [23]:
# Add the page url to the original url
# Print all item names and prices from each page
for i in urls:
    newUrl = url + i
    response = requests.get(newUrl)
    soup = BeautifulSoup(response.text, 'lxml')
    items = soup.find_all('div', class_='col-lg-4 col-md-6 mb-4')
    count = 1
    for i in items:
        itemName = i.find('h4', class_='card-title').text.strip('\n')
        itemPrice = i.find('h5').text
        print('%s) Price: %s, Item Name: %s' % (count, itemPrice, itemName))
        count = count + 1

1) Price: $24.99, Item Name: Short Dress
2) Price: $29.99, Item Name: Patterned Slacks
3) Price: $49.99, Item Name: Short Chiffon Dress
4) Price: $59.99, Item Name: Off-the-shoulder Dress
5) Price: $24.99, Item Name: V-neck Top
6) Price: $49.99, Item Name: Short Chiffon Dress
7) Price: $24.99, Item Name: V-neck Top
8) Price: $24.99, Item Name: V-neck Top
9) Price: $59.99, Item Name: Short Lace Dress
1) Price: $24.99, Item Name: Short Dress
2) Price: $29.99, Item Name: Patterned Slacks
3) Price: $49.99, Item Name: Short Chiffon Dress
4) Price: $59.99, Item Name: Off-the-shoulder Dress
5) Price: $24.99, Item Name: V-neck Top
6) Price: $49.99, Item Name: Short Chiffon Dress
7) Price: $24.99, Item Name: V-neck Top
8) Price: $24.99, Item Name: V-neck Top
9) Price: $59.99, Item Name: Short Lace Dress
1) Price: $24.99, Item Name: Short Dress
2) Price: $29.99, Item Name: Patterned Slacks
3) Price: $49.99, Item Name: Short Chiffon Dress
4) Price: $59.99, Item Name: Off-the-shoulder Dress
5) Pri