## Basics of web scraping


<a id='introduction'></a>

![What is HTML?](http://designshack.designshack.netdna-cdn.com/wp-content/uploads/htmlbasics-0.jpg)

One of the largest sources of data in the world is all around us — the web. Most people consume the web in some form every day. One of the most powerful Python tool sets we'll learn allows us to extract and normalize data from unstructured sources such as web pages.  

**If you can see it, it can be scraped, mined, and put into a DataFrame.**

Before we begin the actual process of web scraping with Python, it's important to cover the basic constructs that describe HTML as unstructured data. 

We'll then cover a powerful selection technique called XPath and look at a basic workflow using a framework called [Scrapy](http://www.scrapy.org).

<a id='html'></a>

## Hypertext Markup Language (HTML)

---

In the HTML document object model (DOM), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * Text inside an HTML element is a text node.
 * Comments are comment nodes.
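
The node types listed above can be inspected with Python's standard-library `xml.dom.minidom`. This is only a minimal sketch: `minidom` is an XML parser, so the small well-formed fragment below (invented for this illustration) stands in for real HTML, and Beautiful Soup represents nodes with its own classes.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# A small, well-formed fragment made up for this illustration.
doc = parseString('<p id="intro">Hello<!-- a comment --></p>')

paragraph = doc.documentElement              # the <p> element
id_attr = paragraph.getAttributeNode("id")   # the id="intro" attribute
text, comment = paragraph.childNodes         # the text "Hello" and the comment

# Check each node against the DOM node-type constants.
node_types = {
    "document": doc.nodeType == Node.DOCUMENT_NODE,
    "element": paragraph.nodeType == Node.ELEMENT_NODE,
    "attribute": id_attr.nodeType == Node.ATTRIBUTE_NODE,
    "text": text.nodeType == Node.TEXT_NODE,
    "comment": comment.nodeType == Node.COMMENT_NODE,
}
print(node_types)
```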

<a id='elements'></a>
### Elements
Elements begin and end with opening and closing "tags," each written as a tag name enclosed in angle brackets. The name in the closing tag, which is prefixed with a slash, must match the name in the opening tag.

```html
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

When you have several different titles or paragraphs on a single page, you can assign `id` attributes to elements to create unique reference points. Each ID value should appear only once per page. IDs are also useful for labelling nested elements.
```html
<title id ='title_1'>I am the first title.</title>
<p id ='para_1'>I am the first paragraph.</p>
<title id ='title_2'>I am the second title.</title>
<p id ='para_2'>I am the second paragraph.</p>
```
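
As a rough sketch of why IDs are handy when scraping, the snippet below looks up one paragraph by its `id` using Beautiful Soup (introduced later in this lesson). It assumes `beautifulsoup4` is installed and uses Python's built-in `html.parser`. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_doc = """
<p id="para_1">I am the first paragraph.</p>
<p id="para_2">I am the second paragraph.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find(id=...) returns the first element with that id, or None if there is none.
second = soup.find(id="para_2")
second_text = second.get_text()
print(second_text)
```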


**Elements can have parents and children.**
It's important to remember that an element can be both a parent and a child — whether an element is referred to as a parent or child depends on the specific element you're referencing.


```html
<body id = 'parent'>
    <div id = 'child_1'>I am the child of 'parent.'
        <div id = 'child_2'>I am the child of 'child_1.'
            <div id = 'child_3'>I am the child of 'child_2.'
                <div id = 'child_4'>I am the child of 'child_3.'</div>
            </div>
        </div>
    </div>
</body>
```
**or**
```html
<body id = 'parent'>
    <div id = 'child_1'>I am the parent of 'child_2.'
        <div id = 'child_2'>I am the parent of 'child_3.'
            <div id = 'child_3'> I am the parent of 'child_4.'
                <div id = 'child_4'>I am not a parent. </div>
            </div>
        </div>
    </div>
</body>
```
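
The parent/child relationships above can be explored in code. The following is a minimal sketch, assuming `beautifulsoup4` is installed. It starts from `child_3` and looks both up and down the tree.

```python
from bs4 import BeautifulSoup

html_doc = """
<body id="parent">
    <div id="child_1">
        <div id="child_2">
            <div id="child_3">
                <div id="child_4">Innermost</div>
            </div>
        </div>
    </div>
</body>
"""

soup = BeautifulSoup(html_doc, "html.parser")
child_3 = soup.find(id="child_3")

# Looking up: the direct parent, then every enclosing <div> (nearest first).
parent_id = child_3.parent["id"]
ancestor_ids = [tag["id"] for tag in child_3.find_parents("div")]

# Looking down: only the direct <div> children (recursive=False).
child_ids = [tag["id"] for tag in child_3.find_all("div", recursive=False)]

print(parent_id, ancestor_ids, child_ids)
```

Here `child_3` is a child of `child_2` and, at the same time, the parent of `child_4`.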

<a id='attributes'></a>
### Attributes

HTML elements can also have attributes. They describe the properties and characteristics of elements. Some affect how the element behaves or looks in terms of the output rendered by the browser.

One of the most common elements is the anchor element. Anchor elements usually have an "href" attribute, which tells the browser where to go when the link is clicked. Browsers typically render an anchor in a distinct color, often underlined, as a visual cue to differentiate it.

**The markup that describes an element with attributes literally looks like this:**

```html
<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>
```

**However, this element, once rendered, looks like this:**

[An Awesome Website](https://www.youtube.com/watch?v=dQw4w9WgXcQ)
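
When scraping, attributes are usually what we want to pull out of an element. A minimal sketch, assuming `beautifulsoup4` is installed, of reading the attributes of the anchor shown above:

```python
from bs4 import BeautifulSoup

html_doc = '<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.find("a")

all_attrs = link.attrs          # dictionary of every attribute on the element
href = link["href"]             # dictionary-style access; KeyError if missing
title = link.get("title")       # .get() returns None instead of raising
label = link.get_text()         # the text between the opening and closing tags

print(all_attrs, label)
```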

<a id='element-hierarchy'></a>
### Element Hierarchy

![Nodes](http://www.computerhope.com/jargon/d/dom1.jpg)

**Literally represented as:**

```html
<html>
    
    <head>
        <title>Example</title>
    </head>
    
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
    
</html>
```
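
To see the hierarchy programmatically, the sketch below prints an indented outline of the same document using only the standard-library `html.parser`. `OutlineParser` is a hypothetical helper written for this illustration. It assumes every opening tag has a matching closing tag, so void elements such as `<br>` would throw off the indentation.

```python
from html.parser import HTMLParser

html_doc = """
<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
</html>
"""

class OutlineParser(HTMLParser):
    """Record each tag name, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("    " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

parser = OutlineParser()
parser.feed(html_doc)
print("\n".join(parser.outline))
```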

<a id='html-resources'></a>
### Additional HTML Resources

Read all about the different elements supported by modern browsers:
 * [HTML5 cheat sheet](http://websitesetup.org/html5-cheat-sheet/).
 * [Mozilla HTML element reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).
 * [HTML5 visual cheat sheet](http://www.unitedleather.biz/PDF/HTML5-Visual-Cheat-Sheet1.pdf).
 

<a id='practical'></a>

## Using Requests and Beautiful Soup to Extract Information From a Web Page

---

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with several parsers, such as `lxml` and Python's built-in `html.parser`, and can be run interactively in a notebook or IDE, which makes it easier to work with when first extracting information from HTML.

Please make sure that the required packages are installed: 

```bash
# Beautiful Soup:
> conda install beautifulsoup4
> conda install lxml

# Or if conda doesn't work:
> pip install beautifulsoup4
> pip install lxml
```

In [1]:
from bs4 import BeautifulSoup

In [11]:
# Open the file in a with-block so it is closed once parsing is done.
with open("sample.html") as html_file:
    soup = BeautifulSoup(html_file, "lxml")

In [41]:
# Display the parsed document.
soup

<!DOCTYPE html>
<html>
<head>
<title>Hello, World!</title>
</head>
<body>
<h1>Header 1</h1>
<h2>Header 2</h2>
<p>This is a paragraph</p>
<a href="https://www.google.com/">Google it!</a>
<h3>What's in a div?</h3>
<div class="divvy-it-up" id="foobar">
<p id="layer1">I'm in a div.  Yeah!</p>
<div>
<p id="layer2">I'm in a div, too!</p>
</div>
</div>
<div class="todo">
<ul>
<li> Take out trash</li>
<li> Walk dog</li>
</ul>
</div>
<div class="something">
<ol>
<li>One</li>
<li>Two</li>
</ol>
</div>
</body>
</html>

In [7]:
print(soup.title)
print(soup.title.text)

<title>Hello, World!</title>
Hello, World!


In [42]:
olist = soup.find('ol')
for list_item in olist.find_all('li'):
    print(list_item.text)

One
Two


In [27]:
soup.find_all('a')[0]['href']

'https://www.google.com/'

In [36]:
div_results = soup.body.find_all('div', {'class':'divvy-it-up'})
div_results[0].find_all('p', {'id':'layer2'})

[<p id="layer2">I'm in a div, too!</p>]
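
As an alternative to passing attribute dictionaries to `find_all`, Beautiful Soup also accepts CSS selectors through `select` and `select_one`. The sketch below repeats part of the sample document inline so it runs on its own. It assumes a recent `beautifulsoup4` (4.7 or later, which installs the `soupsieve` selector package).

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="divvy-it-up" id="foobar">
<p id="layer1">I'm in a div.  Yeah!</p>
<div>
<p id="layer2">I'm in a div, too!</p>
</div>
</div>
<div class="todo">
<ul>
<li> Take out trash</li>
<li> Walk dog</li>
</ul>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# "div.divvy-it-up p#layer2": a <p id="layer2"> anywhere inside that div.
layer2 = soup.select("div.divvy-it-up p#layer2")

# select() returns a list; select_one() returns the first match or None.
todo_items = [li.get_text(strip=True) for li in soup.select("div.todo li")]
first_item = soup.select_one("div.todo li")

print(layer2[0].get_text(), todo_items)
```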

### Let's buy a car!



![](assets/craigslist.jpg)

https://atlanta.craigslist.org/atl/wan/d/cheap-or-free-running-car-or/6485015376.html

<a id='step1'></a>
### 1) Fetch the content by URL.


In [44]:
# requests fetches the page over HTTP; bs4 then parses the HTML it returns.
import requests
from bs4 import BeautifulSoup

# Target web page:
url = "https://atlanta.craigslist.org/atl/wan/d/cheap-or-free-running-car-or/6485015376.html"

# Establishing the connection to the web page:
response = requests.get(url)

# You can use status codes to understand how the target server responds to your request.
# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print('Status Code: ',response.status_code)

# response.text holds the response body decoded into a Python string.
html = response.text

# The first 700 characters of the content.
print("\nFirst part of HTML document fetched as string:\n")
print(html[:700])

Status Code:  200

First part of HTML document fetched as string:

<!DOCTYPE html>
<html class="no-js">
<head>
<title>Cheap or free running car or truck - wanted - by owner - sale</title>
    	<link rel="canonical" href="https://atlanta.craigslist.org/atl/wan/d/cheap-or-free-running-car-or/6485015376.html">
	<meta name="description" content="Hi, I am looking for someone who would like to sell or give away a free car or truck. I have a vehicle its been down for sometimes now and will cost to much to fix. I do not have the extra income to...">
	<meta name="robots" content="noarchive,nofollow,unavailable_after: 29-Mar-18 20:26:39 EDT">
	<meta name="twitter:card" content="preview">
	<meta property="og:description" content="Hi, I am looking for someone who would


More information on [request status codes](http://www.restapitutorial.com/httpstatuscodes.html).
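
The status codes mentioned above can also be looked up in Python's standard-library `http.HTTPStatus`. The helper `describe_status` below is a hypothetical name introduced for this sketch and is not part of Requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know about.
        phrase = "Unknown"

    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational or non-standard"
    return f"{code} {phrase} ({category})"

for code in (200, 400, 403, 404):
    print(describe_status(code))
```

In a scraper, you would typically pass `response.status_code` to a check like this before trying to parse `response.text`.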

<a id='step2'></a>
### 2) Parse the HTML document with Beautiful Soup.

This step lets us search and navigate the elements of the document with Beautiful Soup methods such as `find` and `find_all`.

In [45]:
soup = BeautifulSoup(html, 'lxml')

In [46]:
# Singular element:
soup.html.title

<title>Cheap or free running car or truck - wanted - by owner - sale</title>

In [47]:
# Just the text between elements:
print(soup.html.title.text)

Cheap or free running car or truck - wanted - by owner - sale


In [49]:
# find_all returns a list of every matching element.
# First parameter: the tag name. Second parameter: the attributes to match.
element = soup.find_all("a", {"class": "header-logo"})
element[0]

<a class="header-logo" href="/" name="logoLink">CL</a>

In [50]:
price_search = soup.find_all('span', {"class": "price"})
price_search[0].text

'$200'
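
Scraped prices arrive as strings, so they need converting before any arithmetic. The helper below is a minimal sketch using only the standard library. `parse_price` is a hypothetical name, and it assumes whole-dollar amounts (a value like `$7999.50` would not be handled correctly).

```python
import re

def parse_price(text):
    """Convert a price string such as '$4,180' to an int, or None if it has no digits."""
    if text is None:
        return None
    # Drop everything except digits: '$', thousands separators, whitespace.
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None

print(parse_price("$200"))
print(parse_price("$4,180"))
```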

In [51]:
# What about all car listings in ATL:
response = requests.get("https://atlanta.craigslist.org/search/cto")

In [67]:
soup = BeautifulSoup(response.text, "lxml")
result_list = soup.find_all('p', {'class':'result-info'})

results = []
for result in result_list:
    car = {}
    car['text'] = result.find('a', {'class':'hdrlnk'}).text
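    # Some listings have no price; strip '$' and thousands separators from the rest.
    price = result.find('span', {'class':'result-price'})
    car['price'] = int(price.text.replace('$','').replace(',','')) if price else None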
    hood = result.find('span', {'class':'result-hood'})
    car['hood'] = hood.text.replace('(','').replace(')','') if hood else None
    results.append(car)

In [68]:
import pandas as pd
pd.DataFrame(results)

|   | hood | price | text |
|---|---|---|---|
| 0 | Snellville | 4180 | 2004 Lexus ES 330 great Condition runs and dri... |
| 1 | Adairsville | 3500 | 1997 Honda crv |
| 2 | Lawrenceville | 800 | 1999 Nissan Quest - parts or mechanic's special |
| 3 | Atl | 2000 | 2002 Ford Explorer |
| 4 | Atl | 2500 | 1999 Lexus es300 one owner |
| 5 | Atl | 2500 | 1998 Honda CR-V |
| 6 | Newnan | 8500 | 2013 Nissan Leaf SV - CPO |
| 7 | Atl | 1700 | 2000 jeep Cherokee |
| 8 |  | 7999 | 2009 BMW 328i $7999 |
| 9 | snellville | 8900 | Toyota Tundra 2008 |
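
Scraped text is rarely consistent. In the results above the same neighborhood appears as both `Snellville` and `snellville`. A minimal sketch of normalizing it with pandas before aggregating, using four `hood`/`price` pairs copied from the table (assumes pandas is installed):

```python
import pandas as pd

# A few hood/price pairs taken from the scraped results above.
rows = [
    {"hood": "Snellville", "price": 4180},
    {"hood": "Atl", "price": 2000},
    {"hood": "Atl", "price": 2500},
    {"hood": "snellville", "price": 8900},
]
df = pd.DataFrame(rows)

# Title-case the neighborhood so 'snellville' and 'Snellville' group together.
df["hood"] = df["hood"].str.title()

mean_price = df.groupby("hood")["price"].mean()
print(mean_price)
```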


**Practice**

- How would you get the next 120 results?
- How would you get the text associated with a particular car?

### Another example: DataTau


In [None]:

# requests fetches the page over HTTP; bs4 then parses the HTML it returns.
import requests
from bs4 import BeautifulSoup

# Target web page:
url = "https://www.datatau.com/"

# Establishing the connection to the web page:
response = requests.get(url)

# You can use status codes to understand how the target server responds to your request.
# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print(response.status_code)

# response.text holds the response body decoded into a Python string.
html = response.text

# Uncomment to print the first 700 characters of the content.
#print(html[:700])

More information on [request status codes](http://www.restapitutorial.com/httpstatuscodes.html).

<a id='step2'></a>
### 2) Parse the HTML document with Beautiful Soup.

This step lets us search and navigate the elements of the document with Beautiful Soup methods such as `find` and `find_all`.

In [None]:
soup = BeautifulSoup(html, 'lxml')

In [None]:
# Collect the title, link, and source domain of each story on DataTau's homepage.

# List to store results
results_list = []

# Get all the <td class="title"... elements
all_td = soup.find_all('td', {'class':'title'})
for element in all_td:
    # start a dictionary to store this item's data
    result = {}
    
    # get the title and full link/url
    a_href = element.find('a')
    if a_href:
        result['title'] = a_href.text   # element text
        result['link'] = a_href['href'] # href link
        
    # get the url domain
    span = element.find('span', {'class':'comhead'})
    if span:
        result['url'] = span.text.strip()[1:-1]
        
    # only store "full" rows of data
    if len(result) == 3:
        results_list.append(result)
        
results_list[0]
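
The code above reads the domain from the page's `comhead` span. As an alternative sketch, the domain can be derived from the link itself with the standard-library `urllib.parse`. `link_domain` is a hypothetical helper, and the host it returns (for example with a leading `www.`) may not match the text the site displays.

```python
from urllib.parse import urlparse

def link_domain(link):
    """Return the host part of an absolute URL, or None for relative links."""
    host = urlparse(link).netloc
    return host or None

print(link_domain("https://www.datatau.com/"))
print(link_domain("item?id=1"))   # a relative link has no host
```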

In [None]:
import pandas as pd

pd.DataFrame(results_list)