# Scraping from a Web Page with Python

Scraping a website comes down to two tasks: making a request from Python and parsing the HTML that is returned from each page. Each task has its own Python library: `requests` for the first and `bs4` for the second.

### Requests Library

The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making HTTP requests within Python. The interface is very simple. Call the function named after the HTTP method you need, which will mostly be `requests.get`, with the URL and any optional parameters you'd like passed through the request. The call returns a response object that makes the results of the request available via attributes and methods.

In [1]:
import requests
fun_cheap = 'http://sf.funcheap.com'
# Build the URL of the events page for one day from the base URL
r = requests.get(fun_cheap + '/2018/03/25/')

In [2]:
r.text[:1000] # First 1000 characters of the HTML

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="https://www.w3.org/1999/xhtml" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec"  xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">\n\n<head profile="https://gmpg.org/xfn/11">\n\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n\n\n<title>Events for March 25, 2018 Archives - Funcheap</title>\n\n<meta name="generator" content="WordPress" /> <!-- leave this for stats -->\n\n<link rel="stylesheet" href="https://cdn.funcheap.com/wp-content/themes/arthemia-premium/style.css?v=1.8.15" type="text/css" media="screen" />\n<link rel="stylesheet" href="https://cdn.funcheap.com/wp-content/themes/arthemia-premium/madmenu.css?v=1.1" type="text/css" media="screen" />\n<!--[if IE 6]>\n    <style type="text/css">\n    body {\n        behavi
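To make the "URL and optional parameters" part more concrete, here is a minimal sketch. It first builds a request without sending it, just to show how a `params` dictionary ends up in the URL, and then wraps `requests.get` in a small helper. The `s` query parameter and the `fetch_html` helper are assumptions made for illustration; they are not part of the Funcheap site or of the `requests` library.

```python
import requests

# Build the request without sending it, to see how `params` is encoded.
# The `s` parameter is a made-up example, not a documented Funcheap option.
prepared = requests.Request(
    'GET',
    'http://sf.funcheap.com/2018/03/25/',
    params={'s': 'music'},
).prepare()
print(prepared.url)


def fetch_html(url, params=None, timeout=10):
    """Hypothetical helper: return the HTML of `url`, failing loudly on HTTP errors."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

Calling `fetch_html('http://sf.funcheap.com/2018/03/25/')` should return the same kind of string as `r.text` above, assuming the page is still reachable.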

### Getting Info from a Web Page

Now that we have easy access to the HTML of a web page, we need some way to pull the desired content from it. Luckily, there is already a system in place to do this. With a combination of HTML and CSS selectors we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

## Element Parent / Child Relationships

<img src="http://www.htmlgoodies.com/img/2007/06/flowChart2.gif" width="250">

**Elements begin with an opening tag and end with a matching closing tag, like so:**  `<p></p>`

**Elements can have parents and children:**

```html
<body>
    <div>I am inside the parent element
        <div>I am inside a child element</div>
        <div>I am inside another child element</div>
        <div>I am inside yet another child element</div>
    </div>
</body>
```
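The same parent/child relationships can be explored from Python. The sketch below parses a shortened copy of the snippet above with BeautifulSoup; the `id="parent"` attribute is added here only to make the outer `<div>` easy to find and is not part of the original example.

```python
from bs4 import BeautifulSoup

family_html = '''
<body>
    <div id="parent">I am inside the parent element
        <div>I am inside a child element</div>
        <div>I am inside another child element</div>
    </div>
</body>
'''
family_soup = BeautifulSoup(family_html, 'html.parser')
parent = family_soup.find('div', attrs={'id': 'parent'})

# recursive=False restricts the search to direct children of `parent`
children = parent.find_all('div', recursive=False)
child_texts = [child.get_text(strip=True) for child in children]
print(child_texts)

# Every tag keeps a reference to its parent
print(children[0].parent is parent)
print(parent.parent.name)
```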

<a id='attributes'></a>

## Element Attributes

Elements can also have attributes!  Attributes are defined inside an element's **opening tag** and can contain data that may be useful to scrape.

```html
<a href="http://lmgtfy.com/?q=html+element+attributes" title="A title" id="web-link" name="hal">A Simple Link</a>
```

The **element attributes** of this `<a>` tag element are:
- id
- href
- title
- name

This `<a>` tag example will render in your browser like this:
> <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">A Simple Link</a>
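As a small sketch of how these attributes can be read once the tag is parsed, BeautifulSoup lets you treat a tag like a dictionary. The `target` attribute looked up below is deliberately one the link does not have, to show the difference between `tag[...]` and `tag.get(...)`.

```python
from bs4 import BeautifulSoup

link_html = ('<a href="http://lmgtfy.com/?q=html+element+attributes" '
             'title="A title" id="web-link" name="hal">A Simple Link</a>')
link = BeautifulSoup(link_html, 'html.parser').find('a')

href = link['href']               # raises KeyError if the attribute is missing
title = link.get('title')         # returns None if the attribute is missing
missing = link.get('target')      # this link has no `target` attribute
attr_names = sorted(link.attrs)   # .attrs is a dict of all attributes
text = link.get_text()            # the text between the opening and closing tags

print(href, title, missing, attr_names, text)
```

Note that `link.name` is the tag name (`'a'`), so the HTML `name` attribute has to be read with `link['name']`.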


In [3]:
html = '''
<!DOCTYPE html>
<html>

<head>
  <title>The title of this web page</title>
</head>

<body>
  <h1>My Photos</h1>
  <div class='intro'>
    <p>These are some photos of my trips.</p>
    <img src="me.png">
  </div>

  <h3>Italy</h3>
  <div class='country' id='venice'>
    <img src="venice1.png" alt="Venice"> <br />
    <img src="venice2.png" alt="Venice"> <br />
    <img src="rome.png" alt="Roma">
  </div>

  <h3>Germany</h3>
  <div class='country'>
    <img src="berlin.png" alt="Berlin">
  </div>
</body>

</html>
'''

# Using CSS selectors with BeautifulSoup

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

## Methods `find` vs `find_all`

In [5]:
soup.find('div', attrs={'class':'country'})

<div class="country" id="venice">
<img alt="Venice" src="venice1.png"/> <br/>
<img alt="Venice" src="venice2.png"/> <br/>
<img alt="Roma" src="rome.png"/>
</div>

In [6]:
soup.find_all('div', attrs={'class':'country'})

[<div class="country" id="venice">
 <img alt="Venice" src="venice1.png"/> <br/>
 <img alt="Venice" src="venice2.png"/> <br/>
 <img alt="Roma" src="rome.png"/>
 </div>, <div class="country">
 <img alt="Berlin" src="berlin.png"/>
 </div>]

In [7]:
soup.find_all('div', attrs={'class':'country', 'id':'venice'})

[<div class="country" id="venice">
 <img alt="Venice" src="venice1.png"/> <br/>
 <img alt="Venice" src="venice2.png"/> <br/>
 <img alt="Roma" src="rome.png"/>
 </div>]
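The same lookups can also be written as CSS selectors with BeautifulSoup's `select` and `select_one` methods. The sketch below uses a shortened copy of the photo page so that it runs on its own; the variable names are chosen here to avoid overwriting `html` and `soup` from the cells above.

```python
from bs4 import BeautifulSoup

photos_html = '''
<div class="intro"><img src="me.png"></div>
<div class="country" id="venice">
  <img src="venice1.png" alt="Venice">
  <img src="rome.png" alt="Roma">
</div>
<div class="country">
  <img src="berlin.png" alt="Berlin">
</div>
'''
photos_soup = BeautifulSoup(photos_html, 'html.parser')

# 'div.country' matches every <div> whose class is "country" (like find_all)
countries = photos_soup.select('div.country')

# 'div#venice' matches the <div> whose id is "venice"; select_one returns the first match
venice = photos_soup.select_one('div#venice')

# A space means "descendant of": every <img> inside a country <div>
country_imgs = [img['src'] for img in photos_soup.select('div.country img')]
print(len(countries), venice['id'], country_imgs)
```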

## Sibling methods

In [8]:
soup.find('h1').find_next_siblings()

[<div class="intro">
 <p>These are some photos of my trips.</p>
 <img src="me.png"/>
 </div>, <h3>Italy</h3>, <div class="country" id="venice">
 <img alt="Venice" src="venice1.png"/> <br/>
 <img alt="Venice" src="venice2.png"/> <br/>
 <img alt="Roma" src="rome.png"/>
 </div>, <h3>Germany</h3>, <div class="country">
 <img alt="Berlin" src="berlin.png"/>
 </div>]

In [9]:
soup.find('h3').find_previous_siblings()

[<div class="intro">
 <p>These are some photos of my trips.</p>
 <img src="me.png"/>
 </div>, <h1>My Photos</h1>]
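One place where sibling navigation is handy is pairing a heading with the block that follows it. The sketch below uses the singular `find_next_sibling` on a shortened copy of the photo page to collect the image files under each country heading; the dictionary layout is just one possible way to organise the result.

```python
from bs4 import BeautifulSoup

trip_html = '''
<h3>Italy</h3>
<div class="country"><img src="venice1.png"><img src="rome.png"></div>
<h3>Germany</h3>
<div class="country"><img src="berlin.png"></div>
'''
trip_soup = BeautifulSoup(trip_html, 'html.parser')

photos_by_country = {}
for heading in trip_soup.find_all('h3'):
    # First <div> that follows this heading at the same level of the tree
    gallery = heading.find_next_sibling('div')
    photos_by_country[heading.get_text()] = [img['src'] for img in gallery.find_all('img')]

print(photos_by_country)
```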

## The `.next_element` attribute

The `.next_element` attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as `.next_sibling`, but it’s usually drastically different.

In [10]:
soup.find('h1').next_element.next_element.next_element

<div class="intro">
<p>These are some photos of my trips.</p>
<img src="me.png"/>
</div>
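To see why three hops are needed to get from the `<h1>` to the `<div>`, it can help to look at each step on its own. The sketch below uses a two-line stand-in for the photo page: the first hop lands on the text inside the `<h1>`, the second on the whitespace between the two tags, and only the third on the `<div>`.

```python
from bs4 import BeautifulSoup

steps_html = '<h1>My Photos</h1>\n<div class="intro"><p>Trips</p></div>'
steps_soup = BeautifulSoup(steps_html, 'html.parser')
h1 = steps_soup.find('h1')

first = h1.next_element       # the string inside the <h1>
second = first.next_element   # the newline between </h1> and <div>
third = second.next_element   # the <div> tag itself

print(repr(first), repr(second), third.name)

# .next_sibling skips the contents of the <h1> but still sees the whitespace string
print(repr(h1.next_sibling))

# find_next_sibling() only considers tags, so it goes straight to the <div>
print(h1.find_next_sibling().name)
```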

## Getting CSS selector information on a webpage


1) Go to the page you want to scrape  
2) Open the inspector tool (right click + Inspect, or cmd + alt + i)  
3) Click on the element-picker icon (the one with the mouse pointer): ![image.png](attachment:image.png)  
4) Select the element on the page you want to scrape. The matching HTML element will be shown in the inspector panel on the right

**Note:** You need to repeat these steps for each element you want to scrape.
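Once the inspector has shown you the tag and class names of an element, they can be turned into a CSS selector and passed to BeautifulSoup. The sketch below assumes a hypothetical page where each event sits in a `<div class="event">` with its title in a `<span class="title">`; these class names and the `extract_text` helper are invented for illustration, so replace them with whatever the inspector shows on the real page.

```python
from bs4 import BeautifulSoup


def extract_text(page_html, selector):
    """Hypothetical helper: stripped text of every element matching a CSS selector."""
    page_soup = BeautifulSoup(page_html, 'html.parser')
    return [element.get_text(strip=True) for element in page_soup.select(selector)]


# Stand-in for HTML fetched with requests; the markup and titles are made up
sample_page = '''
<div class="event"><span class="title">Free Concert</span></div>
<div class="event"><span class="title">Museum Day</span></div>
'''
titles = extract_text(sample_page, 'div.event span.title')
print(titles)
```

On a live page you would pass `r.text` from a `requests.get` call instead of `sample_page`.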