# Scraping from a Web Page with Python

Scraping a website comes down to two tasks: making a request from Python and parsing the HTML that is returned for each page. There is a Python library for each of these tasks: `requests` and `bs4`, respectively.

### Requests Library

The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making HTTP requests from Python, and its interface is remarkably simple. Call the function named after the HTTP method you need (this will mostly be `get`) with the URL and any optional parameters you'd like passed along with the request. The `Response` object it returns makes the results of the request available via attributes and methods.

In [7]:
import requests
fun_cheap = 'http://sf.funcheap.com'
r = requests.get(fun_cheap + '/2018/03/25/')

In [8]:
r.text[:1000] # First 1000 characters of the HTML

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="https://www.w3.org/1999/xhtml" lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/" prefix="og: http://ogp.me/ns#">\n\n<head profile="https://gmpg.org/xfn/11">\n<script src="//cdn.optimizely.com/js/195632799.js"></script>\n\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n\n\n<title>Events for March 25, 2018 Archives - Funcheap</title>\n\n<meta name="generator" content="WordPress" /> <!-- leave this for stats -->\n\n<link rel="stylesheet" href="https://cdn.funcheap.com/wp-content/themes/arthemia-premium/style.css?v=1.8.14" type="text/css" media="screen" />\n<link rel="stylesheet" href="https://cdn.funcheap.com/wp-content/themes/arthemia-premium/madmenu.css?v=1.1" type="text/css" media="screen" />\n<!--[if IE 6]>\n    <style type="text/css">\n    body {\n        behavior:url("
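
Optional parameters are passed as a dictionary, and `requests` encodes them into the query string for you. The sketch below prepares a request without sending it, so the resulting URL can be inspected offline. The `s` and `paged` parameter names are assumptions chosen for illustration, not confirmed parts of the Funcheap site.

```python
import requests

# Hypothetical query parameters, used only to show how the URL is built.
base_url = "http://sf.funcheap.com/"
params = {"s": "free concert", "paged": 2}

# Build and prepare the request without sending anything over the network.
prepared = requests.Request("GET", base_url, params=params).prepare()

print(prepared.method)
print(prepared.url)
```

Passing the same dictionary as `requests.get(base_url, params=params)` would actually send the request. On the returned response, `r.status_code` holds the HTTP status and `r.raise_for_status()` raises an exception for a 4xx or 5xx answer.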

### Getting Info from a Web Page

Now that we have easy access to the HTML of a web page, we need some way to pull the desired content from it. Luckily there is already a system in place to do this. With a combination of HTML tags and CSS selectors we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

## Element Parent / Child Relationships

<img src="http://www.htmlgoodies.com/img/2007/06/flowChart2.gif" width="250">

**Elements begin with an opening tag and end with a matching closing tag, like so:**  `<p></p>`

**Elements can have parents and children:**

```html
<body>
    <div>I am inside the parent element
        <div>I am inside a child element</div>
        <div>I am inside another child element</div>
        <div>I am inside yet another child element</div>
    </div>
</body>
```
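
The same parent and child relationships can be walked from Python. A minimal sketch with BeautifulSoup on the snippet above, assuming the built-in `html.parser` is an acceptable parser here:

```python
from bs4 import BeautifulSoup

snippet = """
<body>
    <div>I am inside the parent element
        <div>I am inside a child element</div>
        <div>I am inside another child element</div>
        <div>I am inside yet another child element</div>
    </div>
</body>
"""

soup = BeautifulSoup(snippet, "html.parser")

# The first <div> under <body> is the parent element.
parent = soup.body.div

# recursive=False restricts the search to direct children only.
children = parent.find_all("div", recursive=False)
child_texts = [child.get_text(strip=True) for child in children]

print(child_texts)
# Every child points back to the same parent tag.
print(children[0].parent is parent)
```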

<a id='attributes'></a>

## Element Attributes

Elements can also have attributes!  Attributes are defined inside **element tags** and can contain data that may be useful to scrape.

```html
<a href="http://lmgtfy.com/?q=html+element+attributes" title="A title" id="web-link" name="hal">A Simple Link</a>
```

The **element attributes** of this `<a>` tag element are:
- id
- href
- title
- name
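
Once a tag has been parsed, BeautifulSoup exposes its attributes much like a dictionary. A small sketch on the `<a>` tag above (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

link_html = (
    '<a href="http://lmgtfy.com/?q=html+element+attributes" '
    'title="A title" id="web-link" name="hal">A Simple Link</a>'
)
link = BeautifulSoup(link_html, "html.parser").a

# Dictionary-style access raises KeyError if the attribute is missing.
href = link["href"]
# .get() returns None instead when the attribute is absent.
title = link.get("title")
missing = link.get("class")
# .attrs holds all attributes as a dictionary.
attribute_names = sorted(link.attrs)

print(href, title, missing, attribute_names, link.text)
```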

This `<a>` tag example will render in your browser like this:
> <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">A Simple Link</a>


In [26]:
html = '''
<!DOCTYPE html>
<html>

<head>
  <title>The title of this web page</title>
</head>

<body>
  <h1>My Photos</h1>
  <div class='intro'>
    <p>These are some photos of my trips.</p>
    <img src="me.png">
  </div>

  <h3>Italy</h3>
  <div class='country' id='venice'>
    <img src="venice1.png" alt="Venice"> <br />
    <img src="venice2.png" alt="Venice"> <br />
    <img src="rome.png" alt="Roma">
  </div>

  <h3>Germany</h3>
  <div class='country'>
    <img src="berlin.png" alt="Berlin">
  </div>
  <a href="testing"/>
</body>

</html>
'''

# Using CSS selectors with BeautifulSoup

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [27]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")  # name the parser explicitly so bs4 does not have to guess one


## Methods: `find` vs `find_all`

In [11]:
soup.text

'\n\n\n\nThe title of this web page\n\n\nMy Photos\n\nThese are some photos of my trips.\n\n\nItaly\n\n \n \n\n\nGermany\n\n\n\n\n\n'

In [13]:
soup.find('div', attrs={'class':'country'})

<div class="country" id="venice">
<img alt="Venice" src="venice1.png"/> <br/>
<img alt="Venice" src="venice2.png"/> <br/>
<img alt="Roma" src="rome.png"/>
</div>

In [14]:
soup.find_all('div', attrs={'class':'country'})

[<div class="country" id="venice">
 <img alt="Venice" src="venice1.png"/> <br/>
 <img alt="Venice" src="venice2.png"/> <br/>
 <img alt="Roma" src="rome.png"/>
 </div>, <div class="country">
 <img alt="Berlin" src="berlin.png"/>
 </div>]

In [15]:
soup.find_all('div', attrs={'class':'country', 'id':'venice'})

[<div class="country" id="venice">
 <img alt="Venice" src="venice1.png"/> <br/>
 <img alt="Venice" src="venice2.png"/> <br/>
 <img alt="Roma" src="rome.png"/>
 </div>]
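
The practical difference shows up in the return types: `find` gives back the first matching tag (or `None`), while `find_all` always gives back a list. A minimal sketch on a condensed copy of the page, using the `class_` keyword as a shorthand for `attrs={'class': ...}` and assuming the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

html = """
<div class="intro"><p>These are some photos of my trips.</p></div>
<div class="country" id="venice"><img src="venice1.png" alt="Venice"></div>
<div class="country"><img src="berlin.png" alt="Berlin"></div>
"""
soup = BeautifulSoup(html, "html.parser")

first_country = soup.find("div", class_="country")      # a single Tag
all_countries = soup.find_all("div", class_="country")  # a list of Tags

# "continent" is a class that does not exist in this markup.
no_match = soup.find("div", class_="continent")
no_matches = soup.find_all("div", class_="continent")

print(first_country["id"], len(all_countries), no_match, no_matches)
```

Because `find` can return `None`, it is worth checking the result before chaining further calls onto it.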

## Sibling methods

In [29]:
soup.select('div.intro')

[<div class="intro">
 <p>These are some photos of my trips.</p>
 <img src="me.png"/>
 </div>]

In [16]:
soup.find('h1').find_next_siblings()

[<div class="intro">
 <p>These are some photos of my trips.</p>
 <img src="me.png"/>
 </div>, <h3>Italy</h3>, <div class="country" id="venice">
 <img alt="Venice" src="venice1.png"/> <br/>
 <img alt="Venice" src="venice2.png"/> <br/>
 <img alt="Roma" src="rome.png"/>
 </div>, <h3>Germany</h3>, <div class="country">
 <img alt="Berlin" src="berlin.png"/>
 </div>]

In [17]:
soup.find('h3').find_previous_siblings()

[<div class="intro">
 <p>These are some photos of my trips.</p>
 <img src="me.png"/>
 </div>, <h1>My Photos</h1>]

## The `.next_element` attribute

The `.next_element` attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as `.next_sibling`, but it’s usually drastically different.

In [24]:
soup.find('h1').next_element.next_element.next_element

<div class="intro">
<p>These are some photos of my trips.</p>
<img src="me.png"/>
</div>
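
The chain of three `.next_element` calls is needed because the text inside the `<h1>` and the newline after it are parsed as elements of their own. A small sketch on two shortened, made-up versions of the markup, one without and one with whitespace between the tags:

```python
from bs4 import BeautifulSoup

compact = BeautifulSoup(
    '<h1>My Photos</h1><div class="intro"><p>Hi</p></div>', "html.parser"
)
h1 = compact.h1

# .next_element is whatever was parsed next: the text inside the <h1>.
inner_text = h1.next_element
# .next_sibling skips the contents and moves to the next node at the same level.
sibling = h1.next_sibling

spaced = BeautifulSoup(
    '<h1>My Photos</h1>\n<div class="intro"><p>Hi</p></div>', "html.parser"
)
# With a newline between the tags, that newline is the next sibling,
# and the <div> is only reached on the third .next_element step.
newline = spaced.h1.next_sibling
third = spaced.h1.next_element.next_element.next_element

print(repr(inner_text), sibling.name, repr(newline), third.name)
```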

## Getting CSS selector information on a webpage


1) Go to the page you want to scrape  
2) Open the inspector tool (right click + Inspect, or Cmd + Alt + I on a Mac)  
3) Click on the element-picker icon (the one with the mouse pointer): ![image.png](attachment:image.png)  
4) Select the element in the page you want to scrape. The HTML element will be shown on the right

**Note:** You need to repeat that for all the elements you want to scrape.
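
Once the inspector shows which tag, class, or id identifies an element, that information can be written as a CSS selector and passed to BeautifulSoup's `select` (all matches) or `select_one` (first match). A minimal sketch on a condensed copy of the photo page from earlier, assuming the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

html = """
<h1>My Photos</h1>
<div class="intro"><p>These are some photos of my trips.</p><img src="me.png"></div>
<h3>Italy</h3>
<div class="country" id="venice">
  <img src="venice1.png" alt="Venice"> <br />
  <img src="venice2.png" alt="Venice"> <br />
  <img src="rome.png" alt="Roma">
</div>
<h3>Germany</h3>
<div class="country"><img src="berlin.png" alt="Berlin"></div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div.country img": every <img> anywhere inside a <div class="country">
country_images = [img["src"] for img in soup.select("div.country img")]
# "#venice > img": <img> tags that are direct children of the element with id="venice"
venice_images = [img["src"] for img in soup.select("#venice > img")]
# 'img[alt="Venice"]': <img> tags whose alt attribute equals "Venice"
venice_alts = [img["src"] for img in soup.select('img[alt="Venice"]')]
# select_one returns the first match only
intro_text = soup.select_one("div.intro > p").get_text()

print(country_images, venice_images, venice_alts, intro_text)
```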


<a id='xpath'></a>

## Enter XPath and scrapy 

XPath uses path expressions to select nodes or node-sets in an HTML/XML document. These path expressions look very much like the expressions you see when you work with a traditional computer file system.

## What is XPath?

---


Understanding how to identify elements and attributes within HTML documents gives us the capability to write simple expressions that create structured data.  We can think of XPath as a query language for HTML.

To make this process easier, we will be using two Chrome extensions, ChroPath and XPath Helper.  They are not necessary, but highly recommended to help build XPath expressions.

- [ChroPath](https://chrome.google.com/webstore/detail/chropath/ljngjbnaijcbncmcnjfhigebomdlkcjo?hl=en): to get the XPath
- [XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en): to verify it

XPath expressions can select elements, element attributes, and element text.  A selection can match a single item or multiple items.  Generally, if you're not specific enough, you will end up selecting multiple elements.

#### Selecting elements matching an _attribute_

This will be one of the most common ways you will select items.  HTML DOM elements are mostly differentiated by their "class" and "id" attributes.  Web developers mainly use these attributes to refer to a specific element, or to a broad set of elements, when applying visual characteristics with CSS.

```HTML 
//element[@attribute="value"]
```
After that, each `/` added to the expression moves down to the next child element.

**Generally**

- "class" attributes within elements usually refer to multiple items
- "id" attributes are supposed to be unique, but in practice they are not always
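
Python's standard-library `xml.etree.ElementTree` understands a limited subset of XPath, which is enough to try the pattern above on a small snippet. This is a stand-in for illustration only: it is not the scrapy Selector used in the rest of this section, it needs well-formed markup, and it writes the leading `//` as `.//`. The snippet is a tidied-up fragment of the photo page.

```python
import xml.etree.ElementTree as ET

doc = """
<body>
  <div class="intro"><p>These are some photos of my trips.</p></div>
  <div class="country" id="venice"><img src="venice1.png" alt="Venice" /></div>
  <div class="country"><img src="berlin.png" alt="Berlin" /></div>
</body>
""".strip()
root = ET.fromstring(doc)

# element[@attribute="value"]: every <div> whose class is "country"
countries = root.findall(".//div[@class='country']")
# Two predicates narrow the match to one element
venice = root.findall(".//div[@class='country'][@id='venice']")
# Each extra "/" steps down to a child element
sources = [img.get("src") for img in root.findall(".//div[@class='country']/img")]

print(len(countries), len(venice), sources)
```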



## Let's go back to the above example:

In [None]:
html = '''
<!DOCTYPE html>
<html>

<head>
  <title>The title of this web page</title>
</head>

<body>
  <h1>My Photos</h1>
  <div class='intro'>
    <p>These are some photos of my trips.</p>
    <img src="me.png">
  </div>

  <h3>Italy</h3>
  <div class='country' id='venice'>
    <img src="venice1.png" alt="Venice"> <br />
    <img src="venice2.png" alt="Venice"> <br />
    <img src="rome.png" alt="Roma">
  </div>

  <h3>Germany</h3>
  <div class='country'>
    <img src="berlin.png" alt="Berlin">
  </div>
</body>

</html>
'''

In [None]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

In [None]:
# soup.find_all('div', attrs={'class':'country'})
# becomes with Xpath
from pprint import pprint
pprint(Selector(text=html).xpath('//div[@class="country"]').extract())

In [None]:
# soup.find_all('div', attrs={'class':'country', 'id':'venice'})
# becomes:
pprint(Selector(text=html).xpath('//div[@class="country"][@id="venice"]').extract())

In [None]:
# soup.select('div.intro > p')
# becomes:
pprint(Selector(text=html).xpath('//div[@class="intro"]/p').extract())

# Independent Practice:

Using either BeautifulSoup (CSS) or the scrapy Selector (XPath), create a function that scrapes all the articles on the TechCrunch homepage:
https://techcrunch.com/

**Bonus:** Implement a recursive loop to load more articles.

In [None]:
# A:

In [40]:
import pandas as pd

In [38]:
soup = BeautifulSoup(html, "lxml")  # name the parser explicitly so bs4 does not have to guess one


In [97]:
from time import sleep

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# class string that marks an article block on the homepage
ARTICLE_CLASS = 'post-block post-block--image post-block--unread'


# get articles function
def get_articles(num):
    # Selenium 4 style: the driver path is wrapped in a Service object
    # (Selenium 3 took it directly as executable_path=...)
    driver = webdriver.Chrome(service=Service('./chromedriver/chromedriver'))
    try:
        # go to page
        driver.get("https://techcrunch.com/")

        # convert the initial page source to a soup object
        soup = BeautifulSoup(driver.page_source, "lxml")
        # get the number of articles shown per page load
        articles_per_page = len(soup.find_all('article', {'class': ARTICLE_CLASS}))
        if articles_per_page == 0:
            # avoid looping forever if the page markup has changed
            raise ValueError('no articles matched on the page')
        # counter: the first page load already shows one batch of articles
        articles_on_page = articles_per_page

        # keep adding new articles until the desired number on the page is reached
        while articles_on_page < num:
            # scroll to the bottom of the page
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            # find the "load more" link and click it to load more articles
            driver.find_element(By.CLASS_NAME, 'load-more').click()
            # add to counter
            articles_on_page += articles_per_page
            # wait 5 seconds for the new articles to load
            sleep(5)

        # convert the fully loaded page source to a soup object
        soup = BeautifulSoup(driver.page_source, "lxml")
        rows = []
        # loop over the DOM nodes of all articles
        for article in soup.find_all('article', {'class': ARTICLE_CLASS}):
            # get the title
            title = article.select_one('a.post-block__title__link').text.strip()
            # author
            author = article.select_one('span.river-byline__authors').select_one('a').text.strip()
            # image
            image = article.select_one('img').attrs['src']
            # store each article as one row
            rows.append([title, author, image])
    finally:
        # close Selenium's browser even if scraping fails
        driver.quit()
    return pd.DataFrame(rows, columns=['title', 'author', 'image'])
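
Because the parsing step only needs an HTML string, it can be pulled out of the browser automation and tried on static markup. A minimal sketch under stated assumptions: `parse_articles` is a hypothetical helper, and the sample markup is invented to mirror the class names used above, not copied from TechCrunch.

```python
from bs4 import BeautifulSoup

ARTICLE_CLASS = "post-block post-block--image post-block--unread"


def parse_articles(page_html):
    """Return one dict per article block found in page_html (hypothetical helper)."""
    soup = BeautifulSoup(page_html, "html.parser")
    rows = []
    for article in soup.find_all("article", {"class": ARTICLE_CLASS}):
        title = article.select_one("a.post-block__title__link")
        author = article.select_one("span.river-byline__authors a")
        image = article.select_one("img")
        # select_one returns None when nothing matches, so guard each field.
        rows.append({
            "title": title.get_text(strip=True) if title else None,
            "author": author.get_text(strip=True) if author else None,
            "image": image.get("src") if image else None,
        })
    return rows


# Invented sample markup that reuses the class names from the function above.
sample_html = """
<article class="post-block post-block--image post-block--unread">
  <a class="post-block__title__link" href="/a"> First headline </a>
  <span class="river-byline__authors"><a href="/author/x">Jane Doe</a></span>
  <img src="a.png">
</article>
<article class="post-block post-block--image post-block--unread">
  <a class="post-block__title__link" href="/b">Second headline</a>
</article>
"""
articles = parse_articles(sample_html)
print(articles)
```

Passing the resulting list to `pd.DataFrame` gives a table with `title`, `author`, and `image` columns, and the missing fields show up as empty values instead of raising an error.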

In [98]:
articles = get_articles(50)
articles





WebDriverException: Message: unknown error: call function result missing 'value'
  (Session info: chrome=71.0.3578.80)
  (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.14.0 x86_64)
