# Lab 2b Part 1: Web Scraping

While web APIs and downloadable CSV files are convenient, a lot of online data is only available embedded in web pages. In those cases, custom web scraping code is the only way to collect it. For example, Billboard.com, the data-dense site we are going to scrape today using __beautiful soup__, does not have an official API. To access its data, the raw HTML needs to be downloaded and processed to extract the fields of interest.

## Task 0
Open the following in separate tabs (or desktops):

* [requests](http://docs.python-requests.org/en/master/user/quickstart/)
* [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag)
* [billboard](http://www.billboard.com/charts/hot-100)

Today, we will be using two libraries: __requests__ and __beautiful soup__. 

In [2]:
import requests
from bs4 import BeautifulSoup

# ---
## Task 1
__Using the python _requests_ library, send an HTTP GET request to http://www.billboard.com/charts/hot-100 and then print the text of the response. Store the response in a variable named `r`, since later cells use `r.text`. The _requests_ user guide is available [here](http://docs.python-requests.org/en/master/user/quickstart/).__

In [3]:
# YOUR CODE HERE

<!doctype html>
<html class="" lang="">
<head>

<script>
        _udn = "billboard.com";
    </script>
<script>function utmx_section(){}function utmx(){}(function(){var
                k='67942495-39',d=document,l=d.location,c=d.cookie;
            if(l.search.indexOf('utm_expid='+k)>0)return;
            function f(n){if(c){var i=c.indexOf(n+'=');if(i>-1){var j=c.
                    indexOf(';',i);return escape(c.substring(i+n.length+1,j<0?c.
                    length:j))}}}var x=f('__utmx'),xx=f('__utmxx'),h=l.hash;d.write(
                    '<sc'+'ript src="'+'http'+(l.protocol=='https:'?'s://ssl':
                            '://www')+'.google-analytics.com/ga_exp.js?'+'utmxkey='+k+
                            '&utmx='+(x?x:'')+'&utmxx='+(xx?xx:'')+'&utmxtime='+new Date().
                            valueOf()+(h?'&utmxhash='+escape(h.substr(1)):'')+
                            '" type="text/javascript" charset="utf-8"><\/sc'+'ript>')})();
    </script><script>utmx('url','A/B')

If you see a lot of text beginning with something like the following:

    <!doctype html>
    <html class="" lang="">
    <head>
    ...

then you have obtained the HTML from billboard that we will want to parse today. 

The Python package Beautiful Soup converts HTML pages into a tree representation that can be easily navigated.
Let's use it to parse the response we just downloaded.

In [4]:
# Beautiful soup allows us to treat HTML as a tree
soup = BeautifulSoup(r.text, 'lxml')
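
To see the tree idea in isolation, here is a minimal sketch that parses a small hand-written HTML string instead of the Billboard page. The HTML and the variable names (`demo`, `first_p`) are made up for illustration, and the sketch uses Python's built-in `'html.parser'` rather than `'lxml'` so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration, not taken from billboard.com
html = """
<html>
  <body>
    <div class="box">
      <p>first paragraph</p>
      <p>second paragraph</p>
    </div>
  </body>
</html>
"""

demo = BeautifulSoup(html, 'html.parser')

# Walk down the tree one tag at a time: body -> div -> first p
first_p = demo.body.div.p
print(first_p.text)         # text inside the first <p>
print(first_p.parent.name)  # name of the tag that encloses it
```

Each dotted step returns the first matching tag below the current one, which is why `demo.body.div.p` lands on the first paragraph.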

In order to navigate HTML as a tree, we need to understand what HTML is. Below is a basic intro to HTML. Additional information is available from [Mozilla](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics):
***
### So what is HTML, really?
HTML consists of a series of elements, which you use to enclose, or wrap, different parts of the content to cause it to render (appear) in a certain way, or act a certain way, when viewed in a browser. Enclosing tags can make a word or an image a hyperlink to somewhere else, italicize words, make the font bigger or smaller, organize text into blocks and paragraphs, and so on. For example, this string

`My cat is very grumpy`

Can be rendered as a paragraph by enclosing it in paragraph tags:

`<p>My cat is very grumpy</p>`

#### Anatomy of an HTML element
Let's explore this paragraph element a bit further.

<img src="https://mdn.mozillademos.org/files/9347/grumpy-cat-small.png" width=500>

The main parts of our element are:

* The opening tag: This consists of the name of the element (in this case, p), wrapped in opening and closing angle brackets. This states where the element begins, or starts to take effect — in this case where the start of the paragraph is.
* The closing tag: This is the same as the opening tag, except that it includes a forward slash before the element name. This states where the element ends — in this case where the end of the paragraph is. Failing to include a closing tag is one of the common beginner errors and can lead to strange results.
* The content: This is the content of the element, which in this case is just text.
* The element: The opening tag plus the closing tag plus the content equals the element.


Elements can also have attributes, which look like this:

<img src="https://mdn.mozillademos.org/files/9345/grumpy-cat-attribute-small.png" width=500>

Attributes contain extra information about the element that you don't want to appear in the actual content. Here, class is the attribute name, and editor-note is the attribute value. The class attribute allows you to give the element an identifier that can be later used to target the element with style information and other things.

An attribute should always have:

* A space between it and the element name (or the previous attribute, if the element already has one or more attributes).
* The attribute name, followed by an equals sign.
* Opening and closing quote marks wrapped around the attribute value.  
---
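The same anatomy is visible once Beautiful Soup has parsed an element. As a small sketch, assuming the grumpy-cat paragraph from above with the `editor-note` class and the built-in `'html.parser'` (the variable name `element` is only illustrative):

```python
from bs4 import BeautifulSoup

# Parse a single element and grab the resulting <p> tag
element = BeautifulSoup(
    '<p class="editor-note">My cat is very grumpy</p>', 'html.parser'
).p

print(element.name)    # the tag name
print(element.attrs)   # dictionary mapping attribute names to values
print(element.string)  # the content between the opening and closing tags
```
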
Run the following command to get the first `div` element whose `class` attribute has the value `chart-row__primary`.

In [5]:
soup.find('div', class_='chart-row__primary')

<div class="chart-row__primary">
<div class="chart-row__history chart-row__history--steady"></div>
<div class="chart-row__main-display">
<div class="chart-row__rank">
<span class="chart-row__current-week">1</span>
<span class="chart-row__last-week">Last Week: 1</span>
</div>
<div class="chart-row__image" style="background-image: url(http://charts-static.billboard.com/img/2016/12/taylor-swift.jpg)">
</div>
<div class="chart-row__container">
<div class="chart-row__title">
<h2 class="chart-row__song">Look What You Made Me Do</h2>
<a class="chart-row__artist" data-tracklabel="Artist Name" href="/artist/371422/taylor-swift">
Taylor Swift
</a>
</div>
</div>
<div class="chart-row__links">
<a class="chart-row__link chart-row__link--toggle js-chart-row-toggle" href="javascript:void(0);">
<i class="chart-row__icon fa fa-angle-down"></i>
</a>
</div>
</div>
</div>

---
## Task 2

__Based on the HTML above, classify each of the following as _tags_ or _attributes_. For example, `p` is a tag and `class` is an attribute.__

1. `div`
2. `a`
3. `data-tracklabel`
4. `href`
5. `span`

To select HTML elements in Beautiful soup, we use the following syntax:

    soup.p # Selects the first p element
    soup.h1 # Select the first h1 element

We can also use the `find` and `find_all` methods. Run the code below and then replace the expression with `soup.find_all('h1')` to get all of the `h1` elements.

In [7]:
# Replace the code here
soup.find('h1')

<h1 class="site-header__brand">
<a class="site-header__brand-link" href="/">
<img alt="Billboard" class="site-header__brand-logo" src="/static/frontend/2017_09_18_1748/assets/images/Billboard-white.svg"/>
<span class="site-header__brand-name">Billboard</span>
</a>
</h1>

You should see that your code returns a list:
    
    [<h1 class="site-header__brand">
    <a class="site-header__brand-link" href="/">
    <img alt="Billboard" class="site-header__brand-logo" src="/static/frontend/2017_09_18_1748/assets/images/Billboard-white.svg"/>
    <span class="site-header__brand-name">Billboard</span>
    </a>
    </h1>]
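
The difference between the two methods is easier to see on a tiny example. The following is a minimal sketch on a made-up HTML string (the list items and variable names are hypothetical), parsed with the built-in `'html.parser'`:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
demo = BeautifulSoup(html, 'html.parser')

first_item = demo.find('li')     # a single tag, or None if nothing matches
all_items = demo.find_all('li')  # a list-like collection of every match

print(first_item.text)
print(len(all_items))
print([li.text for li in all_items])
print(demo.find('table'))        # there is no <table> here, so this prints None
```

Because `find` returns `None` when nothing matches, it is worth checking the result before indexing into it or asking for its text.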
    
    
Note that the `img` tag in your output has an attribute `alt` equal to `"Billboard"`. We can extract this attribute from the image tag using the following syntax.

    soup.TAG['ATTRIBUTE']
    
Try running the following to get the value of the attribute `class` for the `a` tag inside the first `h1`:

In [8]:
soup.h1.a['class']

['site-header__brand-link']
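
The result above is a list because `class` can hold several space-separated values, while most other attributes come back as plain strings. A short sketch on a made-up link (the HTML and the name `link` are hypothetical) shows both cases, plus a way to ask for an attribute that may be missing:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration
html = '<a class="button primary" href="/charts" id="nav-link">Charts</a>'
link = BeautifulSoup(html, 'html.parser').a

print(link['href'])       # single-valued attribute: a string
print(link['class'])      # multi-valued attribute: a list of strings
print(link.get('title'))  # missing attribute: .get() returns None instead of raising KeyError
```
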

---
## Task 3
__Output the value of the attribute `alt` in the first `img` tag from above. Your answer should be Billboard.__

In [None]:
# YOUR CODE HERE

When finding HTML elements, it's hard to translate what is on the screen to raw HTML. Fortunately, modern browsers give you the opportunity to inspect the HTML code of different elements.

<img src="images/web-inspector.png" width="500px">

To do so, right click on some text in your browser. In the pop-up menu, click inspect. Now when you mouse over things on the web page, you can also see which HTML elements you are hovering over.

---

## Task 4

Did you know you can also edit the HTML with the web inspector and see the results render? This is useful for when an interviewer asks you to meme and you aren't prepared.

<img src="images/edited.png" width="500px">

__Rename a song or artist and share it with your neighbor. Then fill in the following fields.__


Congrats! You've gained some very important skills. ;-) 

---

## Getting started on homework
__Using the web inspector and your ability to navigate Beautiful Soup's documentation, find out how to get the _song name_, _artist name_, and _song rank_ of all 100 songs in `soup`.__ Hint: Try iterating over the results of an appropriate soup.find_all.

In [None]:
# YOUR CODE HERE

# Examples of the elements stored as a list:
# [1, 'Look What You Made Me Do', 'Taylor Swift']
# [2, 'Bodak Yellow (Money Moves)', 'Cardi B']

# Summary

### Useful Words

* HTML - Hypertext Markup Language

    * tags
    * attributes
    * elements
  
  Example: `<tag attribute="value">text</tag>`


### Python Workflow
```
import requests
from bs4 import BeautifulSoup

url = ...
r = requests.get(url)

# Converts the HTML string into a navigable Python object
soup = BeautifulSoup(r.text, 'lxml')

soup.<TAG_1>.<TAG_2>.<TAG_3>['<ATTRIBUTE_NAME>']
```
### Example Beautiful Soup Expressions
```
import re  # needed for the regex example below

soup.text
soup.a['href']
soup.find('h2', class_='chart-row__song')  # in the chart HTML above, this class is on an h2
soup.find_all('a', href=re.compile('notebooks')) # You can use regex too
```
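
To make the regex filter concrete, here is a minimal sketch on a made-up set of links (the paths other than `/charts/hot-100` are hypothetical), parsed with the built-in `'html.parser'`. It keeps only the `a` tags whose `href` starts with `/charts/`:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration
html = """
<a href="/charts/hot-100">Hot 100</a>
<a href="/charts/billboard-200">Billboard 200</a>
<a href="/news">News</a>
"""
demo = BeautifulSoup(html, 'html.parser')

# The compiled pattern is searched against each tag's href value
chart_links = demo.find_all('a', href=re.compile(r'^/charts/'))
print([a['href'] for a in chart_links])
```
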
