# Week 8 - Web Scraping

**Optional Reading:** <br>
  Data Wrangling with Python, Chapters 11 and 12 (pages 279 - 359)

**Highly Recommended Reading:** <br>
   - https://www.dataquest.io/blog/web-scraping-tutorial-python/
   - https://www.codingdojo.com/blog/html-vs-css-inforgraphic

**Overview:**<br>

* Web Scraping
   - Online Tutorial

* Use Case: BeautifulSoup
   - Web Page Inspection
   - Web Scraping Example

* Use Case: gazpacho
   - Tabular Data
   - Multiple Tables on a Single Web Page

# Web Scraping

In today's world, nearly everything can be found on the internet.  However, not everything is easily accessible. Web scraping is one way of mining data from the internet. Web scraping has two components: fetching the page and extracting data from the page.

Fetching is the downloading of a page for use. If all you wanted to do was "view" a web page, the fetching is accomplished by your favorite browser. Once a page is fetched, extraction can take place. The content of a page may be parsed, searched, reformatted, its data copied into a spreadsheet, and so on. Web scrapers typically take something out of a page to make use of it for another purpose somewhere else. An example would be to find and copy names and phone numbers, or companies and their URLs, to a list (contact scraping).
(https://en.wikipedia.org/wiki/Web_scraping)
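
The two components can be sketched with nothing but the Python standard library. This is only a minimal illustration and not the approach used later in these notes: `fetch` and `LinkExtractor` are hypothetical names, and the HTML string is made-up sample data so that the extraction step runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Fetch step: download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkExtractor(HTMLParser):
    """Extract step: collect (link text, URL) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are inside, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Made-up sample page standing in for the result of fetch(url)
sample_html = """
<ul>
  <li><a href="https://example.com/acme">Acme Corp</a></li>
  <li><a href="https://example.com/globex">Globex</a></li>
</ul>
"""

extractor = LinkExtractor()
extractor.feed(sample_html)
print(extractor.links)
```

In practice a dedicated library handles the extraction far more conveniently, which is what the rest of this notebook demonstrates.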



## Online Tutorial
As mentioned above in the readings section of this FTE, please take the time to work through the online tutorial at https://www.dataquest.io/blog/web-scraping-tutorial-python/.

<div class="alert alert-block alert-warning">
<b>Note about Web Scraping:</b> Some websites frown upon being scraped and will block or blacklist your IP address with no warning. Proceed with caution!
    
Many websites publish a file that lets you determine whether or not scraping is allowed.  You can find this file by appending /robots.txt to the domain name.  As an example, here is the robots.txt file for Regis's website (https://www.regis.edu/robots.txt):
```
User-agent: *
Disallow: /News-Events-Media/Event-Calendar/Event.aspx?*
sitemap: /sitemap.xml
```
</div>
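
A robots.txt file can also be checked from code. The sketch below uses the standard library's `urllib.robotparser` on a small set of hypothetical rules (not the Regis file shown above) so that it runs without a network connection. Note that this parser matches rules by simple path prefix, so wildcard patterns may not be interpreted the way a particular site intends.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules we already have; no network needed

allowed_public = parser.can_fetch("*", "https://example.com/index.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/report.html")

print(allowed_public)   # True
print(allowed_private)  # False
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real file before `can_fetch` is called.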

# Use Case: BeautifulSoup 
The Dataquest tutorial uses the Python library BeautifulSoup.  We will use BeautifulSoup to gather a couple of the text fields from https://www.marketwatch.com/investing/index/djia.

### Web Page Inspection
Before we actually start scraping the information from this page, let's inspect the page a little closer.

<img align="center" style="padding-right:10px;" src="figures_8/DJIA_website.png" width=600 >

For this exercise, let's try to retrieve the information in the upper left corner directly under the title Dow Jones Industrial Average.

<img align="center" style="padding-right:10px;" src="figures_8/DJIA_retrieve.png" >

As the tutorial pointed out, Chrome makes inspecting the HTML tags on a web page very easy.  Upon inspection, we can see the following:

<img align="left" style="padding-right:10px;" src="figures_8/DJIA_inspection.png">

### Web Scraping Example

<div class="alert alert-block alert-success">
<b>Installation - requests and beautifulsoup4</b> <br>
    pip install requests <br>
    pip install beautifulsoup4
</div>

In [1]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [3]:
import requests
from bs4 import BeautifulSoup

In [4]:
url = "https://www.marketwatch.com/investing/index/djia"
page = requests.get(url)

# verify the status_code is a 200
page

<Response [200]>

So far so good.  Time to gather the data from the smaller section of the web page.

From our inspection we can see that we need to retrieve the HTML tag `<div class="element element--intraday">`

In [5]:
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.find('div', class_='element element--intraday')

In [4]:
# looking at all the HTML tags within this subsection of the web page
data

<div class="element element--intraday">
<small class="intraday__status status--closed"><span class="company__ticker">DJIA</span><span class="company__market">US</span><i class="icon--lock"></i><div class="status">Closed</div><span class="scroll-top">Back To Top</span></small>
<div class="intraday__timestamp">
<span class="timestamp__time">Last Updated: <bg-quote channel="/zigman2/quotes/210598065/realtime" field="date">Dec 13, 2019 5:14 p.m.</bg-quote> EST</span>
</div>
<div class="intraday__data">
<h3 class="intraday__price">
<sup class="character"></sup>
<span class="value">28,135.38</span>
</h3>
<bg-quote channel="/zigman2/quotes/210598065/realtime" class="intraday__change positive">
<i class="icon--caret"></i>
<span class="change--point--q">3.33</span>
<span class="change--percent--q">0.01%</span>
</bg-quote>
</div>
<div class="intraday__close">
<table class="table table--primary align--right">
<thead>
<tr class="table__row">
<th class="table__heading">Previous Close</th>
</tr>
</t

At this point we could use string manipulation to parse our data into the individual elements within this section of the web page. <br>
<b> OR </b> <br> 
We could go after the individual elements by narrowing our search criteria within BeautifulSoup.

Let's try to narrow our search and return the following specific elements:
   * status of the market 
   * current price 
   * point amount of the change
   * percentage of the change
   * timestamp

Might as well start working from the top of our list.

HTML: status of the market<br>
`<div class="status">Closed</div>`

In [6]:
# status of the market
status = soup.find('div', class_="status").get_text()
status

'Closed'

Getting the price is going to be a little bit more interesting. The actual value is nested within a couple of HTML elements.

Good news! We can chain our find() operations together.

HTML: current price<br>
`<h3 class="intraday__price"> ` <br>
`<sup class="character"></sup>` <br>
`<span class="value">28,135.38</span>`<br>
`</h3>`


In [7]:
# current price
price = soup.find('h3', class_="intraday__price").find('span', class_="value").get_text()
price

'28,239.28'
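
One caveat with chaining: `find()` returns `None` when nothing matches, so a chained call raises an `AttributeError` if the page layout changes. The sketch below shows one possible guard, run against the `<h3>` snippet copied from the inspected HTML. `find_text` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Snippet copied from the inspected HTML
html = """
<h3 class="intraday__price">
<sup class="character"></sup>
<span class="value">28,135.38</span>
</h3>
"""
soup = BeautifulSoup(html, "html.parser")


def find_text(soup, *steps):
    """Follow a chain of (tag, class) lookups; return None if any step misses."""
    node = soup
    for tag, css_class in steps:
        node = node.find(tag, class_=css_class)
        if node is None:
            return None
    return node.get_text(strip=True)


price = find_text(soup, ("h3", "intraday__price"), ("span", "value"))
missing = find_text(soup, ("h3", "no-such-class"), ("span", "value"))
print(price)    # 28,135.38
print(missing)  # None
```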

HTML: point amount of the change<br>
`<span class="change--point--q">3.33</span>`

In [8]:
# point amount of the change
change_amount = soup.find('span', class_="change--point--q").get_text()
change_amount

'-27.88'

HTML: percentage of the change <br>
`<span class="change--percent--q">0.01%</span>`

In [10]:
# percentage of the change
change_percentage = soup.find('span', class_="change--percent--q").get_text()
change_percentage

'-0.10%'

HTML: timestamp <br>
`<span class="timestamp__time">Last Updated: <bg-quote channel="/zigman2/quotes/210598065/realtime" field="date">`


In [11]:
# timestamp
timestamp = soup.find('span', class_="timestamp__time").find('bg-quote').get_text()
timestamp

'Dec 18, 2019 4:52 p.m.'

In [10]:
# print all of our fields out
print(f'The current price of the DJIA as of {timestamp} is {price}. \nThis is a change of {change_amount} or {change_percentage} since the last report')

The current price of the DJIA as of Dec 13, 2019 5:14 p.m. is 28,135.38. 
This is a change of 3.33 or 0.01% since the last report


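Hooray! All of this matches the data found on the web page. Because the page is live, cells run at different times return different values, which is why some of the outputs above differ from one another.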
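
Everything returned by `get_text()` is a string, so values such as `'28,135.38'` or `'-0.10%'` need converting before any arithmetic. A minimal sketch, assuming the formats seen in the outputs above (`to_number` is a hypothetical helper name):

```python
def to_number(text):
    """Convert a scraped string such as '28,135.38' or '-0.10%' to a float.

    Thousands separators and a trailing percent sign are dropped, so
    '-0.10%' becomes -0.1 (percentage points, not a fraction).
    """
    cleaned = text.strip().replace(",", "").rstrip("%")
    return float(cleaned)


# Sample strings taken from the cell outputs above
price = to_number("28,135.38")
change_amount = to_number("-27.88")
change_percentage = to_number("-0.10%")

print(price, change_amount, change_percentage)
```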


***
# Use Case: Gazpacho
gazpacho is a web scraping library. It replaces requests and BeautifulSoup for most projects. gazpacho is small, simple, fast, and consistent. (https://pypi.org/project/gazpacho/ and https://maxhumber.com/scraping_fantasy_hockey)

<div class="alert alert-block alert-success">
<b>Installation - gazpacho</b> <br>
pip install -U gazpacho
</div>

In [11]:
# verify we still have the url of our web page
url

'https://www.marketwatch.com/investing/index/djia'

In [12]:
#imports
from gazpacho import get, Soup

Here is where the power of gazpacho starts to come into play.  Notice that we don't need to use two Python packages to fetch and extract our web page.

In [13]:
# fetch step
html = get(url)

# extract step
soup = Soup(html)

In [14]:
# use find() to pull out a specific HTML element
gaz_data = soup.find('div', {'class':'element element--intraday'})
gaz_data

<div class="element element--intraday">
                <small class="intraday__status status--closed"><span class="company__ticker">DJIA</span><span class="company__market">US</span><i class="icon--lock"></i><div class="status">Closed</div><span class="scroll-top">Back To Top</span></small>
        
            <div class="intraday__timestamp">
                <span class="timestamp__time">Last Updated: <bg-quote field="date" channel="/zigman2/quotes/210598065/realtime">Dec 13, 2019 5:14 p.m.</bg-quote> EST</span>
                
            </div>
            <div class="intraday__data">
                <h3 class="intraday__price ">
                    <sup class="character"></sup>
                    <span class="value">28,135.38</span>
                </h3>
                <bg-quote channel="/zigman2/quotes/210598065/realtime" class="intraday__change positive">
                    <i class="icon--caret"></i>
                    <span class="change--point--q">3.33</span>
          

Okay?!?!  So it's a little easier to scrape the web page with gazpacho, but it really doesn't seem to be a huge difference from BeautifulSoup at this point.

***
### Tabular Data
One thing that gazpacho makes really easy is parsing a table within a web page. The last section of our gaz_data has a web table within it.   Let's see how gazpacho handles the data in this table.

In [15]:
# narrow things down to the table
gaz_table = soup.find('div', {'class':'intraday__close'})
gaz_table

<div class="intraday__close">
                <table class="table table--primary align--right">
                    <thead>
                        <tr class="table__row">
                                <th class="table__heading">Previous Close</th>
                        </tr>
                    </thead>
                    <tbody class="remove-last-border">
                        <tr class="table__row">
                            <td class="table__cell u-semi">28,132.05</td>
                        </tr>
                    </tbody>
                </table>
            </div>

Now if we could partner our extracted data with pandas dataframes to organize and store this information, that would be fabulous!

In [16]:
import pandas as pd

df = pd.read_html(str(gaz_table.html))
df

[   Previous Close
 0        28132.05]

Interesting!!  The data matches, but this does not look like a pandas df.

In [17]:
type(df)

list

Oh! Since this is a list, we need to reference the index within the list.

In [18]:
df = pd.read_html(str(gaz_table.html))[0]
df

Unnamed: 0,Previous Close
0,28132.05


Awww.. Getting better...  But still not as impressive as it could have been.  That particular table only has one row and one column.  

Let's try something a bit more impressive.   Let's try weather data for Denver, CO.
(https://weather.com/weather/hourbyhour/l/40097af22d207c3c7e8e3bfa1102bc635296631a89ac0d4eb688f926f092f4fd)

<img align="left" style="padding-right:10px;" src="figures_8/Weather_Den_CO.png" width=600 >

In [19]:
# define our URL
url_2 = "https://weather.com/weather/hourbyhour/l/40097af22d207c3c7e8e3bfa1102bc635296631a89ac0d4eb688f926f092f4fd"

In [20]:
# fetch step
html_2 = get(url_2)

# extract step
soup_2 = Soup(html_2)

One of the benefits of pairing gazpacho with pandas for tabular data is that you really don't need to know the HTML tags to retrieve the information: `pd.read_html` finds the `<table>` elements on its own.  So, all we should need to do is pass the HTML of our soup_2 variable to pandas.

In [21]:
df = pd.read_html(str(soup_2.html))[0]
df

Unnamed: 0,Time,Description,Temp,Feels,Precip,Humidity,Wind,Unnamed: 7
0,,7:15 pmSun,Cloudy,27°,27°,0%,63%,NNE 3 mph
1,,8:00 pmSun,Cloudy,27°,27°,0%,68%,WSW 3 mph
2,,9:00 pmSun,Cloudy,26°,22°,0%,74%,WSW 4 mph
3,,10:00 pmSun,Cloudy,26°,22°,0%,78%,W 3 mph
4,,11:00 pmSun,Cloudy,26°,22°,5%,73%,NW 4 mph
5,,12:00 amMon,Cloudy,26°,26°,10%,74%,N 2 mph
6,,1:00 amMon,Cloudy,26°,22°,15%,72%,NE 4 mph
7,,2:00 amMon,Cloudy,25°,21°,20%,75%,NE 4 mph
8,,3:00 amMon,Cloudy,26°,21°,25%,75%,NNE 4 mph
9,,4:00 amMon,Cloudy,25°,20°,25%,77%,N 4 mph


Five lines of code and we have a dataframe with the contents of the web page.  Not bad!

***

### Multiple Tables on a Single Web Page
What if we wanted to scrape a web page that contained multiple tables?

Let's take a look at https://www.espn.com/college-football/stats <br>
We can see that there are actually 6 smaller tables on this one page.

<img align="left" style="padding-right:10px;" src="figures_8/College_stats.png" width=600 >

Hopefully gazpacho handles these as smoothly as the prior examples.

In [22]:
college_stats = "https://www.espn.com/college-football/stats"

In [23]:
# fetch/extract
html_stats = get(college_stats)
soup_stats = Soup(html_stats)

# store in df
stats = pd.read_html(str(soup_stats.html))[0]
stats

Unnamed: 0,Passing,YDS
0,1Anthony GordonWSU,5228
1,2Joe BurrowLSU,4715
2,3Josh LoveSJSU,3923
3,4Brock PurdyISU,3760
4,5Cole McDonaldHAW,3642
5,Complete Leaders,Complete Leaders


Hmmm... we got one of the tables. In the prior examples, we had a list with a single element for one table.  Perhaps, each table is a separate list element.

In [24]:
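# count the tables on the page; len(stats) would only count the rows of the first table
len(pd.read_html(str(soup_stats.html)))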

6

Looks like it.  So now we need to loop through the list of tables and pull out each table into a separate dataframe.

To do this we will create a dictionary where the key is a dataframe name and the value is the dataframe itself.

In [25]:
# parse the page once, then store each table under its own name
tables = pd.read_html(str(soup_stats.html))
df_dict = {}
for i, table in enumerate(tables):
    name = 'stats_' + str(i)
    df_dict[name] = table

In [26]:
# verifying one of the dataframes for now
df_dict['stats_3']

Unnamed: 0,Tackles,TOT
0,1Evan WeaverCAL,172
1,2Dele HardingILL,148
2,3John LakoAKR,138
3,4Javahn FergursonNMSU,133
4,5Treshaun HaywardWMU,132
5,Complete Leaders,Complete Leaders


We can loop through and print out each dataframe.

In [27]:
# display each df - print will remove some of the pretty formatting
for df in df_dict.values():
    print(df)

              Passing               YDS
0  1Anthony GordonWSU              5228
1      2Joe BurrowLSU              4715
2      3Josh LoveSJSU              3923
3     4Brock PurdyISU              3760
4   5Cole McDonaldHAW              3642
5    Complete Leaders  Complete Leaders
               Rushing               YDS
0   1Chuba HubbardOKST              1936
1  2Jonathan TaylorWIS              1909
2     3J.K. DobbinsOSU              1829
3   4Malcolm PerryNAVY              1804
4         5AJ DillonBC              1685
5     Complete Leaders  Complete Leaders
                  Receiving               YDS
0         1Ja'Marr ChaseLSU              1498
1         2Omar BaylessARST              1473
2  3Antonio Gandy-GoldenLIB              1333
3        4Devin DuvernayTEX              1294
4         5Gabriel DavisUCF              1241
5          Complete Leaders  Complete Leaders
                 Tackles               TOT
0        1Evan WeaverCAL               172
1       2Dele HardingILL 

Awesome!  We have all 6 tables and they compare with the information off the web page. 
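
The list behavior of `pd.read_html` can be reproduced on a small scale. The sketch below uses a made-up two-table HTML string (the player names are placeholders) and also shows the `match` parameter, which keeps only the tables whose text matches a string or regex. It assumes an HTML parser that pandas can use, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Made-up two-table page standing in for a fetched document
html = """
<table><tr><th>Passing</th><th>YDS</th></tr>
       <tr><td>Player A</td><td>5228</td></tr></table>
<table><tr><th>Tackles</th><th>TOT</th></tr>
       <tr><td>Player B</td><td>172</td></tr></table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
print(len(tables))

# match= keeps only the tables whose text matches the given string/regex
tackles = pd.read_html(StringIO(html), match="Tackles")[0]
print(tackles)
```

Selecting by `match` can be more robust than a positional index when a site reorders its tables.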

Before we wrap things up, we should mention that you can use `find()` within a gazpacho extract. 

Let's go after the same 6 tables, but narrow the extract search first.
<img align="left" style="padding-right:10px;" src="figures_8/College_stats_inspect.png">

HTML: 
`<div class="layout__column layout__column--1">`<br>
`<section class="Card statistics__main">`

In [28]:
stats_narrowed_search = soup_stats.find('div', {'class':'layout__column layout__column--1'}) \
                                  .find('section', {'class':"Card statistics__main"})

In [29]:
# let's go after just one of the tables
stats_narrowed = pd.read_html(str(stats_narrowed_search.html))[3]
stats_narrowed

Unnamed: 0,Tackles,TOT
0,1Evan WeaverCAL,172
1,2Dele HardingILL,148
2,3John LakoAKR,138
3,4Javahn FergursonNMSU,133
4,5Treshaun HaywardWMU,132
5,Complete Leaders,Complete Leaders
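
Notice that each cell fuses the rank, player name, and team code (for example `1Evan WeaverCAL`). A regular expression can split them apart. This is a heuristic sketch only: it assumes the rank is the leading digits and the team is the trailing run of capital letters, and `split_leader` is a hypothetical helper name.

```python
import re

# Heuristic: leading digits = rank, trailing run of capitals = team code
PATTERN = re.compile(r"^(\d+)(.+?)([A-Z]{2,})$")


def split_leader(cell):
    """Split a fused cell like '1Evan WeaverCAL' into (rank, name, team)."""
    match = PATTERN.match(cell)
    if match is None:
        return None  # e.g. the 'Complete Leaders' footer row
    rank, name, team = match.groups()
    return int(rank), name.strip(), team


print(split_leader("1Evan WeaverCAL"))
print(split_leader("5AJ DillonBC"))
print(split_leader("Complete Leaders"))
```

A name that itself ends in capital letters would be split incorrectly, so the results should be checked before relying on them.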


### Recap

<div class="alert alert-block alert-info">
<b>Helpful Hint:</b> Use the tool that matches the activity
</div>

As we have seen multiple times throughout this course, the selection of your toolset is an important consideration in any data engineering effort. Almost always, there are multiple ways to complete any data engineering task. Depending on the libraries and data structures you elect to use, the difficulty of your task will vary.