## Web Scraping 101: BeautifulSoup

[BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

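Web scraping means programmatically extracting data from web pages.

Scraping is feasible because pages are written in HTML, and similar pages on a site tend to present the same kind of information in the same place in that structure.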

## HTML Refresher

### Overview
* HTML is the basic language used to create a web page. 
* It tells the web browser what text/media to display, where to display it, and how to display it (style)
* HTML is very structured/hierarchical.
* Every page is made up of discrete "elements."

### Tags

* Elements are labeled with "tags."

* For example:

    ```html
    <p>You are beginning to learn HTML.</p>
    ```

### Attributes

* A start tag also often contains "attributes" with info about the element.

* Attributes usually have a name and value.

* Example:

```html
<p class="my_red_sentences">You are beginning to learn HTML.</p>
```

### Structure

A full HTML document has a structure more like this:

```html
<html> 
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="http://www.google.com"> Some link </a>
  </body>
</html>
```
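To make the hierarchy concrete, here is a minimal sketch that walks a document shaped like the one above and records how deeply each tag is nested. It uses only the standard library's `html.parser`; the `OutlineParser` class and the `HTML_DOC` string are illustrative names, not part of the original notebook.

```python
from html.parser import HTMLParser

# A small document shaped like the Structure example (illustrative copy).
HTML_DOC = """
<html>
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="http://www.google.com"> Some link </a>
  </body>
</html>
"""


class OutlineParser(HTMLParser):
    """Record every start tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag, attributes-as-dict)

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = OutlineParser()
parser.feed(HTML_DOC)

# Print an indented outline: children appear one level deeper than parents.
for depth, tag, attrs in parser.outline:
    print("  " * depth + tag, attrs)
```

The `p`, `h1`, and `a` elements all come out at the same depth because they are siblings inside `body`, which is itself a child of `html`.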

### Explore in Browser

* Let's explore some live HTML!
* Go to http://boxofficemojo.com/movies/?id=biglebowski.htm in your browser, preferably Chrome.
* Right-click the page and choose Inspect Element, then also try View Page Source.

## HTML to BeautifulSoup

### Request data for The Big Lebowski

Scrape some information about [The Big Lebowski](http://boxofficemojo.com/movies/?id=biglebowski.htm).

In [1]:
from __future__ import print_function, division

In [2]:
# if needed: pip install requests or conda install requests
import requests

requests.__path__

['/anaconda3/lib/python3.7/site-packages/requests']

In [3]:
url = 'http://boxofficemojo.com/movies/?id=biglebowski.htm'

response = requests.get(url)

### Check the Status

For information on HTTP status codes, see:

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [4]:
response.status_code # status code = 200 => OK

200
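Rather than eyeballing the status code, a script can let `requests` raise an error for any 4xx or 5xx response. The sketch below assumes a hypothetical helper named `check_status` and uses hand-built `Response` objects as stand-ins so that it runs without a network connection; with a real request you would pass in the result of `requests.get(url)`.

```python
import requests


def check_status(response):
    """Return the response unchanged, or raise HTTPError for 4xx/5xx codes."""
    response.raise_for_status()
    return response


# Hand-built stand-ins for real responses (assumption: offline demo only).
ok = requests.Response()
ok.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(check_status(ok).status_code)

try:
    check_status(missing)
except requests.exceptions.HTTPError as err:
    print("request failed:", err)
```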

### Look at the Text

In [5]:
#print(response.text)
type(response.text)

str

### Soupify the Text
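A minimal sketch of this step, assuming the `bs4` package is installed (`pip install beautifulsoup4`). To keep it runnable offline, a small inline string named `sample_html` stands in for `response.text`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text so the sketch runs without a network call.
sample_html = """
<html>
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="http://www.google.com"> Some link </a>
  </body>
</html>
"""

# Parse the raw text into a navigable tree using Python's built-in parser.
soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching element.
first_p = soup.find("p")
print(first_p.text)

# Attributes are read like dictionary keys; class is returned as a list
# because an element may carry several classes.
print(first_p["class"])
print(soup.find("a")["href"])

# find_all(True) matches every tag nested under body.
print([tag.name for tag in soup.body.find_all(True)])
```

For the live page, the same constructor call would take `response.text` in place of `sample_html`.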

### Pandas Alternative

In [6]:
# you can also use pandas to read tables
import pandas as pd

url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

In [7]:
tables = pd.read_html(url)

In [8]:
tables[1]
# how can you fix the header?

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,924,925,926,927,928,929,930,931,932,933
0,Foreign Language 1980-PresentOnly overseas-pro...,Rank,Title (click to view),Studio,Lifetime Gross / Theaters,Opening / Theaters,Date,1,"Crouching Tiger, Hidden Dragon(Taiwan)",SPC,...,TBD,Kaala,,TBD,Kler,,TBD,Planeta Singli 2,,TBD
1,Rank,Title (click to view),Studio,Lifetime Gross / Theaters,Opening / Theaters,Date,,,,,...,,,,,,,,,,
2,1,"Crouching Tiger, Hidden Dragon(Taiwan)",SPC,"$128,078,872",2027,"$663,205",16,12/8/00,,,...,,,,,,,,,,
3,2,Life Is Beautiful(Italy),Mira.,"$57,563,264",1136,"$118,920",6,10/23/98,,,...,,,,,,,,,,
4,3,Hero(China),Mira.,"$53,710,019",2175,"$18,004,319",2031,8/27/04,,,...,,,,,,,,,,
5,4,Instructions Not Included,LGF,"$44,467,206",978,"$7,846,426",348,8/30/13,,,...,,,,,,,,,,
6,5,Pan's Labyrinth(Mexico),PicH,"$37,634,615",1143,"$568,641",17,12/29/06,,,...,,,,,,,,,,
7,6,Amelie(France),Mira.,"$33,225,499",303,"$136,470",3,11/2/01,,,...,,,,,,,,,,
8,7,Jet Li's Fearless(China),Rog.,"$24,633,730",1810,"$10,590,244",1806,9/22/06,,,...,,,,,,,,,,
9,8,Il Postino(Italy),Mira.,"$21,848,932",430,"$95,310",10,6/16/95,,,...,,,,,,,,,,


In [9]:
tables[2][0:5]

Unnamed: 0,0,1,2,3,4,5,6,7
0,Rank,Title (click to view),Studio,Lifetime Gross / Theaters,Opening / Theaters,Date,,
1,1,"Crouching Tiger, Hidden Dragon(Taiwan)",SPC,"$128,078,872",2027,"$663,205",16.0,12/8/00
2,2,Life Is Beautiful(Italy),Mira.,"$57,563,264",1136,"$118,920",6.0,10/23/98
3,3,Hero(China),Mira.,"$53,710,019",2175,"$18,004,319",2031.0,8/27/04
4,4,Instructions Not Included,LGF,"$44,467,206",978,"$7,846,426",348.0,8/30/13
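In the tables above, the real column names sit in row 0 while the columns themselves are just numbered. One way to fix the header is to promote that first row to column labels and drop it from the data. The sketch below uses a trimmed, hand-typed copy of the first rows (three columns only) so it runs offline; `raw` and `fixed` are illustrative names.

```python
import pandas as pd

# Trimmed stand-in for tables[2]: the header text is stuck in row 0.
raw = pd.DataFrame([
    ["Rank", "Title (click to view)", "Studio"],
    ["1", "Crouching Tiger, Hidden Dragon(Taiwan)", "SPC"],
    ["2", "Life Is Beautiful(Italy)", "Mira."],
])

# Drop row 0 from the data, then use its values as the column labels.
fixed = raw.iloc[1:].reset_index(drop=True)
fixed.columns = raw.iloc[0].tolist()

print(fixed)
```

Alternatively, `pd.read_html` accepts a `header` argument, so passing `header=0` at read time may achieve the same result, depending on how the page's table is laid out.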


Conclusion: Beautiful Soup is powerful, but it has limitations. It only parses the HTML it is handed, so if a page requires interaction (such as entering a password) or is generated dynamically rather than served as static HTML, Beautiful Soup alone is not enough. We will explore other tools for those cases.

One such tool for tomorrow: *Selenium*

In [28]:
nfl_wk1_2017 = 'https://www.pro-football-reference.com/years/2017/week_1.htm'

In [29]:
football_tbls = pd.read_html(nfl_wk1_2017)

In [30]:
len(football_tbls)

31

In [31]:
football_tbls[28] #Last Game

Unnamed: 0,0,1,2
0,"Sep 11, 2017",,
1,Los Angeles Chargers,21.0,Final
2,Denver Broncos,24.0,


In [32]:
football_tbls[30] # Players of the Week

Unnamed: 0,Conf,Offense,Defense,Special Teams
0,AFC,Alex Smith,Calais Campbell,Giorgio Tavecchio
1,NFC,Sam Bradford,Trumaine Johnson,Matt Prater


In [15]:
nfl_inGm_sts = 'https://www.pro-football-reference.com/boxscores/201709100sfo.htm'
in_gm_stats = pd.read_html(nfl_inGm_sts)
len(in_gm_stats)

2

In [16]:
in_gm_stats[1]

Unnamed: 0,Quarter,Time,Tm,Detail,CAR,SFO
0,1.0,3:00,Panthers,Russell Shepard 40 yard pass from Cam Newton (...,7,0
1,2.0,3:23,Panthers,Graham Gano 39 yard field goal,10,0
2,,0:00,Panthers,Graham Gano 36 yard field goal,13,0
3,3.0,11:24,Panthers,Jonathan Stewart 9 yard pass from Cam Newton (...,20,0
4,,3:11,Panthers,Graham Gano 20 yard field goal,23,0
5,,0:08,49ers,Robbie Gould 44 yard field goal,23,3


In [17]:
import requests
stats = requests.get(nfl_inGm_sts).text

In [18]:
stats = stats.replace('\n', '').replace('<!--', '').replace('-->', '')

In [19]:
stats



In [20]:
game_html = pd.read_html(stats)

In [21]:
len(game_html)

36
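The jump from 2 tables to 36 suggests that the page ships most of its tables wrapped in HTML comments, which parsers skip. Removing the `<!--` and `-->` markers exposes them. Here is a small offline sketch of that effect using only the standard library; `TableCounter`, `count_tables`, and the `sample` string are illustrative and not taken from the real page.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> start tags; content inside comments is not parsed as tags."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_tables(html_txt):
    counter = TableCounter()
    counter.feed(html_txt)
    return counter.tables


# One visible table and one hidden inside an HTML comment.
sample = """
<div><table id="scoring"><tr><td>visible</td></tr></table></div>
<div>
<!--
<table id="game_info"><tr><td>hidden in a comment</td></tr></table>
-->
</div>
"""

# Same stripping as in the cell above: drop newlines and comment markers.
stripped = sample.replace('\n', '').replace('<!--', '').replace('-->', '')

print(count_tables(sample))    # 1
print(count_tables(stripped))  # 2
```

Note that this blunt string replacement would also expose anything else the page had commented out, so the table indexes should be checked by inspection, as done in the cells that follow.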

In [22]:
game_info = game_html[17]
game_info

Unnamed: 0,0,1
0,Game Info,
1,Won Toss,49ers (deferred)
2,Roof,outdoors
3,Surface,grass
4,Weather,"87 degrees, wind 8 mph"
5,Vegas Line,Carolina Panthers -4.5
6,Over/Under,45.5 (under)


In [23]:
score = game_html[15]
score

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,1,2,3,4,Final
0,via Sports Logos.net\tAbout logos,Carolina Panthers,7,6,10,0,23
1,via Sports Logos.net\tAbout logos,San Francisco 49ers,0,0,3,0,3


In [24]:
team_stats = game_html[20]
team_stats

Unnamed: 0.1,Unnamed: 0,CAR,SFO
0,First Downs,20,13
1,Rush-Yds-TDs,38-116-0,15-51-0
2,Cmp-Att-Yd-TD-INT,14-25-171-2-1,24-35-193-0-1
3,Sacked-Yards,0-0,4-27
4,Net Pass Yards,171,166
5,Total Yards,287,217
6,Fumbles-Lost,2-1,1-1
7,Turnovers,2,2
8,Penalties-Yards,5-40,10-74
9,Third Down Conv.,7-13,2-11


In [25]:
off_stats = game_html[21]

In [26]:
def_stats = game_html[32]
def_stats

Unnamed: 0,Player,Tm,L End,L Tckl,L Guard,Middle,R Guard,R Tckl,R End
0,Mike Adams,CAR,1,0,0,1,0,0,0
1,Mario Addison,CAR,0,0,1,0,0,0,0
2,James Bradberry,CAR,0,0,0,0,0,0,1
3,Kurt Coleman,CAR,1,0,0,0,0,0,0
4,Thomas Davis,CAR,0,0,0,1,1,0,1
5,Wes Horton,CAR,0,0,0,1,0,0,0
6,Luke Kuechly,CAR,0,2,0,1,0,0,0
7,Star Lotulelei,CAR,0,0,0,1,0,0,0
8,Kyle Love,CAR,0,0,1,0,0,0,0
9,Kawann Short,CAR,0,2,0,0,0,0,0


In [51]:
week1 = requests.get('https://www.pro-football-reference.com/years/2017/week_1.htm').text

In [52]:
week1



In [53]:
def strip_html_nwlns_cmnts(html_txt):
    # Drop newlines and comment markers so tables wrapped in <!-- --> become parseable
    rslt = html_txt.replace('\n', '').replace('<!--', '').replace('-->', '')
    return rslt

h = strip_html_nwlns_cmnts(week1)



In [56]:
import requests

def strip_html_nwlns_cmnts(html_txt):
    rslt = html_txt.replace('\n', '').replace('<!--', '').replace('-->', '')
    return rslt

week1 = requests.get('https://www.pro-football-reference.com/years/2017/week_1.htm').text

stripped = strip_html_nwlns_cmnts(week1)

In [57]:
is_pd_html_rd_work = pd.read_html(stripped)

In [59]:
len(is_pd_html_rd_work)

35

In [65]:
is_pd_html_rd_work[29]

Unnamed: 0,0,1,2
0,PassYds,Siemian-DEN,219
1,RushYds,Anderson-DEN,81
2,RecYds,Thomas-DEN,67


In [66]:
weekly_scores_pfbr_page_table_index = {'top_defenders': 34, 'top_rushers': 33, 'top_rcvrs': 32, 'top_passers': 31}

In [None]:
nfl_reference_base_url = 'https://www.pro-football-reference.com/years/2018/week_{}.htm'

def nfl_week_url(week):
    # Fill the week number into the URL template
    return nfl_reference_base_url.format(week)

# URLs for weeks 1 through 20
_2018_weekly_NFL_scores = [nfl_week_url(wk) for wk in range(1, 21)]


for url in _2018_weekly_NFL_scores:
    print(url)

weekly_score_summary_html = []

import time

def strip_html_nwlns_cmnts(html_txt):
    rslt = html_txt.replace('\n', '').replace('<!--', '').replace('-->', '')
    return rslt

for url in _2018_weekly_NFL_scores:
    raw_fetch = requests.get(url).text
    fetched = strip_html_nwlns_cmnts(raw_fetch)
    weekly_score_summary_html.append(fetched)
    print(f'Fetched {url}')
    print('zzzzzzzz')
    time.sleep(3)
    print('woke_up')

#weekly_score_summary_html

import pickle

# Save the fetched pages so they don't have to be downloaded again
with open('/Users/bjg/Desktop/wkly_scre_list.pkl', 'wb') as output:
    pickle.dump(weekly_score_summary_html, output)


# To reload the list later:
#with open('/Users/bjg/Desktop/wkly_scre_list.pkl', 'rb') as pklfl:
#    lst = pickle.load(pklfl)