![b4s](img/beautiful_soup.png)

## [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup)

## Benefits of *not* scraping
![options](img/other_options.png)

### Use case

![use](img/use_case.png)

### Goal

![python](img/how_works.png)

#### Discuss
What's a website you'd like  to scrape?

### Scenario

I want to analyze the top song award of the Grammies to see if I can find any patterns in country of origin, singer, song content, etc. 

But where do I start finding that data? Not from an API.

Well, we can start [here](https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year)

### This is our target
![target](img/target.png)

### Learning goals:

- scrape a basic wikipedia website using beautiful soup
- transform the html table we want to a pandas `DataFrame`
- scrape a more complex wikipedia
- transform the wanted scraped data into a pandas `DataFrame`
- if time, go hunt a wild website and scrape it

## Basic wikipedia

![vheck](img/basic.gif)

Task: Get one column from a table on wikipedia

Let's get those libraries we want

In [17]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Use the `url` inside of a `requests.get` and assign it to `website_url`

First, a wikipedia article where we only want to get one column of information - countries!

https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area

In [18]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text


Start to use the BeautifulSoup functions to create a BeautifulSoup object

In [19]:
soup = BeautifulSoup(website_url,'lxml')
# print(soup.prettify())

Find the class of interest

In [20]:
table = soup.find('table',{'class':'wikitable sortable'})

Keep looking at the html to see if you can find any commonalities in what you want to scrape....

All the country names are links! We can use the `a` tag!

In [21]:
links = table.find_all('a')

We can now iterate over links to process it and create a list of text

In [22]:
Countries = []
for link in links:
    Countries.append(link.get('title'))
    
print(Countries)

['Russia', None, 'China', 'Hong Kong', 'Macau', 'India', None, 'Kazakhstan', 'Saudi Arabia', 'Iran', 'Mongolia', 'Indonesia', 'Pakistan', 'Gilgit-Baltistan', 'Azad Kashmir', 'Turkey', 'Myanmar', 'Afghanistan', 'Yemen', 'Thailand', 'Turkmenistan', 'Uzbekistan', 'Iraq', 'Japan', 'Vietnam', 'Malaysia', 'Oman', 'Philippines', 'Laos', 'Kyrgyzstan', 'Syria', 'Golan Heights', 'Cambodia', 'Bangladesh', 'Nepal', 'Tajikistan', 'North Korea', 'South Korea', 'Jordan', 'Azerbaijan', 'United Arab Emirates', 'Georgia (country)', 'Sri Lanka', 'Egypt', 'Bhutan', 'Taiwan', 'Armenia', 'Israel', 'Kuwait', 'East Timor', 'Qatar', 'Lebanon', 'Cyprus', 'Northern Cyprus', 'State of Palestine', 'Brunei', 'Bahrain', 'Singapore', 'Maldives']


Now, let's convert that list to a data frame

In [23]:
df = pd.DataFrame()
df['Country'] = Countries

In [24]:
df.head()

Unnamed: 0,Country
0,Russia
1,
2,China
3,Hong Kong
4,Macau


## Less Basic - Get a whole table
Let's go inspect the webiste to find the right tag/heading/etc for the table we want

What are the important tags here?<br>
What class is the important one?

`table`<br>
`wikitable sortable`

**Task**<br>
Work with a partner to comment the following code and figure out what it does

In [30]:
response = requests.get('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year').text
#get website url into response

soup = BeautifulSoup(response,'lxml')
#print(soup.prettify())

tab = soup.find("table",{"class":"wikitable sortable"})
# pd.read_html(tab.prettify())

rows = tab.find_all('tr')
# get rows using tr tag 

data = []
for row in rows:
    data.append([x.get_text().strip() for x in row.find_all(['th','td'])])
# append in list data all the rows in the table within th and td tags and strip whitespace

df = pd.DataFrame(data)
# put the data into a pandas dataframe

new_header = df.iloc[0]
# new_header is assigned first row which are the column labels 

df = df[1:]
#fill our dataframe and skip first row since it will be my new_header

df.columns = new_header
# the dataframe columns is assigned the new_header

In [34]:
df.head()

Unnamed: 0,Year[I],Winner(s),Nationality,Work,Performing artist(s)[II],Nominees,Ref.
1,1959,Domenico Modugno,Italy,"""Volare"" *",Domenico Modugno,"Paul Vance & Lee Pockriss for ""Catch a Falling...",[10]
2,1960,Jimmy Driftwood,United States,"""The Battle of New Orleans""",Johnny Horton,"Sammy Cahn & Jimmy Van Heusen for ""High Hopes""...",[11]
3,1961,Ernest Gold,United States Austria,"""Theme of Exodus""",Instrumental (Various Artists),"Charles Randolph Grean, Joe Allison & Audrey A...",[12]
4,1962,Henry ManciniJohnny Mercer,United States,"""Moon River"" *",Henry Mancini,"Jimmy Dean for ""Big Bad John"" performed by Jim...",[13]
5,1963,Leslie BricusseAnthony Newley,United Kingdom,"""What Kind of Fool Am I?""",Sammy Davis Jr.,"Lionel Bart for ""As Long as He Needs Me"" perfo...",[14]


### But this is hard. Is there an easier way to do this?

Another way, if you **know** there is a `table` in the `html` somewhere

In [35]:
grammies = pd.read_html('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year')

`grammies` returns a `list` of `DataFrames`<br>
We still need to find the _correct_ one

In [39]:
len(grammies)

5

In [40]:
grammies[0]

Unnamed: 0,0,1
0,Grammy Award for Song of the Year,
1,Awarded for,Quality song containing both lyrics and melody
2,Country,United States
3,Presented by,National Academy of Recording Arts and Sciences
4,First awarded,1959
5,Currently held by,"Donald Glover, Ludwig Göransson & Jeffery Lama..."
6,Website,grammy.com


In [41]:
grammies[1]

Unnamed: 0,0,1,2,3,4,5,6
0,Year[I],Winner(s),Nationality,Work,Performing artist(s)[II],Nominees,Ref.
1,1959,Domenico Modugno,Italy,"""Volare"" *",Domenico Modugno,"Paul Vance & Lee Pockriss for ""Catch a Falling...",[10]
2,1960,Jimmy Driftwood,United States,"""The Battle of New Orleans""",Johnny Horton,"Sammy Cahn & Jimmy Van Heusen for ""High Hopes""...",[11]
3,1961,Ernest Gold,United States Austria,"""Theme of Exodus""",Instrumental (Various Artists),"Charles Randolph Grean, Joe Allison & Audrey A...",[12]
4,1962,Henry ManciniJohnny Mercer,United States,"""Moon River"" *",Henry Mancini,"Jimmy Dean for ""Big Bad John"" performed by Jim...",[13]
5,1963,Leslie BricusseAnthony Newley,United Kingdom,"""What Kind of Fool Am I?""",Sammy Davis Jr.,"Lionel Bart for ""As Long as He Needs Me"" perfo...",[14]
6,1964,Henry ManciniJohnny Mercer,United States,"""Days of Wine and Roses"" *",Henry Mancini,"Sammy Cahn & Jimmy Van Heusen for ""Call Me Irr...",[15]
7,1965,Jerry Herman,United States,"""Hello, Dolly!""",Louis Armstrong,"John Lennon & Paul McCartney for ""A Hard Day's...",[16]
8,1966,Paul Francis WebsterJohnny Mandel,United States,"""The Shadow of Your Smile""",Tony Bennett,"Michel Legrand, Norman Gimbel & Jacques Demy f...",
9,1967,John LennonPaul McCartney,United Kingdom,"""Michelle""",The Beatles,"John Barry & Don Black for ""Born Free"" perform...",


In [42]:
grammies[2]

Unnamed: 0,0,1
0,vteGrammy Award for Song of the Year,
1,1959−1980,"""Volare"" – Domenico Modugno (songwriter) (1959..."
2,1981−2000,"""Sailing"" – Christopher Cross (songwriter) (19..."
3,2001−present,"""Beautiful Day"" – Adam Clayton, David Evans, L..."


In [43]:
grammies[3]

Unnamed: 0,0,1
0,vteGrammy Award,
1,Categories Grammy Nominees Records Locations EGOT,
2,Special awards,Legend Award Lifetime Achievement Award Truste...
3,Ceremony year,1959 May Nov 1961 1962 1963 1964 1965 1966 196...
4,Related,Grammy Museum
5,By Country,American Argentine Australian Austrian Brazili...
6,Grammy Award Record of the Year Song of the Ye...,


In [44]:
grammies[4]

Unnamed: 0,0,1
0,vteGrammy Award categories,
1,General,Record of the Year Album of the Year Song of t...
2,Pop,Best Pop Solo Performance Best Pop Duo/Group P...
3,Dance/Electronic,Best Dance Recording Best Dance/Electronic Album
4,Rock,Best Rock Performance Best Metal Performance B...
5,Alternative,Best Alternative Music Album
6,R&B,Best R&B Performance Best Traditional R&B Perf...
7,Rap/Hip-Hop,Best Rap Performance Best Rap/Sung Performance...
8,Country,Best Country Solo Performance Best Country Duo...
9,Jazz,Best Improvised Jazz Solo Best Jazz Vocal Albu...


Another way with the same concept....

In [76]:

response = requests.get('https://en.wikipedia.org/wiki/List_of_American_Grammy_Award_winners_and_nominees')
soup = BeautifulSoup(response.text)

tab = soup.find("table",{"class":"wikitable sortable"})
df = pd.read_html(tab.prettify())



 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))


In [50]:
df

[                                                     0     1            2
 0                                              Nominee  Wins  Nominations
 1                                    Quincy Jones  [1]    28           80
 2                                   Alison Krauss  [2]    27           42
 3                                   Stevie Wonder  [3]    25           74
 4                               Vladimir Horowitz  [4]    25           45
 5                                   John Williams  [5]    24           69
 6                                         Beyoncé  [6]    23           66
 7                                           Jay-Z  [7]    22           77
 8                                     Chick Corea  [8]    22           64
 9                                      Kanye West  [9]    21           69
 10                                    Vince Gill  [10]    21           45
 11                                 Henry Mancini  [11]    20           72
 12                      

## Now find a free-range website

get in groups of four and try to scrape a website into a pandas df

In [140]:
another_website_url = requests.get('https://www.pro-football-reference.com/teams/nwe/2018.htm').text


In [141]:
another_soup = BeautifulSoup(another_website_url)



 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))


In [142]:
# print(another_soup.prettify())

In [143]:
another_table = another_soup.find_all('table',{'class':'sortable'})

In [144]:
data_row = another_table[1].find_all('tr')

In [146]:
# tab = soup.find("table",{"class":"wikitable sortable"})
# pd.read_html(tab.prettify())

# rows = tab.find_all('tr')
# get rows using tr tag 

data = []
for row in data_row:
    data.append([x.get_text().strip() for x in row.find_all(['th','td'])])
# append in list data all the rows in the table within th and td tags and strip whitespace

df = pd.DataFrame(data)
# put the data into a pandas dataframe

In [147]:
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,,Score,Offense,Defense,Expected Points,,,,,,...,,,,,,,,,,
1,Week,Day,Date,,,,OT,Rec,,Opp,...,RushY,TO,1stD,TotYd,PassY,RushY,TO,Offense,Defense,Sp. Tms
2,1,Sun,September 9,1:00PM ET,boxscore,W,,1-0,,Houston Texans,...,122,3,21,325,158,167,2,4.29,8.05,-6.85
3,2,Sun,September 16,4:25PM ET,boxscore,L,,1-1,@,Jacksonville Jaguars,...,82,1,27,480,376,104,2,2.13,-18.76,4.44
4,3,Sun,September 23,8:20PM ET,boxscore,L,,1-2,@,Detroit Lions,...,89,1,25,414,255,159,1,-5.39,-13.69,2.81
5,4,Sun,September 30,1:00PM ET,boxscore,W,,2-2,,Miami Dolphins,...,175,2,11,172,116,56,2,21.22,12.47,-5.93
6,5,Thu,October 4,8:20PM ET,boxscore,W,,3-2,,Indianapolis Colts,...,97,2,26,439,355,84,3,13.90,-2.15,2.13
7,6,Sun,October 14,8:20PM ET,boxscore,W,,4-2,,Kansas City Chiefs,...,173,1,18,446,352,94,2,21.54,-10.83,-8.17
8,7,Sun,October 21,1:00PM ET,boxscore,W,,5-2,@,Chicago Bears,...,108,3,29,453,319,134,2,9.37,-16.89,5.95
9,8,Mon,October 29,8:15PM ET,boxscore,W,,6-2,@,Buffalo Bills,...,76,,16,333,287,46,2,3.65,20.98,-6.85


In [153]:
new_header = df.iloc[0]
# new_header is assigned first row which are the column labels 

df = df[1:]
#fill our dataframe and skip first row since it will be my new_header

df.columns = new_header
# the dataframe columns is assigned the new_header

In [152]:
df

2,1,Sun,September 9,1:00PM ET,boxscore,W,Unnamed: 7,1-0,Unnamed: 9,Houston Texans,...,122,3,21,325,158,167,2.1,4.29,8.05,-6.85
4,3,Sun,September 23,8:20PM ET,boxscore,L,,1-2,@,Detroit Lions,...,89.0,1.0,25.0,414.0,255.0,159.0,1.0,-5.39,-13.69,2.81
5,4,Sun,September 30,1:00PM ET,boxscore,W,,2-2,,Miami Dolphins,...,175.0,2.0,11.0,172.0,116.0,56.0,2.0,21.22,12.47,-5.93
6,5,Thu,October 4,8:20PM ET,boxscore,W,,3-2,,Indianapolis Colts,...,97.0,2.0,26.0,439.0,355.0,84.0,3.0,13.9,-2.15,2.13
7,6,Sun,October 14,8:20PM ET,boxscore,W,,4-2,,Kansas City Chiefs,...,173.0,1.0,18.0,446.0,352.0,94.0,2.0,21.54,-10.83,-8.17
8,7,Sun,October 21,1:00PM ET,boxscore,W,,5-2,@,Chicago Bears,...,108.0,3.0,29.0,453.0,319.0,134.0,2.0,9.37,-16.89,5.95
9,8,Mon,October 29,8:15PM ET,boxscore,W,,6-2,@,Buffalo Bills,...,76.0,,16.0,333.0,287.0,46.0,2.0,3.65,20.98,-6.85
10,9,Sun,November 4,8:20PM ET,boxscore,W,,7-2,,Green Bay Packers,...,123.0,,22.0,367.0,250.0,117.0,1.0,17.14,-4.92,3.52
11,10,Sun,November 11,1:00PM ET,boxscore,L,,7-3,@,Tennessee Titans,...,40.0,,23.0,385.0,235.0,150.0,,-5.76,-14.31,-2.47
12,11,,,,,,,,,Bye Week,...,,,,,,,,,,
13,12,Sun,November 25,1:00PM ET,boxscore,W,,8-3,@,New York Jets,...,215.0,,18.0,338.0,264.0,74.0,1.0,20.12,-2.42,-4.96
