# Web Scraping

**Examples**
1. Nobel Prizes - https://github.com/SBalas/2019-CS109A/blob/master/content/sections/section1/solutions/section_1_solutions.ipynb
2. Bourbon - https://harvard-iacs.github.io/2020-CS109A/lectures/lecture27/notebook/
3. Article - https://medium.com/@oguss/web-scraping-a-beginners-tips-on-how-to-inspect-websites-using-google-chrome-and-extract-required-768d24661ca0


In [279]:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [34]:
url = 'https://www.nytimes.com/'

# Get the webpage
homepage = requests.get(url)

# check the status code of the response: 200 means success, 404 means page not found
homepage.status_code

200
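
The meaning of a status code can be looked up with the standard library's `http.HTTPStatus`. The sketch below is only illustrative: `describe_status` is a hypothetical helper, not part of `requests`, and it makes no network calls.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '404 Not Found' for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know
        return f"{code} (unknown status code)"
    return f"{status.value} {status.phrase}"

for code in (200, 400, 404):
    print(describe_status(code))
```

This prints `200 OK`, `400 Bad Request`, and `404 Not Found`, which makes the difference between 400 and 404 explicit.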

In [35]:
# use Beautiful Soup to parse the page contents
# (homepage.text is the decoded string, homepage.content the raw bytes; either can be parsed)
soup = BeautifulSoup(homepage.content, 'html.parser')

# print raw data
#soup

In [38]:
soup.title  # print raw title

<title data-rh="true">The New York Times - Breaking News, US News, World News and Videos</title>

In [39]:
soup.title.string  # extract the title text

'The New York Times - Breaking News, US News, World News and Videos'

In [40]:
#soup

In [47]:
soup.p

<p class="css-5n6ag3">Steve Bannon, who was a top aide to former President Trump, had refused to comply with subpoenas investigating the Jan. 6 attack on Congress.</p>

In [53]:
soup.find_all('a') # all <a> tags (findAll is the older alias of find_all)
soup.find('a')     # first <a> tag

<a class="css-1f8er69" href="#site-content">Skip to content</a>

In [58]:
# get all the links in the page
link_list = [l.get('href') for l in soup.find_all('a')]
link_list

['#site-content',
 '#site-index',
 '/',
 '/',
 '/international/?action=click&region=Editions&pgtype=Homepage',
 '/ca/?action=click&region=Editions&pgtype=Homepage',
 'https://www.nytimes.com/es/',
 'https://cn.nytimes.com',
 'https://www.nytimes.com/section/todayspaper',
 'https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi&redirect_uri=https%3A%2F%2Fwww.nytimes.com%2Fsubscription%2Fmultiproduct%2Flp8KQUS.html%3FcampaignId%3D7JFJX&asset=masthead',
 '/',
 '/',
 '/section/world',
 '/section/us',
 '/section/politics',
 '/section/nyregion',
 '/section/business',
 '/section/opinion',
 '/section/technology',
 '/section/science',
 '/section/health',
 '/section/sports',
 '/section/arts',
 '/section/books',
 '/section/style',
 '/section/food',
 '/section/travel',
 '/section/magazine',
 '/section/t-magazine',
 '/section/realestate',
 '/video',
 'https://www.nytimes.com/live/2021/11/13/climate/cop26-glasgow-climate-summit',
 'https://www.nytimes.com/live/2021/11/13/climate/

In [60]:
for link in soup.find_all('a'):
    print(link.get('href'))

#site-content
#site-index
/
/
/international/?action=click&region=Editions&pgtype=Homepage
/ca/?action=click&region=Editions&pgtype=Homepage
https://www.nytimes.com/es/
https://cn.nytimes.com
https://www.nytimes.com/section/todayspaper
https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi&redirect_uri=https%3A%2F%2Fwww.nytimes.com%2Fsubscription%2Fmultiproduct%2Flp8KQUS.html%3FcampaignId%3D7JFJX&asset=masthead
/
/
/section/world
/section/us
/section/politics
/section/nyregion
/section/business
/section/opinion
/section/technology
/section/science
/section/health
/section/sports
/section/arts
/section/books
/section/style
/section/food
/section/travel
/section/magazine
/section/t-magazine
/section/realestate
/video
https://www.nytimes.com/live/2021/11/13/climate/cop26-glasgow-climate-summit
https://www.nytimes.com/live/2021/11/13/climate/cop26-glasgow-climate-summit
https://www.nytimes.com/live/2021/11/13/climate/cop26-glasgow-climate-summit
#after-dfp-ad-mid1
https
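
The list above mixes same-page anchors (`#site-content`), site-relative paths (`/section/world`), and absolute URLs. A minimal sketch of normalising such a list with the standard library's `urllib.parse.urljoin` follows; `absolute_links` and `BASE_URL` are hypothetical names introduced here, and the sample list is a hand-picked subset of the hrefs shown above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nytimes.com/"  # the page the links were scraped from

def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url, skipping missing values and same-page anchors."""
    seen = set()
    result = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # None (tag without href) or a fragment such as '#site-content'
        url = urljoin(base_url, href)
        if url not in seen:  # drop duplicates but keep the original order
            seen.add(url)
            result.append(url)
    return result

sample = ["#site-content", "/", "/", "/section/world", "https://cn.nytimes.com", None]
print(absolute_links(sample))
```

Relative paths are resolved against the base URL, while links that are already absolute pass through unchanged.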

In [63]:
# print all the text
# print(soup.get_text())


## Example with Nobel prize archive

Use the [2018 version of the Nobel website](http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/)
to extract the list of Nobel recipients by year and category.

**HTML tags**

**\<h3> : header 3 tag** A level-3 heading (\<h1> is the largest). It contains the title and year of the Nobel prize, which we will parse out. <br>
**\<h6> : header 6 tag** A smaller heading than \<h3>; it contains the prize recipients. <br>
**\<p> : paragraph tag** Used for text; here it contains the prize motivation. <br>
**\<div class="by_year">** "Content Division element ( \<div> ) is the generic container for flow content." What we care about here is the class attribute, which we will use with Beautiful Soup to quickly pull out the information we want. A class attribute can be attached to any tag.
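
To make the roles of these tags concrete, here is a small sketch that parses a hand-written fragment shaped like one award entry. The fragment is an assumption modelled on the page structure described above (the `href="#"` values are placeholders), and the variable is named `demo` so that it does not overwrite `soup`.

```python
from bs4 import BeautifulSoup

# hand-written fragment imitating one award entry; hrefs are placeholders
html = """
<div class="by_year">
  <h3><a href="#">The Nobel Prize in Physics 2017</a></h3>
  <h6><a href="#">Rainer Weiss</a>, <a href="#">Barry C. Barish</a> and <a href="#">Kip S. Thorne</a></h6>
  <p>for decisive contributions to the LIGO detector and the observation of gravitational waves</p>
</div>
"""

demo = BeautifulSoup(html, "html.parser")
node = demo.select_one("div.by_year")          # select the container by its class

print(node.h3.text)                            # prize title and year
print([a.text for a in node.select("h6 a")])   # one entry per recipient link
print(node.p.text)                             # prize motivation
```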


In [66]:
url = 'http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/'
snapshot = requests.get(url)
snapshot

<Response [200]>

In [77]:
# get all the text and print the first 500 chars
soup = BeautifulSoup(snapshot.text, 'html.parser')
all_text = soup.prettify()
all_text[:500]

'<!DOCTYPE html>\n<html class="no-js" lang="en-US" prefix="og: http://ogp.me/ns#">\n <head>\n  <script src="//archive.org/includes/analytics.js?v=cf34f82" type="text/javascript">\n  </script>\n  <script type="text/javascript">\n   window.addEventListener(\'DOMContentLoaded\',function(){var v=archive_analytics.values;v.service=\'wb\';v.server_name=\'wwwb-app227.us.archive.org\';v.server_ms=947;archive_analytics.send_pageview({});});\n  </script>\n  <script charset="utf-8" src="/_static/js/bundle-playback.js?v=U'

In [80]:
# get title
soup.title.string

'All Nobel Prizes'

In [266]:
# returns the first five <a> tags found inside <h3> titles
soup.select("h3 a")[0:5]

[<a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/2018/">The Nobel Prize in Physics 2018</a>,
 <a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/chemistry/laureates/2018/">The Nobel Prize in Chemistry 2018</a>,
 <a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/medicine/laureates/2018/">The Nobel Prize in Physiology or Medicine 2018</a>,
 <a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/literature/laureates/2018/">The Nobel Prize in Literature 2018</a>,
 <a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/peace/laureates/2018/">The Nobel Peace Prize 2018</a>]

### Extract award data
Each award is in a ```div``` with a ```by_year``` class

In [87]:
award_nodes = soup.select('.by_year')
len(award_nodes)

640

In [113]:
# This lists all the awardees for Physics in 2017 and the description
award_node = award_nodes[6]
award_node

<div class="by_year">
<h3><a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/2017/">The Nobel Prize in Physics 2017</a></h3>
<h6><a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/2017/weiss-facts.html">Rainer Weiss</a>, <a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/2017/barish-facts.html">Barry C. Barish</a> and <a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/2017/thorne-facts.html">Kip S. Thorne</a></h6>
<p>“for decisive contributions to the LIGO detector and the observation of gravitational waves”</p>
</div>

In [122]:
award_node.select("h3")    # list of matching <h3> tags (length 1 here)
award_node.select("h3")[0] # first element of that list
award_node.select("h3")[0].text   # the text inside the tag

'The Nobel Prize in Physics 2017'

In [133]:
# strip the year from the title
award_node.select("h3")[0].text[:-5]

'The Nobel Prize in Physics'
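
Slicing with `[:-5]` assumes that every heading ends in a space followed by a four-digit year. As an alternative sketch (not the approach this notebook uses), a regular expression can split the heading and report when no year is present; `split_title_year` is a hypothetical helper name.

```python
import re

def split_title_year(heading):
    """Split 'Some Prize 2017' into ('Some Prize', 2017); year is None if absent."""
    match = re.fullmatch(r"\s*(.+?)\s+(\d{4})\s*", heading)
    if match is None:
        return heading.strip(), None
    return match.group(1), int(match.group(2))

print(split_title_year("The Nobel Prize in Physics 2017"))
print(split_title_year("A heading without a year"))
```

Unlike plain slicing, this returns the year as an integer and does not silently cut off the end of a heading that lacks a year.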

### Functions to extract title, year, reason, and recipients

In [455]:
def get_award_title(node):
    return node.select("h3")[0].text[:-5]

def get_award_year(node):
    return node.select("h3")[0].text[-4:]

def get_award_motivation(node):
    '''Return the motivation text, or None if the node has no <p> tag.'''
    if not node.select('p') :
        return None
    return node.select('p')[0].text

def get_recipients(node):
    '''Return the list of recipient names (empty if there are none).'''
    return [i.text for i in node.select('h6 a')]

### Get award title and year

In [141]:
print(get_award_title(award_node))
print(get_award_year(award_node))

The Nobel Prize in Physics
2017


In [142]:
# extract all the awards
list_awards = []
for award_node in award_nodes:
    list_awards.append(get_award_title(award_node))

In [262]:
# the same as a one-line list comprehension
list_awards = [get_award_title(award_node) for award_node in award_nodes]
list_awards[1:10]

['The Nobel Prize in Chemistry',
 'The Nobel Prize in Physiology or Medicine',
 'The Nobel Prize in Literature',
 'The Nobel Peace Prize',
 'The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel',
 'The Nobel Prize in Physics',
 'The Nobel Prize in Chemistry',
 'The Nobel Prize in Physiology or Medicine',
 'The Nobel Prize in Literature']

### Get the recipients
- For each `by_year` division (one prize in one year), **h3** gives the name of the prize, **h6** the list of recipients, and **p** a brief explanation of the work
- Note the differences between the methods used below

In [380]:
award_nodes[6].select('h6 a')

[<a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/2017/weiss-facts.html">Rainer Weiss</a>,
 <a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/2017/barish-facts.html">Barry C. Barish</a>,
 <a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/2017/thorne-facts.html">Kip S. Thorne</a>]

In [381]:
award_nodes[6].select('h6')

[<h6><a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/2017/weiss-facts.html">Rainer Weiss</a>, <a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/2017/barish-facts.html">Barry C. Barish</a> and <a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/2017/thorne-facts.html">Kip S. Thorne</a></h6>]

In [386]:
award_nodes[6].select('h6')[0].text   # gets all the names

'Rainer Weiss, Barry C. Barish and Kip S. Thorne'

In [440]:
award_nodes[6].select('h6 a')[0].text  # gets only the first name

'Rainer Weiss'

In [450]:
for i in award_nodes[6].select('h6 a'):
    print(i.text)

Rainer Weiss
Barry C. Barish
Kip S. Thorne


In [456]:
get_recipients(award_nodes[6])

['Rainer Weiss', 'Barry C. Barish', 'Kip S. Thorne']

### Get Links

In [170]:
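# get the relevant links
len(award_nodes)  # 640
# note: 'h6, a' matches every <h6> OR <a> tag, so the <h6> tags (which have no href) give None;
# use 'h6 a' to get only the recipient links inside <h6>
recipient_links = [l.get("href") for l in award_nodes[639].select('h6, a')]
recipient_links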

['http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/peace/laureates/1901/',
 None,
 'http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/peace/laureates/1901/dunant-facts.html',
 None,
 'http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/peace/laureates/1901/passy-facts.html']
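
The `None` entries in this output come from the selector itself: with a comma, `'h6, a'` is a selector list that matches `<h6>` tags or `<a>` tags, whereas `'h6 a'` (with a space) matches only `<a>` tags nested inside an `<h6>`. The sketch below shows the difference on a hand-written fragment; the HTML and its `href` values are made up for illustration.

```python
from bs4 import BeautifulSoup

# made-up fragment with one title link and two recipient links
html = """
<div class="by_year">
  <h3><a href="/prize">Prize title</a></h3>
  <h6><a href="/first">First Laureate</a></h6>
  <h6><a href="/second">Second Laureate</a></h6>
</div>
"""
node = BeautifulSoup(html, "html.parser").select_one(".by_year")

# comma: every <h6> and every <a>, in document order; <h6> has no href
either = [el.get("href") for el in node.select("h6, a")]

# space: only <a> tags that are descendants of an <h6>
nested = [el.get("href") for el in node.select("h6 a")]

print(either)
print(nested)
```

With the descendant form, the title link in `<h3>` and the `None` values both disappear, leaving only the recipient links.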

### Get the reason for the prize

In [182]:
award_nodes[200].select('p')             # list of matching <p> tags
award_nodes[200].select('p')[0]          # first element
award_nodes[200].select('p')[0].text     # just the text

'“for their discoveries concerning the regulation of cholesterol metabolism”'

In [231]:
# function to get the reason
def get_award_motivation(node):
    if not node.select('p') :
        return None
    return node.select('p')[0].text

In [296]:
list_motivation = [get_award_motivation(node) for node in award_nodes]
list_motivation[10:15]

['“for its work to draw attention to the catastrophic humanitarian consequences of any use of nuclear weapons and for its ground-breaking efforts to achieve a treaty-based prohibition of such weapons”',
 '“for his contributions to behavioural economics”',
 '“for theoretical discoveries of topological phase transitions and topological phases of matter”',
 '“for the design and synthesis of molecular machines”',
 '“for his discoveries of mechanisms for autophagy”']

## Create Pandas dataframe to store the extracted info

In [457]:
# create a list of dictionaries, one per award

awards = []

for node in soup.select(".by_year"):
    # dictionary of attributes and values for this award
    # (named `award` so that the built-in `dict` is not shadowed)
    award = {}

    recipients = get_recipients(node)

    award['year'] = get_award_year(node)
    award['title'] = get_award_title(node)
    award['recipients'] = recipients
    award['num_recipients'] = len(recipients)
    award['motivation'] = get_award_motivation(node)

    awards.append(award)


In [485]:
# Create a dataframe from the list of dictionaries
df_awards = pd.DataFrame(awards)
df_awards.head()

| | year | title | recipients | num_recipients | motivation |
|---|---|---|---|---|---|
| 0 | 2018 | The Nobel Prize in Physics | [] | 0 | The 2018 Nobel Prize in Physics has not been a... |
| 1 | 2018 | The Nobel Prize in Chemistry | [] | 0 | The 2018 Nobel Prize in Chemistry has not been... |
| 2 | 2018 | The Nobel Prize in Physiology or Medicine | [] | 0 | The 2018 Nobel Prize in Physiology or Medicine... |
| 3 | 2018 | The Nobel Prize in Literature | [] | 0 | The 2018 Nobel Prize in Literature has been po... |
| 4 | 2018 | The Nobel Peace Prize | [] | 0 | The 2018 Nobel Peace Prize has not been awarde... |


In [462]:
df_awards.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 640 entries, 0 to 639
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   year            640 non-null    object
 1   title           640 non-null    object
 2   recipients      640 non-null    object
 3   num_recipients  640 non-null    int64 
 4   motivation      571 non-null    object
dtypes: int64(1), object(4)
memory usage: 25.1+ KB


In [467]:
# number of awards with no motivation text
df_awards.motivation.isnull().sum()

69

In [469]:
# number of awards with 0, 1, 2, or 3 recipients
df_awards.num_recipients.value_counts()

1    347
2    138
3    100
0     55
Name: num_recipients, dtype: int64
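
Two follow-up steps are often useful with a frame like this: converting `year` from strings to integers (it is stored as `object` above) and reshaping to one row per recipient. The sketch below uses a two-row sample taken from the values shown earlier rather than the full `df_awards`; `sample_awards`, `df`, and `per_recipient` are names introduced here for illustration.

```python
import pandas as pd

# two sample rows using values that appear earlier in this notebook
sample_awards = [
    {"year": "2017", "title": "The Nobel Prize in Physics",
     "recipients": ["Rainer Weiss", "Barry C. Barish", "Kip S. Thorne"]},
    {"year": "2018", "title": "The Nobel Prize in Physics", "recipients": []},
]

df = pd.DataFrame(sample_awards)
df["year"] = df["year"].astype(int)      # '2017' -> 2017

# explode gives one row per list element; an empty list becomes a single NaN row,
# which dropna then removes
per_recipient = df.explode("recipients").dropna(subset=["recipients"])

print(per_recipient[["year", "recipients"]])
```

The resulting frame has one row for each of the three 2017 laureates and none for the not-yet-awarded 2018 prize.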

## Example with Bourbon reviews

[Distiller.com](https://distiller.com/) is a site with reviews of various drinks. Searching the site for "bourbon" gives this URL (https://distiller.com/search?term=bourbon), which we then scrape for information.

In [490]:
awards[5]

{'year': '2018',
 'title': 'The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel',
 'recipients': [],
 'num_recipients': 0,
 'motivation': 'The 2018 Prize in Economic Sciences has not been awarded yet. It will be announced on Monday 8 October, 11:45 a.m. at the earliest.'}