# A First Scraping Foray for Wikipedia’s table of Nobel Prize winners

We’ll start at the [main Wikipedia Nobel Prize page](https://en.wikipedia.org/wiki/List_of_Nobel_laureates). Scrolling down shows a table with all the laureates by year and
category, which is a good start to our minimal data requirements.

In [1]:
from bs4 import BeautifulSoup
import requests

In [None]:
"""
The install_cache method has a number of useful options, including
allowing you to specify the cache backend (sqlite, memory, mongodb,
or redis) or set an expiry time (expire_after) in seconds on the
cache. The commented-out call below creates a cache named nobel_pages
with an sqlite backend and pages that expire in two hours (7,200s).
"""
import requests_cache
requests_cache.install_cache()
# requests_cache.install_cache('nobel_pages', backend='sqlite', expire_after=7200)

In [3]:
BASE_URL = 'https://en.wikipedia.org'
# Wikipedia will reject our request unless we add
# a 'User-Agent' field to our HTTP headers.
HEADERS = {'User-Agent': 'Mozilla/5.0'}

In [4]:
def get_Nobel_soup():
    """ Return a parsed tag tree of our Nobel prize page """
    # Make a request to the Nobel page, setting valid headers
    response = requests.get(
        BASE_URL + '/wiki/List_of_Nobel_laureates',
        headers=HEADERS)
    # Return the content of the response parsed by BeautifulSoup
    # The second argument specifies the parser we want to use, namely lxml’s
    return BeautifulSoup(response.content, "lxml")  

# Selecting Tags

In [5]:
soup = get_Nobel_soup()

In [6]:
"""
We use BeautifulSoup’s find method to find the first table tag with
the classes 'wikitable sortable'. find takes a tag name as its first
argument and a dictionary of class, id, and other identifiers as its
second.
"""

## use find()

In [7]:
"""
VS Code truncates long cell output, so convert the result to a
string and inspect only its first 100 characters.
"""
str( soup.find('table', {'class':'wikitable sortable'}) )[0:100]

'<table class="wikitable sortable">\n<tbody><tr>\n<th>Year\n</th>\n<th width="18%"><a href="/wiki/List_of'

In [8]:
"""
Although we have successfully found our table by its classes, this
method is not very robust. Let’s see what happens when we change
the order of our CSS classes
"""
soup.find('table', {'class':'sortable wikitable'})
# Returns None: no table has exactly this class string

"""
So find is sensitive to the order of the classes: it matches the class
string as a whole. If the classes were listed in a different order
(something that might well happen during an HTML edit), the find
fails.
"""
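
The order sensitivity can be reproduced offline. The sketch below is a minimal illustration on a made-up two-table snippet (the `id` values and the `demo_soup` name are assumptions for the example, not part of the Wikipedia page), parsed with Python's built-in `html.parser` so it needs no network access.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the second table lists its classes in reverse order.
HTML = """
<table id="first" class="wikitable sortable"><tr><th>Year</th></tr></table>
<table id="second" class="sortable wikitable"><tr><th>Year</th></tr></table>
"""
demo_soup = BeautifulSoup(HTML, "html.parser")

# A multi-class string is compared with the whole class attribute,
# so each ordering finds only the table written that way.
first = demo_soup.find('table', {'class': 'wikitable sortable'})
second = demo_soup.find('table', {'class': 'sortable wikitable'})
print(first['id'], second['id'])

# A single class name matches any tag carrying it, whatever else is listed.
by_one_class = [t['id'] for t in demo_soup.find_all('table', {'class': 'sortable'})]
print(by_one_class)

# With only the first table present, the reversed string finds nothing.
only_first = BeautifulSoup('<table class="wikitable sortable"></table>', "html.parser")
print(only_first.find('table', {'class': 'sortable wikitable'}))
```

Searching on a single class name avoids the ordering problem but may match more tables than intended, which is one reason to prefer a CSS selector.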

## use CSS selectors

In [9]:
"""
Using the soup’s select method, you can specify an HTML element using
its CSS class, id, and so on. In current BeautifulSoup releases select
is implemented by the Soup Sieve package and works with any parser,
not only lxml.

This selection syntax should be familiar to anyone who’s used
JavaScript’s jQuery library and is also similar to that used by D3.
"""
table = soup.select('table.sortable.wikitable')

In [10]:
str(table)[0:1000]

'[<table class="wikitable sortable">\n<tbody><tr>\n<th>Year\n</th>\n<th width="18%"><a href="/wiki/List_of_Nobel_laureates_in_Physics" title="List of Nobel laureates in Physics">Physics</a>\n</th>\n<th width="16%"><a href="/wiki/List_of_Nobel_laureates_in_Chemistry" title="List of Nobel laureates in Chemistry">Chemistry</a>\n</th>\n<th width="18%"><a href="/wiki/List_of_Nobel_laureates_in_Physiology_or_Medicine" title="List of Nobel laureates in Physiology or Medicine">Physiology<br/>or Medicine</a>\n</th>\n<th width="16%"><a href="/wiki/List_of_Nobel_laureates_in_Literature" title="List of Nobel laureates in Literature">Literature</a>\n</th>\n<th width="16%"><a href="/wiki/List_of_Nobel_Peace_Prize_laureates" title="List of Nobel Peace Prize laureates">Peace</a>\n</th>\n<th width="15%"><a class="mw-redirect" href="/wiki/List_of_Nobel_laureates_in_Economics" title="List of Nobel laureates in Economics">Economics</a><br/>(The Sveriges Riksbank Prize)<sup class="reference" id="cite_ref-1

In [11]:
"""
BeautifulSoup provides the select_one convenience method
if you are selecting just one HTML element; it returns the
first match.
"""
table = soup.select_one('table.sortable.wikitable')
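
To see how `select` and `select_one` differ, here is a small offline sketch. The HTML snippet, the `id` values, and the `demo_soup` name are invented for illustration, and the built-in `html.parser` is used only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical page with one matching table and one unrelated table.
HTML = """
<table id="laureates" class="wikitable sortable"><tr><th>Year</th></tr></table>
<table id="other" class="plain"><tr><th>Name</th></tr></table>
"""
demo_soup = BeautifulSoup(HTML, "html.parser")

# Class selectors are order-independent: both patterns pick the same table.
a = demo_soup.select('table.sortable.wikitable')
b = demo_soup.select('table.wikitable.sortable')
print([t['id'] for t in a], [t['id'] for t in b])

# select returns a list of matches; select_one returns the first match or None.
one = demo_soup.select_one('table.wikitable')
none = demo_soup.select_one('table.missing')
print(one['id'], none)
```

Because `select_one` can return `None`, it is worth checking the result before using it when a page's markup may have changed.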

In [12]:
str(table)[0:1000]

'<table class="wikitable sortable">\n<tbody><tr>\n<th>Year\n</th>\n<th width="18%"><a href="/wiki/List_of_Nobel_laureates_in_Physics" title="List of Nobel laureates in Physics">Physics</a>\n</th>\n<th width="16%"><a href="/wiki/List_of_Nobel_laureates_in_Chemistry" title="List of Nobel laureates in Chemistry">Chemistry</a>\n</th>\n<th width="18%"><a href="/wiki/List_of_Nobel_laureates_in_Physiology_or_Medicine" title="List of Nobel laureates in Physiology or Medicine">Physiology<br/>or Medicine</a>\n</th>\n<th width="16%"><a href="/wiki/List_of_Nobel_laureates_in_Literature" title="List of Nobel laureates in Literature">Literature</a>\n</th>\n<th width="16%"><a href="/wiki/List_of_Nobel_Peace_Prize_laureates" title="List of Nobel Peace Prize laureates">Peace</a>\n</th>\n<th width="15%"><a class="mw-redirect" href="/wiki/List_of_Nobel_laureates_in_Economics" title="List of Nobel laureates in Economics">Economics</a><br/>(The Sveriges Riksbank Prize)<sup class="reference" id="cite_ref-13

In [13]:
"""
Calling a tag directly is shorthand for its find_all method. For a
plain tag name such as 'th', that returns the same elements as
select, so these two are equivalent
"""
table.select('th')
table('th')

[<th>Year
 </th>,
 <th width="18%"><a href="/wiki/List_of_Nobel_laureates_in_Physics" title="List of Nobel laureates in Physics">Physics</a>
 </th>,
 <th width="16%"><a href="/wiki/List_of_Nobel_laureates_in_Chemistry" title="List of Nobel laureates in Chemistry">Chemistry</a>
 </th>,
 <th width="18%"><a href="/wiki/List_of_Nobel_laureates_in_Physiology_or_Medicine" title="List of Nobel laureates in Physiology or Medicine">Physiology<br/>or Medicine</a>
 </th>,
 <th width="16%"><a href="/wiki/List_of_Nobel_laureates_in_Literature" title="List of Nobel laureates in Literature">Literature</a>
 </th>,
 <th width="16%"><a href="/wiki/List_of_Nobel_Peace_Prize_laureates" title="List of Nobel Peace Prize laureates">Peace</a>
 </th>,
 <th width="15%"><a class="mw-redirect" href="/wiki/List_of_Nobel_laureates_in_Economics" title="List of Nobel laureates in Economics">Economics</a><br/>(The Sveriges Riksbank Prize)<sup class="reference" id="cite_ref-13"><a href="#cite_note-13">[13]</a></sup>
 <
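
The equivalence can be checked on a small hand-written table. This is a sketch under the assumption that a plain tag name is being matched; the snippet and the `demo_table` name are illustrative only.

```python
from bs4 import BeautifulSoup

HTML = "<table><tr><th>Year</th><th>Physics</th><th>Chemistry</th></tr></table>"
demo_table = BeautifulSoup(HTML, "html.parser").table

called = demo_table('th')            # calling a tag delegates to find_all
found = demo_table.find_all('th')
selected = demo_table.select('th')   # CSS selector for the same tag name

print([th.text for th in called])
print(list(called) == list(found) == list(selected))
```

The two approaches diverge once the pattern is more than a tag name: `select` takes CSS selectors, while the call shorthand takes the same arguments as `find_all`.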

## Crafting Selection Patterns

In [14]:
def get_column_titles(table):
    """ Get the Nobel categories from the table header """
    cols = []
    for th in table.select_one('tr').select('th')[1:]:
        link = th.select_one('a')
        # Store the category name and any Wikipedia link it has
        if link:
            cols.append({'name':link.text, 'href':link.attrs['href']})
        else:
            cols.append({'name':th.text, 'href':None})
    return cols

In [15]:
get_column_titles(table)

[{'name': 'Physics', 'href': '/wiki/List_of_Nobel_laureates_in_Physics'},
 {'name': 'Chemistry', 'href': '/wiki/List_of_Nobel_laureates_in_Chemistry'},
 {'name': 'Physiologyor Medicine',
  'href': '/wiki/List_of_Nobel_laureates_in_Physiology_or_Medicine'},
 {'name': 'Literature', 'href': '/wiki/List_of_Nobel_laureates_in_Literature'},
 {'name': 'Peace', 'href': '/wiki/List_of_Nobel_Peace_Prize_laureates'},
 {'name': 'Economics', 'href': '/wiki/List_of_Nobel_laureates_in_Economics'}]
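
The name 'Physiologyor Medicine' in the output above comes from the `<br/>` inside the header link: `.text` concatenates the strings on either side with nothing between them. One possible remedy, sketched here on a hand-written header cell (the `/wiki/Example` link is a placeholder), is `get_text` with a separator.

```python
from bs4 import BeautifulSoup

# A header cell shaped like the Physiology or Medicine column title.
HTML = '<th><a href="/wiki/Example">Physiology<br/>or Medicine</a></th>'
link = BeautifulSoup(HTML, "html.parser").a

joined = link.text                         # strings joined with no separator
spaced = link.get_text(' ', strip=True)    # strings stripped, joined by a space
print(repr(joined))
print(repr(spaced))
```

Swapping `link.text` for `link.get_text(' ', strip=True)` in `get_column_titles` would be one way to get cleaner category names; whether that is worth doing depends on how the names are used downstream.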

In [16]:
def get_Nobel_winners(table):
    """ Return a list of winner dicts (year, category, name, link) """
    cols = get_column_titles(table)
    winners = []
    # Skip the header row and the final row, keeping only the year rows
    for row in table.select('tr')[1:-1]:
        # The first <td> holds the year
        year = int(row.select_one('td').text)
        # The remaining <td> cells hold the winners, one cell per category
        for i, td in enumerate(row.select('td')[1:]):
            for winner in td.select('a'):
                href = winner.attrs['href']
                # Ignore footnote links
                if not href.startswith('#endnote'):
                    winners.append({
                        'year': year,
                        'category': cols[i]['name'],
                        'name': winner.text,
                        'link': href
                    })
    return winners
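
The row-and-cell pattern is easier to follow on a tiny table. The sketch below uses invented laureate names and links, and a trailing header row that mirrors the `[1:-1]` slice above; it illustrates the same traversal rather than reproducing the notebook's function.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the laureates table (all names are made up).
HTML = """
<table>
<tr><th>Year</th><th><a href="/wiki/Physics">Physics</a></th><th><a href="/wiki/Peace">Peace</a></th></tr>
<tr><td>1901</td><td><a href="/wiki/A">A. Person</a></td><td><a href="/wiki/B">B. Person</a>; <a href="/wiki/C">C. Person</a></td></tr>
<tr><td>1902</td><td><a href="/wiki/D">D. Person</a><a href="#endnote_1">[1]</a></td><td>None awarded</td></tr>
<tr><th>Year</th><th>Physics</th><th>Peace</th></tr>
</table>
"""
demo_table = BeautifulSoup(HTML, "html.parser").table
rows = demo_table.select('tr')

# Category names come from the first row, skipping the Year column.
categories = [th.text for th in rows[0].select('th')[1:]]

records = []
for row in rows[1:-1]:                      # drop the header and footer rows
    cells = row.select('td')
    year = int(cells[0].text)               # the first cell is the year
    for category, td in zip(categories, cells[1:]):
        for a in td.select('a'):
            if not a['href'].startswith('#endnote'):   # skip footnote links
                records.append((year, category, a.text))

for record in records:
    print(record)
```

A cell with no links, such as the "None awarded" entry, simply contributes no records, and a cell with several links contributes one record per link.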

In [35]:
winners = get_Nobel_winners(table)[0:50]

## Example 5-3. Scraping the winner’s country from their biography page

In [37]:
def get_winner_nationality(w):
    """ Scrape the winner's nationality from their Wikipedia page """
    soup = get_url(BASE_URL + w['link'])
    person_data = {'name': w['name']}
    attr_rows = soup.select('table.infobox tr')
    for tr in attr_rows:
        try:
            attribute = tr.select_one('th').text
            if attribute == 'Nationality':
                person_data[attribute] = tr.select_one('td').text
        except AttributeError:
            # select_one returned None: the row lacks a <th> or <td>
            pass

    return person_data

In [38]:
def get_url(url):
    """ Fetch a URL and return its content parsed by BeautifulSoup """
    response = requests.get(url, headers=HEADERS)
    return BeautifulSoup(response.content, "lxml")
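
The infobox lookup can be tried without fetching a page. This sketch runs on a hand-written infobox with fictional values; `infobox_field` is a hypothetical helper that uses explicit `None` checks as an alternative to catching `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical infobox; the first row has no <th>, like an image or caption row.
HTML = """
<table class="infobox">
<tr><td colspan="2">Portrait caption</td></tr>
<tr><th>Born</th><td>1 January 1850</td></tr>
<tr><th>Nationality</th><td>Freedonian</td></tr>
</table>
"""

def infobox_field(soup, field):
    """Return the <td> text paired with the <th> labelled `field`, or None."""
    for tr in soup.select('table.infobox tr'):
        th, td = tr.select_one('th'), tr.select_one('td')
        # Rows missing either cell cannot hold a label/value pair.
        if th is None or td is None:
            continue
        if th.get_text(strip=True) == field:
            return td.get_text(' ', strip=True)
    return None

demo_soup = BeautifulSoup(HTML, "html.parser")
nationality = infobox_field(demo_soup, 'Nationality')
citizenship = infobox_field(demo_soup, 'Citizenship')
print(nationality, citizenship)
```

Stripping whitespace before comparing the label makes the match a little more tolerant than an exact comparison on `.text`.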

## Example 5-4. Testing for scraped nationalities

In [None]:
"""
Check which of the scraped winners are missing a nationality.
"""

In [39]:
wdata = []
# test first 50 winners
for w in winners[:50]:
    wdata.append(get_winner_nationality(w))
missing_nationality = []
for w in wdata:
    # if missing 'Nationality' add to list
    if not w.get('Nationality'):
        missing_nationality.append(w)
# output list
missing_nationality

[{'name': 'Élie Ducommun'},
 {'name': 'Charles Albert Gobat'},
 {'name': 'Pierre Curie'},
 {'name': 'Marie Curie'},
 {'name': 'Niels Ryberg Finsen'},
 {'name': 'Ivan Pavlov'},
 {'name': 'Institut de Droit International'},
 {'name': 'Philipp Lenard'},
 {'name': 'Bertha von Suttner'},
 {'name': 'Santiago Ramón y Cajal'},
 {'name': 'Theodore Roosevelt'},
 {'name': 'Ernesto Teodoro Moneta'},
 {'name': 'Louis Renault'},
 {'name': 'Ernest Rutherford'},
 {'name': 'Paul Ehrlich'},
 {'name': 'Rudolf Christoph Eucken'},
 {'name': 'Klas Pontus Arnoldson'}]
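
A list this long suggests that the 'Nationality' row is not a reliable target. To quantify the gap, a small helper along the following lines could be used; `split_by_field` and the sample records are hypothetical and stand in for the scraped `wdata`.

```python
def split_by_field(records, field='Nationality'):
    """Partition scraped records into (complete, missing) on the given field."""
    complete = [r for r in records if r.get(field)]
    missing = [r for r in records if not r.get(field)]
    return complete, missing

# Made-up records; an empty string counts as missing, as in the check above.
sample = [
    {'name': 'Winner A', 'Nationality': 'Freedonian'},
    {'name': 'Winner B'},
    {'name': 'Winner C', 'Nationality': ''},
]
complete, missing = split_by_field(sample)
share = len(missing) / len(sample)
print(f"{len(missing)} of {len(sample)} records lack a nationality ({share:.0%})")
```

Running the same split on `wdata` would give the missing share for the first 50 winners.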