# Extracting Useful Information
*Curtis Miller*

Now that we can download a webpage and read its data, we can start turning messy web data into clean data ready for analysis. The list we created in the last video lets us visit each laureate's Wikipedia page, find and extract the data we need, and move on to the next page. (At this point you should be mindful of how fast your script runs. Here, being fast is **bad**, not good; use timers to slow your script down.)

Let's first resume where we left off in the last video.

In [None]:
from bs4 import BeautifulSoup
import requests
from datetime import datetime
import pandas as pd
from pandas import DataFrame
from time import sleep

%matplotlib inline

In [None]:
session = requests.Session()
header = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
          "Accept-Language": "en-US,en;q=0.5",
          "Connection": "keep-alive",
          "Referer": "https://www.google.com/",    # The HTTP header name is spelled "Referer"
          "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:54.0) Gecko/20100101 Firefox/54.0"}

# The URL we are visiting
url = "https://en.wikipedia.org/wiki/List_of_Nobel_laureates"
page = session.get(url, headers=header).text
nobelList = BeautifulSoup(page, "html.parser")    # Name the parser explicitly to avoid a warning

nobelListTable = nobelList.find("table", {"class": ["wikitable", "sortable"]})

links = dict()    # Will contain names and links
for node in nobelListTable.find_all("td"):
    # Keep only cells whose first link points to a Wikipedia article; avoids bad links
    if node.a is not None and node.a.get("href", "").startswith("/wiki/"):
        links[node.a.contents[0]] = node.a["href"]   # Name: Link format

links

Let's design a scraper for the first page in this list, and hope that it will work for the other pages.

In [None]:
baseurl = "https://en.wikipedia.org"

baseurl + links["Aage Bohr"]

In [None]:
person_page = session.get(baseurl + links["Aage Bohr"], headers=header).text
ppbsObj = BeautifulSoup(person_page, "html.parser")

In [None]:
# Look at the table with the birth date
ppbsObj.find("table", {"class": ["infobox", "vcard"]})

In [None]:
ppbsObj.find("span", {"class": "bday"})    # An easy way to grab the birthday; this is a class for span tags

In [None]:
datetime.strptime(ppbsObj.find("span", {"class": "bday"}).contents[0], "%Y-%m-%d")    # Fetching a birthday

After experimenting with a single page, we can generalize the steps into a loop.

If you look at the list of links, you'll see some that point to organizations and some to footnotes. We try to find a birth date on each page; if none exists, or if the link is not what we want, we simply skip that entry and move on to the next.
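
The skip rule can be tried out separately from the network code. The sketch below uses a hypothetical helper, `parse_bday` (not part of the original notebook), that takes the text of a `bday` span, or `None` when the page has no such span, and returns the year, month, and day, or `None` when the entry should be skipped. It assumes the span holds a full `YYYY-MM-DD` date; anything else is treated as unusable.

```python
from datetime import datetime


def parse_bday(text):
    """Return {"Year", "Month", "Day"} for a YYYY-MM-DD string, else None.

    Hypothetical helper: `text` stands in for bday_span.contents[0],
    or None when the page has no bday span at all.
    """
    if text is None:
        return None    # No birthday on the page (e.g. an organization)
    try:
        bday = datetime.strptime(str(text).strip(), "%Y-%m-%d")
    except ValueError:
        return None    # Not a full YYYY-MM-DD date, so skip the entry
    return {"Year": bday.year, "Month": bday.month, "Day": bday.day}


print(parse_bday("1922-06-19"))    # A complete date is parsed
print(parse_bday("1922"))          # A year alone is skipped
print(parse_bday(None))            # A missing span is skipped
```

Returning `None` for every failure keeps the calling loop simple: it only needs one check before storing a result.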

Additionally, our script should sleep between pages so it doesn't go too fast and overload Wikipedia's servers; here, my script sleeps for ten seconds (just to be safe).
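
A fixed `sleep(10)` is the simplest way to do this. As a variation, the sketch below spaces requests by a minimum interval and counts the time already spent downloading toward the wait. `Throttle` is a hypothetical class, not part of the notebook, and its `clock` and `sleeper` parameters exist only so the behavior can be checked without really waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleeper=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleeper = sleeper
        self._last = None    # Time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep requests min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleeper(remaining)
                now = self._clock()
        self._last = now


# Short demonstration with a 0.2 second interval instead of ten seconds
throttle = Throttle(0.2)
start = time.monotonic()
for i in range(3):
    throttle.wait()    # A real script would fetch a page right after this
elapsed = time.monotonic() - start
print(f"3 iterations took {elapsed:.1f} s")
```

In the scraping loop, one would create `Throttle(10)` before the loop and call its `wait()` method where `sleep(10)` sits now. The first call returns immediately; later calls pause only for whatever part of the interval has not already passed.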

In [None]:
datadict = dict()
for name, link in links.items():
    sleep(10)    # Wait ten seconds between pages
    print("Fetching: " + name)
    person_page = session.get(baseurl + link, headers=header).text
    ppbsObj = BeautifulSoup(person_page, "html.parser")
    bday_span = ppbsObj.find("span", {"class": "bday"})
    if bday_span is not None:    # Skip pages with no birthday (e.g. organizations)
        try:
            bday = datetime.strptime(bday_span.contents[0], "%Y-%m-%d")
            datadict[name] = {"Year": bday.year,
                              "Month": bday.month,
                              "Day": bday.day}
        except ValueError:
            pass

In [None]:
datadict

In [None]:
nobelData = DataFrame(datadict).T
nobelData

Now we can create the visualizations we want to see when Nobel laureates were born.

In [None]:
nobelData.Month.value_counts().sort_index().plot(kind="bar")    # Births by month; recent pandas requires kind= as a keyword
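
One detail a month-by-month chart can trip over: `value_counts()` lists only the values that actually occur, so a month with no births would be missing from the axis rather than shown as zero. Here is a minimal sketch of one way to handle that, using a small made-up table in place of `nobelData` (the rows are illustrative only, not scraped results):

```python
import pandas as pd

# Made-up stand-in for nobelData; these rows are illustrative only
sample = pd.DataFrame({"Year": [1901, 1915, 1930, 1948],
                       "Month": [3, 3, 7, 12],
                       "Day": [14, 2, 21, 5]})

# Count births per month, then reindex so all twelve months appear
month_counts = (sample["Month"]
                .value_counts()
                .reindex(list(range(1, 13)), fill_value=0))

print(month_counts.tolist())    # Counts for January through December
```

Calling `month_counts.plot(kind="bar")` on the result then draws all twelve bars, in calendar order.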