# Web scraping basics

See also the [Volumetrics notebook](Volumetrics.ipynb).

In [1]:
url = "http://en.wikipedia.org/wiki/Jurassic"

Use View Source in your browser to find where the age range appears on the page and what the surrounding HTML looks like.

Then try to find the same string in the text of the response below.

In [2]:
import requests
r = requests.get(url)
r.text[:500]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Jurassic - Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Jurassic","wgTitle":"Jurassic","wgCurRe'
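The first 500 characters only cover the page header, so the age range has to be located further down. A minimal sketch of one way to do that with `str.find` follows. The helper name `show_context` and the `sample` string are hypothetical; `sample` stands in for `r.text` so the cell runs offline, and its markup is only a guess based on the pattern used later in this notebook.

```python
def show_context(text, needle, width=40):
    """Return the text around the first occurrence of needle, or None if absent."""
    i = text.find(needle)
    if i == -1:
        return None
    start = max(i - width, 0)
    return text[start:i + len(needle) + width]

# Hypothetical stand-in for r.text; the real page markup may differ.
sample = 'Period</th><td><i>201.3–145&#160;million years ago</i></td>'

snippet = show_context(sample, "million years ago", width=20)
print(snippet)
```

On the live page, `show_context(r.text, "million years ago")` would show the markup that surrounds the age range, which is what the regular expression needs to match.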

We can pull the age range out of the HTML with a [regular expression](https://docs.python.org/3/library/re.html):

In [3]:
import re

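# Capture the contents of an <i> element that ends with "million years ago".
s = re.search(r'<i>(.+?million years ago)</i>', r.text)
text = s.group(1)  # raises AttributeError if nothing matched, because s is None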
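A caveat about the pattern above: `.+?` is lazy, but `re.search` still starts from the leftmost `<i>` it can, so the capture can run across an earlier italic element on the same line. The sketch below illustrates this on a made-up `sample` string (an assumption, not the real page markup) and shows one possible tightening, `[^<]+?`, which cannot cross a tag boundary.

```python
import re

# Hypothetical snippet with two <i> elements on one line.
sample = '<i>Jurassic</i> Period <i>201.3–145&#160;million years ago</i>'

# Lazy dot: starts at the first <i> and expands until the phrase is found.
loose = re.search(r'<i>(.+?million years ago)</i>', sample).group(1)

# Negated class: the capture may not contain '<', so it stays inside one element.
strict = re.search(r'<i>([^<]+?million years ago)</i>', sample).group(1)

print(loose)
print(strict)
```

Whether this matters depends on the page; it worked here for the Jurassic article, but the stricter form is a cheap safeguard.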

Exercise: Make a function to get the start and end ages of *any* geologic period, taking the name of the period as an argument.

In [4]:
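def get_age(period):
    """Return the (start, end) ages of a geologic period in Ma, scraped from Wikipedia."""
    url = "http://en.wikipedia.org/wiki/" + period
    r = requests.get(url)
    r.raise_for_status()  # fail early on HTTP errors such as 404
    match = re.search(r'<i>([.0-9]+)–([.0-9]+)&#160;million years ago</i>', r.text)
    if match is None:
        raise ValueError("No age range found for " + period)
    start, end = match.groups()
    return float(start), float(end)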

In [5]:
period = "Jurassic"
get_age(period)

(201.3, 145.0)
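Scrapers like this are fragile, because the pattern depends on exactly how the page encodes the dash and the non-breaking space. As a sketch only, the parsing step can be split from the download so it can be tested offline and made a little more tolerant. The names `AGE_PATTERN` and `parse_age` are hypothetical, and the set of variants accepted (hyphen or en dash; `&#160;`, `&nbsp;`, or plain whitespace) is an assumption about what the page might contain.

```python
import re

# Two numbers separated by a dash, then some form of space, then the phrase.
AGE_PATTERN = re.compile(
    r'([0-9]+(?:\.[0-9]+)?)\s*[–-]\s*([0-9]+(?:\.[0-9]+)?)'
    r'(?:&#160;|&nbsp;|\s)+million years ago'
)

def parse_age(html):
    """Return (start, end) in Ma from a chunk of HTML, or raise ValueError."""
    match = AGE_PATTERN.search(html)
    if match is None:
        raise ValueError("no age range found")
    start, end = match.groups()
    return float(start), float(end)

# Hypothetical snippet in the form the notebook's pattern expects.
print(parse_age('<i>201.3–145&#160;million years ago</i>'))
```

With this split, `get_age` would only need to fetch the page and pass `r.text` to `parse_age`.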

In [6]:
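def duration(period):
    """Return a sentence stating how long the given period lasted."""
    t0, t1 = get_age(period)
    length = t0 - t1  # start age minus end age; named to avoid shadowing duration()
    response = "According to Wikipedia, the {0} period was {1:.2f} Ma long.".format(period, length)
    return response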

In [7]:
duration('Cretaceous')

'According to Wikipedia, the Cretaceous period was 79.00 Ma long.'
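Because `duration` calls `get_age`, it needs a network connection every time it runs. One possible variation, sketched here, passes the lookup in as an argument so the formatting can be checked offline. The names `describe_duration`, `fake_get_age`, and `KNOWN_AGES` are hypothetical; the Jurassic ages are the ones returned earlier in this notebook.

```python
def describe_duration(period, age_lookup):
    """Format the length of a period, using age_lookup(period) -> (start, end)."""
    t0, t1 = age_lookup(period)
    return "According to Wikipedia, the {0} period was {1:.2f} Ma long.".format(period, t0 - t1)

# Offline stand-in for get_age, using the values scraped above.
KNOWN_AGES = {"Jurassic": (201.3, 145.0)}

def fake_get_age(period):
    return KNOWN_AGES[period]

message = describe_duration("Jurassic", fake_get_age)
print(message)
```

Calling `describe_duration("Cretaceous", get_age)` with the real scraper should reproduce the result shown above.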