## Getting data from the Internet

We've seen how to obtain data from our local file system.

The other common place we might want to obtain data from today is the internet.

It's very common today to treat the web as a source and store of information; we need to be able to **programmatically
download data, and place it in Python objects**.

We may also want to be able to **programmatically *upload* data, for example, to automatically fill in forms**.

This can be really powerful if we want to, for example, do automated meta-analysis across a selection of research papers.

### URLs

<span style="background-color: #FFFF00">All internet resources are defined by a Uniform Resource Locator (URL). </span>

In [1]:
"http://maps.googleapis.com:80/maps/api/staticmap?size=400x400&center=51.51,-0.1275&zoom=12"

'http://maps.googleapis.com:80/maps/api/staticmap?size=400x400&center=51.51,-0.1275&zoom=12'

<span style="background-color: #FFFF00">A URL consists of:</span>

* <span style="background-color: #FFFF00">A *scheme* (http, https, ssh, ...)</span>
* <span style="background-color: #FFFF00">A *host* (maps.googleapis.com, the name of the remote computer you want to talk to)</span>
* <span style="background-color: #FFFF00">A *port* (optional, most protocols have a typical port associated with them, e.g. 80 for http)</span>
* <span style="background-color: #FFFF00">A *path* (Like a file path on the machine, here it is maps/api/staticmap)</span>
* <span style="background-color: #FFFF00">A *query* part after a ?, (optional, usually ampersand-separated *parameters* e.g. size=400x400, or zoom=12)</span>
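
As a small illustration of the parts listed above, the sketch below splits the example URL using the standard library's `urllib.parse` module. It is only meant to make the components visible; note that `urlsplit` reports the path with its leading slash.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://maps.googleapis.com:80/maps/api/staticmap?size=400x400&center=51.51,-0.1275&zoom=12"

# Split the URL into its named components.
parts = urlsplit(url)

print(parts.scheme)    # the scheme
print(parts.hostname)  # the host
print(parts.port)      # the port, as an integer
print(parts.path)      # the path, including the leading "/"

# The query string is a set of ampersand-separated parameters;
# parse_qs returns each value as a list, since a key may be repeated.
params = parse_qs(parts.query)
print(params)
```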

**Supplementary materials**: The parts can differ between protocols, and the above is a simplification. You can see more, for example, at https://en.wikipedia.org/wiki/URI_scheme

<font color = "blue">URLs are not allowed to include all characters; we need to, for example, "escape" a space that appears inside the URL,
replacing it with %20, so e.g. a request of http://example.com/some page would need to be http://example.com/some%20page </font>


**Supplementary materials**: <font color = "blue">The code used to replace each character is its [ASCII](http://www.asciitable.com) code, written in hexadecimal after a % sign.</font>

**Supplementary materials**: The escaping rules are quite subtle. See https://en.wikipedia.org/wiki/Percent-encoding
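
To see percent-encoding in action, here is a minimal sketch using `quote` and `unquote` from the standard library's `urllib.parse`. The strings being escaped are made up for illustration.

```python
from urllib.parse import quote, unquote

# A space is replaced by "%" followed by its ASCII code in hexadecimal.
space_code = "%{:02X}".format(ord(" "))
print(space_code)

# quote() escapes characters that are not allowed to appear literally.
escaped = quote("some page")
print(escaped)

# Characters with a special meaning in URLs, such as "&" and "=", are escaped too.
escaped_special = quote("a&b=c")
print(escaped_special)

# unquote() reverses the escaping.
restored = unquote(escaped)
print(restored)
```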

### Requests

<span style="background-color: #FFFF00">The Python [requests](http://docs.python-requests.org/en/latest/) library can help us manage and manipulate URLs. It is easier to use than the `urllib` library that is part of the standard library, and is included with Anaconda and Canopy. It sorts out escaping, parameter encoding, and so on for us.</span>
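
To get a feel for the parameter encoding that is being handled for us, the sketch below does the equivalent step by hand with the standard library's `urllib.parse.urlencode`. This is an illustration of the idea rather than of what requests does internally, which may differ in detail.

```python
from urllib.parse import urlencode

base = "http://maps.googleapis.com/maps/api/staticmap"
params = {"size": "400x400", "center": "51.51,-0.1275", "zoom": 12}

# urlencode joins the parameters with "&" and escapes unsafe characters,
# so the comma in the "center" value becomes %2C.
query = urlencode(params)
full_url = base + "?" + query
print(full_url)
```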

To request the above URL, for example, we write:

In [3]:
import requests

In [4]:
response = requests.get(
    "http://maps.googleapis.com/maps/api/staticmap",
    params={
        'size': '400x400',
        'center': '51.51,-0.1275',
        'zoom': 12,
    },
)

In [5]:
response.content[0:50]

b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\x90\x00\x00\x01\x90\x08\x03\x00\x00\x00\xb7a\xc6\xfe\x00\x00\x03\x00PLTEI>&rP4KKK'

When we make a request, the result comes back as raw bytes (`response.content`) or as text (`response.text`). For the PNG image above, the raw content isn't very readable.

Just as for file access, therefore, we will need to pass the data we get to a Python module which understands that file format.

Again, it is important to separate the *transport* model (e.g. a file system, or an "http request" for the web) from the data model of the data that is returned.
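
As a rough sketch of that separation, the code below takes the first bytes of the response shown above (copied by hand, so no download is needed) and interprets them according to the PNG file format: an 8-byte signature, followed by an `IHDR` chunk whose data starts with the width and height as big-endian 32-bit integers. In practice an image library would do this for us; the manual unpacking here is only for illustration.

```python
import struct

# First bytes of the response content, copied from the output above.
header = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\x90\x00\x00\x01\x90\x08\x03\x00\x00\x00'

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
is_png = header.startswith(PNG_SIGNATURE)

# After the signature: 4-byte chunk length, 4-byte chunk type,
# then the image width and height as big-endian unsigned 32-bit integers.
chunk_type = header[12:16]
width, height = struct.unpack(">II", header[16:24])

print(is_png, chunk_type, width, height)
```

The width and height recovered this way agree with the `size` parameter sent in the request.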

### Example: Sunspots

Let's try to get something scientific: the sunspot cycle data from http://sidc.be/silso/home:

In [6]:
spots = requests.get('http://www.sidc.be/silso/INFO/snmtotcsv.php').text

In [7]:
spots[0:80]

'1749;01;1749.042;  96.7; -1.0;   -1;1\n1749;02;1749.123; 104.3; -1.0;   -1;1\n1749'

In [8]:
dir(requests.get("http://www.sidc.be/silso/INFO/snmtotcsv.php"))

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

In [9]:
type(requests.get("http://www.sidc.be/silso/INFO/snmtotcsv.php"))

requests.models.Response

This looks like semicolon-separated data, with different records on different lines. (Line separators come out as `\n`.)

<center> ** -> There are many scientific datasets which can now be downloaded like this <- ** </center>
<br>
<center> ** -> Integrating the download into your data pipeline can help to keep your data flows organised. <- ** </center>

<br>

This is better than downloading the data manually and then getting back into the programming flow.

### Writing our own Parser

We'll need a Python library to handle semicolon-separated data like the sunspot data.

You might be thinking: "But I can do that myself!":

In [8]:
lines = spots.split("\n")
lines[0:5]

['1749;01;1749.042;  96.7; -1.0;   -1;1',
 '1749;02;1749.123; 104.3; -1.0;   -1;1',
 '1749;03;1749.204; 116.7; -1.0;   -1;1',
 '1749;04;1749.288;  92.8; -1.0;   -1;1',
 '1749;05;1749.371; 141.7; -1.0;   -1;1']

In [9]:
years = [line.split(";")[0] for line in lines]

In [10]:
years[0:15]

['1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1750',
 '1750',
 '1750']

But **don't**: what if, for example, one of the records contains a separator inside it? Most programs that write such files will put the content in quotes,
so that, for example,

    "something; something"; something; something

has three fields, the first of which is

    something; something

The naive code above would give four fields, of which the first is

    "something

You'll never manage to get all that right, so you'll be better off using a library to do it.
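
As one possible illustration, the sketch below compares the naive split with the standard library's `csv` module on the quoted record above, and then reads two of the sunspot lines shown earlier. The `skipinitialspace` option is used so that the spaces after each semicolon are dropped. Treating the first column as the year and the fourth as the value of interest is an assumption made for this example.

```python
import csv
import io

# The quoted record from the text above.
record = '"something; something"; something; something'

# Naive splitting breaks the quoted field in two.
naive_fields = record.split(";")
print(len(naive_fields), naive_fields[0])

# csv.reader understands the quoting rules.
reader = csv.reader(io.StringIO(record), delimiter=";", skipinitialspace=True)
library_fields = next(reader)
print(len(library_fields), library_fields[0])

# The same reader copes with the sunspot data (first two lines copied from above).
sample = "1749;01;1749.042;  96.7; -1.0;   -1;1\n1749;02;1749.123; 104.3; -1.0;   -1;1\n"
rows = list(csv.reader(io.StringIO(sample), delimiter=";"))

# Assumption: column 0 is the year and column 3 is the measured value.
years = [int(row[0]) for row in rows]
values = [float(row[3]) for row in rows]
print(years, values)
```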

### Writing data to the internet

<span style="background-color: #FFFF00"> Note that we're using `requests.get`. `get` is used to receive data from the web.
You can also use `post` to fill in a web form programmatically. </span>

<font color ="red"> ** -> ** </font> **Supplementary material**: Learn about using `post` with [requests](http://docs.python-requests.org/en/latest/user/quickstart/).

<font color ="red"> ** -> ** </font> **Supplementary material**: Learn about the different kinds of [http request](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods): [Get, Post, Put, Delete](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete)...

This can be used for all kinds of things, for example, to programmatically add data to a web resource. It's all well beyond
our scope for this course, but it's important to know it's possible, and start to think about the scientific possibilities.
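
To give a flavour of what a form submission involves, here is a minimal sketch that builds (but does not send) a POST request using only the standard library. The URL and the form fields are hypothetical, invented for this example. With requests, the rough equivalent would be `requests.post(form_url, data=form_fields)`.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form address and field values, for illustration only.
form_url = "http://example.com/submit"
form_fields = {"name": "Ada", "comment": "sunspots & cycles"}

# Form data travels in the body of the request, encoded like a query string.
body = urlencode(form_fields).encode("ascii")
request = Request(form_url, data=body, method="POST")

print(request.get_method())
print(request.data)

# Actually sending it would be done with urllib.request.urlopen(request);
# that step is left out here because the address is not a real form.
```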