# Collecting Data From the Internet

## Introduction

In the last lesson, we saw that we were able to access data from Google Books in just a few lines of code:

```python
import requests
response = requests.get("https://www.googleapis.com/books/v1/volumes?q=tom%20sawyer")
response.json()
```

In this lesson, we'll learn all about the long string passed to `requests.get`: the web address, or URL.

## Navigate the web like a scientist

[Click here](http://www.espn.com).

<img src="espn-screenshot.png" width="60%" />

What we'll see is ESPN's website. And as data scientists, we need to know a little bit about how we get information from ESPN, or any website for that matter.

The key part is the web address at the very top of the page. We call that a URL, short for uniform resource locator. Notice that if we click on any link on ESPN's website, the new address will almost always begin with the same text: `www.espn.com`.

<img src="./nba-nfl-2.gif" width="50%" />

So when we click on the link to the NBA, the URL becomes `www.espn.com/nba`. The part `www.espn.com` is called the root URL, as it's almost like the base, or the root, of a tree.

If we want to see anything related to the NBA on `www.espn.com`, ESPN has organized it under `www.espn.com/nba`. And the URLs keep telling us what we are looking at. For example, if we want to see the list of teams in the NBA, we go to `www.espn.com/nba/teams`. And to see the list of NBA players we go to `www.espn.com/nba/players`.
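To make the root-plus-path idea concrete, here is a minimal sketch that splits a web address with Python's standard `urllib.parse` module. The helper name `split_url` is hypothetical, and `http://` is added to the front of the address because `urlparse` needs a scheme to recognize the host.

```python
from urllib.parse import urlparse

def split_url(url):
    """Return the root (host) and the list of path segments of a URL."""
    parts = urlparse(url)
    # netloc is the host, e.g. "www.espn.com"; path is everything after it
    segments = [segment for segment in parts.path.split("/") if segment]
    return parts.netloc, segments

root, segments = split_url("http://www.espn.com/nba/teams")
print(root)      # www.espn.com
print(segments)  # ['nba', 'teams']
```

Each extra segment after the root narrows down what we are asking the website to show us.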

## Introduction to resources

Both `nba/teams` and `nba/players` are called resources. And much of the web is organized this way. It's not by accident. Websites were not always organized so consistently. Then, in his 2000 doctoral dissertation, the computer scientist Roy Fielding described a pattern for organizing websites that came to be called RESTful, and many developers adopted it.

One of the main concepts of being RESTful is that when we are looking for information, we are asking to see either a collection, like a list of NBA players, or a single item, like one NBA player. And we should follow the same pattern when organizing this information.

So to see NBA players, we should go to `www.espn.com/nba/players`, and to see a single NBA player like Steph Curry, we should go to `www.espn.com/nba/players/steph_curry`.
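The collection-versus-single-item pattern can be sketched as two small functions. This is only an illustration of the idealized RESTful convention: the names `ROOT_URL`, `collection_url`, and `member_url` are hypothetical, and the addresses it builds are not guaranteed to be real ESPN pages.

```python
ROOT_URL = "http://www.espn.com"

def collection_url(resource):
    """Address for a whole collection, e.g. every NBA player."""
    return f"{ROOT_URL}/{resource}"

def member_url(resource, identifier):
    """Address for one member: the collection address plus an identifier."""
    return f"{collection_url(resource)}/{identifier}"

print(collection_url("nba/players"))
# http://www.espn.com/nba/players
print(member_url("nba/players", "steph_curry"))
# http://www.espn.com/nba/players/steph_curry
```

Notice that the single-item address is just the collection address with one more segment on the end.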

<img src="wiki-rest.png" width="50%">

> REST is short for Representational State Transfer, whatever that means. If you're interested, take a look at the snippet above from the Wikipedia article on the subject.

Let's see if ESPN follows this RESTful pattern.

<img src="steph-curry.png" width="60%">

Ok, so the URL is `http://www.espn.com/nba/player/_/id/3975/stephen-curry`. So it's close, but not perfect. One difference is that ESPN uses `www.espn.com/nba/player` instead of `players`. Another difference is that the key identifier looks like `/id/3975`, which is probably Stephen Curry's player id.

An **id** is just a unique number (or string of characters) that identifies a specific item, like a social security number. ESPN and many other websites prefer to use ids because multiple items could have the same name.
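As a rough sketch, we can pull that id out of the address with Python's standard `urllib.parse` module. The helper name `extract_player_id` is hypothetical, and the code assumes the id is whatever segment follows `id` in the path, which matches the Stephen Curry URL above but may not hold for every page.

```python
from urllib.parse import urlparse

def extract_player_id(url):
    """Return the path segment that follows 'id', or None if there is none."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    if "id" in segments:
        position = segments.index("id")
        # Make sure something actually comes after the 'id' segment
        if position + 1 < len(segments):
            return segments[position + 1]
    return None

url = "http://www.espn.com/nba/player/_/id/3975/stephen-curry"
print(extract_player_id(url))  # 3975
```

The function returns the id as text, since an id can be any string of characters and not only a number.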

So now that we know a little bit about how websites are organized, we can pick up on the pattern that any specific website uses!

### Why we care

So far we learned some of the patterns that websites use to organize information. This is important, because we will soon write programs that visit websites for us and capture that information. And if we, say, want to go to every player's page in the NBA, we could visit `http://www.espn.com/nba/player/_/id/3975`, then `http://www.espn.com/nba/player/_/id/3976`, and so on.
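Here is a minimal sketch of how a program could produce that list of addresses. It only builds the URLs and does not visit them. The names `BASE_URL` and `player_urls` are hypothetical, and the code assumes that player ids are consecutive numbers, which we have not verified for ESPN.

```python
BASE_URL = "http://www.espn.com/nba/player/_/id"

def player_urls(first_id, last_id):
    """Build one address per id, from first_id through last_id inclusive."""
    return [f"{BASE_URL}/{player_id}" for player_id in range(first_id, last_id + 1)]

for url in player_urls(3975, 3977):
    print(url)
# http://www.espn.com/nba/player/_/id/3975
# http://www.espn.com/nba/player/_/id/3976
# http://www.espn.com/nba/player/_/id/3977
```

Because the pattern is predictable, a short loop can stand in for a lot of clicking.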

So the next thing we have to learn is how to retrieve information once we navigate to these websites, and then how to use a programming language like Python to do so.

### Summary

In this lesson, we saw that websites follow a convention in how they organize content. Content is organized by resource, in a RESTful pattern.

By thinking of information in terms of a resource, we can think of accessing information online in terms of accessing information about the collection, or a single member of that resource.  When we access information about the collection, we generally use the plural form of the resource name, as in `/nba/players`, and when we access a single member of that resource we can expect to see the resource followed by that specific member id, as in `/nba/players/3976`.  

We also saw that because this is only a convention, websites may not follow it exactly, but by knowing about the convention we can more easily see the pattern in how a website organizes its information.

Next up, we'll see how to retrieve this information on the web.