# Collecting Data From the Internet

## Introduction

In the last lesson, we saw that we were able to access data from Google Books in just a few lines of code:

```python
import requests
response = requests.get("https://www.googleapis.com/books/v1/volumes?q=tom%20sawyer")
response.json()
```

In this lesson, we'll learn all about that long bit of code in red, the web address, or URL.

## Navigate the web like a scientist

[Click here](http://www.espn.com).

<img src="espn-screenshot.png" width="60%" />

What we'll see is ESPN's website.  As data scientists, we need to know a little about how we get information from ESPN, or from any website for that matter.

The key part is the web address at the very top of the browser window.  Data scientists and developers call that a URL, short for uniform resource locator.  Notice that if we click on any link on ESPN's website, the address will almost always begin with the same text: `www.espn.com`.

<img src="./nba-nfl-2.gif" width="50%" />

So when we click on the link to the NBA, the URL becomes `www.espn.com/nba`.  The part `www.espn.com` is called the root URL, and we can think of it like the root of a tree.  When we go to a section of the website like `www.espn.com/nba`, we are reaching one branch of that tree, the branch about the NBA.

So if we want to see anything related to the NBA, we go to `www.espn.com/nba`.  And the URLs keep describing where we are as we go deeper.  For example, if we want to see the list of teams in the NBA, we go to `www.espn.com/nba/teams`.  And to see the list of NBA players, we go to `www.espn.com/nba/players`.
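
To make the tree picture concrete, here is a small sketch that splits one of these URLs into its root and its branches using Python's standard `urllib.parse` module. The `http://` prefix is added so that the library can tell the host apart from the path, and the variable names are only illustrative.

```python
from urllib.parse import urlsplit

# The http:// scheme lets urlsplit separate the host from the path.
url = "http://www.espn.com/nba/teams"

parts = urlsplit(url)

# The root of the tree: the host name every ESPN link starts with.
root = parts.netloc

# The branches: each non-empty piece of the path, in order.
branches = [segment for segment in parts.path.split("/") if segment]

print(root)      # www.espn.com
print(branches)  # ['nba', 'teams']
```

Each extra segment in the path takes us one branch further from the root.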

## Introduction to resources

Both `nba/teams` and `nba/players` are called resources.  Much of the web is organized this way, and it's not by accident.  Web addresses used to be far less consistent.  Then the computer scientist Roy Fielding described a pattern called REST for organizing websites (a site that follows it is called RESTful), and developers widely adopted it.

One of the main concepts of being RESTful is that when we are looking for information, we are asking to see information about a topic, like a list of NBA players, and the URL should describe that topic.

So to see NBA players, we go to `www.espn.com/nba/players`.

### One last thing - query parameters

Now let's take a look at what happens if we want to search for specific information, say on Steph Curry.  From ESPN's home page, we'll type his name in the search box and then press enter.  Notice what URL ESPN ends up requesting.

<img src="query-parameter-search.gif" >


The important part of the URL is `search/results?q=steph%20curry`.  This `?q=steph%20curry` is called a query parameter.  Essentially, we are requesting search results, but only those matching the query `steph curry`.

There are two things to know about a query parameter.

1. The `?` indicates that the query is about to begin.
2. Query parameters always follow a `field=value` pattern, just like filling out a search field.  Here, with `q=steph%20curry`, we are saying that our query is steph curry.

Finally, the `%20` stands in for a space character.  URLs cannot contain literal spaces, so web browsers encode each space as `%20`.  Certain APIs allow us to query more than one field, for example a player by name who also plays for a specific team, but we'll see more about that later.
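
We can try the encoding ourselves with Python's standard `urllib.parse` module. This is a minimal sketch: the full search URL below is pieced together from the parts shown in this lesson, so treat it as an assumption rather than ESPN's documented address.

```python
from urllib.parse import quote, unquote, urlencode, urlsplit, parse_qs

# A space becomes %20 when encoded, and turns back into a space when decoded.
encoded = quote("steph curry")
decoded = unquote(encoded)
print(encoded)  # steph%20curry
print(decoded)  # steph curry

# urlencode builds the field=value pair. By default it writes spaces as "+",
# so we pass quote_via=quote to get the %20 form seen in the browser.
query = urlencode({"q": "steph curry"}, quote_via=quote)
print(query)  # q=steph%20curry

# Assumed full URL, assembled from the pieces in this lesson.
search_url = "http://www.espn.com/search/results?" + query

# Going the other way: read the fields and values back out of the URL.
fields = parse_qs(urlsplit(search_url).query)
print(fields)  # {'q': ['steph curry']}
```

Note that `parse_qs` returns a list for each field, because a field may appear more than once in a query.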

### Why we care

Now, let's take another look at the code we had above.

```python
import requests
response = requests.get("https://www.googleapis.com/books/v1/volumes?q=tom%20sawyer")
response.json()
```

Now we can understand that part in red.

* Our root URL is `https://www.googleapis.com/`.
* The resource is `/books/v1/volumes`, which asks for books, of type volumes, from the first version of Google's API.
* And then, of those volumes, we query for `tom sawyer`.
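
The same three pieces can be pulled out of the URL in code. The sketch below uses the standard `urllib.parse` module and makes no network request; the variable names are just for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.googleapis.com/books/v1/volumes?q=tom%20sawyer"
parts = urlsplit(url)

# Root URL: the scheme plus the host name.
root_url = f"{parts.scheme}://{parts.netloc}"

# Resource: the path that follows the root.
resource = parts.path

# Query: the field=value pairs after the "?", with %20 decoded to a space.
query = parse_qs(parts.query)

print(root_url)  # https://www.googleapis.com
print(resource)  # /books/v1/volumes
print(query)     # {'q': ['tom sawyer']}
```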

So by constructing a URL like this, we can ask a website for specific kinds of information.  And don't worry, APIs like the Google Books API do provide instructions on which URLs to use, but it's good to see the pattern that many APIs follow.
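
As a rough sketch of constructing such a URL in code, the hypothetical helper `build_url` below (not part of `requests` or of Google's API) joins a root URL, a resource, and any query parameters. It only builds the string, so it runs without an Internet connection.

```python
from urllib.parse import quote, urlencode

def build_url(root, resource, **params):
    """Join a root URL, a resource path, and optional query parameters."""
    # Avoid doubled or missing slashes between the root and the resource.
    url = root.rstrip("/") + "/" + resource.strip("/")
    if params:
        # quote_via=quote encodes spaces as %20 rather than "+".
        url += "?" + urlencode(params, quote_via=quote)
    return url

books_url = build_url("https://www.googleapis.com/", "/books/v1/volumes", q="tom sawyer")
print(books_url)  # https://www.googleapis.com/books/v1/volumes?q=tom%20sawyer
```

The resulting string is the same address we passed to `requests.get` earlier in the lesson.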

### Summary

In this lesson, we saw that websites follow a convention in how they organize content.  Content is organized by resource.

When we access information about a resource, we generally use the plural form of the resource name, as in `/nba/players`.  Then we can be even more specific about what player information we would like to see by using query parameters.  Query parameters follow a pattern of `?field=value`, and they are used both when we search for information on a website and when we request information via an API.