# Lab 2 (counts as 50% of HW 1): Web scraping and data collection 

UIC CS 418, Spring 2019

_You are encouraged to work with a partner but the code that you submit should be your own (e.g., you cannot copy each other's code). Students posting solutions to this homework assignment is a violation of the academic integrity policy, as stated in the course syllabus._

This lab will count as *50%* of your first homework assignment. You should be able to complete most, if not all, of it during class on February 4th, 2019.

In this lab you will practice collecting and processing data in Python. By the end of this exercise, you should hopefully be able to look at the wonderful world wide web without fear, comforted by the fact that anything you can see with your human eyes, a computer can see with its computer eyes. In particular, we aim to give you some familiarity with:

* Using HTTP to fetch the content of a website
* HTTP Requests (and lifecycle)
* RESTful APIs
    * Authentication (OAuth)
    * Pagination
    * Rate limiting
* JSON vs. HTML (and how to parse each)
* HTML traversal (CSS selectors)

Since everyone loves food (presumably), the ultimate end goal of this homework will be to acquire the data to answer some questions and hypotheses about the restaurant scene in Chicago (which we will get to later). We will download __both__ the metadata on restaurants in Chicago from the Yelp API and with this metadata, retrieve the comments/reviews and ratings from users on restaurants.


### Library Documentation

For this lab, you need to look up online documentation for the Python packages you will use:

* Standard Library: 
    * [io](https://docs.python.org/2/library/io.html)
    * [time](https://docs.python.org/2/library/time.html)
    * [json](https://docs.python.org/2/library/json.html)

* Third Party
    * [requests](http://docs.python-requests.org/en/master/)
    * [Beautiful Soup (version 4)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
    * [yelp-fusion](https://www.yelp.com/developers/documentation/v3/get_started)

**Note:** You may come across a `yelp-python` library online. The library is deprecated and incompatible with the current Yelp API, so do not use the library.


## Due Date

This assignment is due at 11:59pm on February 8, 2019. Note that Lab 2 is due at the same time as Lab 1, and if either of them is submitted late, the late policy for HW 1 will apply. Instructions on how to submit Lab 2 to Gradescope are given at the end of the notebook.

## Setup

First, import necessary libraries:

In [1]:
import io, time, json
import requests
from bs4 import BeautifulSoup

## Authentication and working with APIs

There are various authentication schemes that APIs use, listed here in relative order of complexity:

* No authentication
* [HTTP basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication)
* Cookie based user login
* OAuth (v1.0 & v2.0, see this [post](http://stackoverflow.com/questions/4113934/how-is-oauth-2-different-from-oauth-1) explaining the differences)
* API keys
* Custom Authentication

For the NYT example below (**Q0**), we do not need to authenticate since it is a publicly visible page. HTTP basic authentication isn't too common for consumer sites/applications that have the concept of user accounts (like Facebook, LinkedIn, Twitter, etc.), but it is simple to set up quickly and you will often encounter it on individual password-protected pages/sites.
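
To make HTTP basic authentication concrete: the client sends an `Authorization` header whose value is the word `Basic` followed by the Base64 encoding of `username:password`. The sketch below builds that header value with the standard library only; the helper name `basic_auth_header` and the credentials are made up for illustration.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header value used by HTTP basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

# Hypothetical credentials, for illustration only.
header_value = basic_auth_header("alice", "s3cret")
print(header_value)
```

Because Base64 is an encoding and not encryption, basic authentication should only be used over HTTPS.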

Cookie based user login is what the majority of services use when you login with a browser (i.e. username and password). Once you sign in to a service like Facebook, the response stores a cookie in your browser to remember that you have logged in (HTTP is stateless). Each subsequent request to the same domain (i.e. any page on `facebook.com`) also sends the cookie that contains the authentication information to remind Facebook's servers that you have already logged in.
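
As a rough sketch of the cookie round trip described above, the standard library's `http.cookies` module can parse a `Set-Cookie` value and rebuild the `Cookie` header that would accompany later requests. The cookie name and value are invented for this example; in practice a browser (or a `requests.Session`) does this bookkeeping for you.

```python
from http.cookies import SimpleCookie

# A made-up Set-Cookie header value, as a server might send after a successful login.
set_cookie = "session_id=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie)

# Build the Cookie header the client would attach to later requests to the same domain.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print(cookie_header)
```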

Many REST APIs, however, use OAuth (authentication using tokens), which can be thought of as a programmatic way to "login" _another_ user. Using tokens, a user (or application) only needs to send the login credentials once in the initial authentication and, as a response from the server, gets a special signed token. This signed token is then sent in future requests to the server (in place of the user credentials).

A similar concept commonly used by many APIs is to assign API Keys to each client that needs access to server resources. The client must then pass the API Key along with _every_ request it makes to the API to authenticate. This is because the server is typically relatively stateless and does not maintain a session between subsequent calls from the same client. Most APIs (including Yelp) allow you to pass the API Key via a special HTTP Header: "Authorization: Bearer <API_KEY>". Check out the [docs](https://www.yelp.com/developers/documentation/v3/authentication) for more information.
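
The header format above can be inspected without sending anything over the network. This minimal sketch uses the standard library's `urllib.request` (not `requests`, which the lab itself uses) to build a request object and read the header back; `YOUR_API_KEY` is a placeholder.

```python
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder, not a real key

# Build (but do not send) a request carrying the key in the Authorization header.
request = urllib.request.Request(
    "https://api.yelp.com/v3/businesses/search?location=Chicago",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(request.get_header("Authorization"))
```

With `requests`, the same dictionary would be passed through the `headers` argument of `requests.get()`.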


## Q0: Basic HTTP Requests (No authentication)

First, let's do the "hello world" of making web requests with Python to get a sense for how to programmatically access web pages: an (unauthenticated) HTTP GET to download a web page.

Fill in the function to use `requests` to download and return the raw HTML content of the URL passed in as an argument. As an example try the following NYT article (on Facebook's algorithmic news feed): [http://www.nytimes.com/2016/08/28/magazine/inside-facebooks-totally-insane-unintentionally-gigantic-hyperpartisan-political-media-machine.html](http://www.nytimes.com/2016/08/28/magazine/inside-facebooks-totally-insane-unintentionally-gigantic-hyperpartisan-political-media-machine.html)

Your function should return a tuple of: (`<status_code>`, `<text>`). (Hint: look at the **Library documentation** listed earlier to see how `requests` should work.) 

In [2]:
def retrieve_html(url):
    """
    Return the raw HTML at the specified URL.

    Args:
        url (string): 

    Returns:
        status_code (integer):
        raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.
    """
    
    #[YOUR CODE HERE]
    r = requests.get(url)
    status_code = r.status_code
    raw_html = r.text
    return status_code, raw_html
    

In [3]:
facebook_article = retrieve_html('http://www.nytimes.com/2016/08/28/magazine/inside-facebooks-totally-insane-unintentionally-gigantic-hyperpartisan-political-media-machine.html')
print(facebook_article)
# (200, u'<!DOCTYPE html>\n<html lang="en" itemId="https://www.nytimes.com/2016/08/28/magazine/inside...')



Now while this example might have been fun, we haven't yet done anything more than we could with a web browser. To really see the power of programmatically making web requests we will need to interact with an API. For the rest of this lab we will be working with the [Yelp API](https://www.yelp.com/developers/documentation/v3/get_started) and Yelp data (for an extensive data dump see their [Academic Dataset Challenge](https://www.yelp.com/dataset_challenge)). 

## Yelp API Access

The reasons for using the Yelp API are threefold:

1. Incredibly rich dataset that combines:
    * entity data (users and businesses)
    * preferences (i.e. ratings)
    * geographic data (business location and check-ins)
    * temporal data
    * text in the form of reviews
    * and even images.
2. Well [documented API](https://www.yelp.com/developers/documentation/v3/get_started) with thorough examples.
3. Extensive data coverage so that you can find data that you know personally (from your home town/city or account). This will help with understanding and interpreting your results.

Yelp used to use OAuth tokens but has now switched to API Keys. **For the sake of backwards compatibility Yelp still provides a Client ID and Secret for OAuth, but you will not need those for this assignment.** 

To access the Yelp API, we will need to go through a few more steps than we did with the first NYT example. Most large web-scale companies use a combination of authentication and rate limiting to control access to their data and to ensure that everyone using it abides by their terms of use. The first step (even before we make any request) is to set up a Yelp account if you do not have one and get API credentials.

1. Create a [Yelp](https://www.yelp.com/login) account (if you do not have one already)
2. [Generate API keys](https://www.yelp.com/developers/v3/manage_app) (if you haven't already). You will only need the API Key (not the Client ID or Client Secret) -- more on that later.

Now that we have our accounts set up we can start making requests!


## Q1: Authenticated HTTP Request with the Yelp API

First, store your Yelp credentials in a local file (kept out of version control) which you can read in to authenticate with the API. This file can be any format/structure since you will fill in the function stub below.

For example, you may want to store your key in a file called `api_key.txt` (run in terminal):
```bash
echo 'YELP_API_KEY' > api_key.txt
```

**KEEP THE API KEY FILE PRIVATE AND OUT OF VERSION CONTROL (and definitely do not submit it to Gradescope!)**

You can then read from the file using:

In [4]:
with open('api_key.txt', 'r') as f:
    api_key = f.read().replace('\n','')
    # use your api_key (avoid printing it, so that it does not end up in the saved notebook output)


In [5]:
def read_api_key(filepath):
    """
    Read the Yelp API Key from file.
    
    Args:
        filepath (string): File containing API Key
    Returns:
        api_key (string): The API Key
    """
    
    # feel free to modify this function if you are storing the API Key differently
    with open(filepath, 'r') as f:
        return f.read().replace('\n','')

Using the Yelp API, fill in the following function stub to make an authenticated request to the [search](https://www.yelp.com/developers/documentation/v3/business_search) endpoint.

In [6]:
def yelp_search(api_key, query, offset=0):
    """
    Make an authenticated request to the Yelp API.

    Args:
        api_key (string): Yelp API Key
        query (string): Search term (sent as the `location` parameter)
        offset (integer): number of results to skip (used for pagination)

    Returns:
        total (integer): total number of businesses on Yelp corresponding to the query
        businesses (list): list of dicts representing each business
    """
    
    #[YOUR CODE HERE]
    url = 'https://api.yelp.com/v3/businesses/search'
    # requests URL-encodes the parameters, so the query is passed through unchanged
    url_params = {'location': query, 'offset': offset}
    header = {'Authorization': 'Bearer %s' % api_key}
    
    response = requests.get(url, headers=header, params=url_params)
    resp_dict = response.json()
    
    return resp_dict['total'], resp_dict['businesses']
    
    
    

When writing the Python request, you'll need to pass in a custom header as well as a parameter. As a test, search for businesses in Chicago. You should find ~8,500 total depending on when you search (but this will actually differ from the number of actual Business objects returned... more on this in the next section).
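
When parameters are passed as a dictionary, the HTTP library URL-encodes them, so there is no need to replace spaces or commas by hand. The standard library's `urlencode` gives a feel for what that encoding looks like; `requests` applies comparable encoding to its `params` argument. The parameter values here are just an example.

```python
from urllib.parse import urlencode

params = {"location": "Greektown, Chicago, IL", "offset": 0}
query_string = urlencode(params)
print(query_string)
# location=Greektown%2C+Chicago%2C+IL&offset=0
```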

In [7]:
api_key = read_api_key('api_key.txt')
num_records, data = yelp_search(api_key, 'Chicago')
print(num_records)
#8500
print(list(map(lambda x: x['name'], data)))
#['Girl & the Goat', 'Wildberry Pancakes and Cafe', 'Au Cheval', 'The Purple Pig', 'Art Institute of Chicago', 'Smoque BBQ', "Lou Malnati's Pizzeria", 'Alinea', "Kuma's Corner - Belmont", 'Little Goat Diner', "Bavette's Bar & Boeuf", 'Cafe Ba-Ba-Reeba!', "Portillo's Hot Dogs", 'Quartino Ristorante', "Pequod's Pizzeria", 'Crisp', "Joe's Seafood, Prime Steak & Stone Crab", 'Xoco', "Molly's Cupcakes", 'Millennium Park']

8400
['Girl & the Goat', 'Wildberry Pancakes and Cafe', 'Au Cheval', 'The Purple Pig', 'Art Institute of Chicago', 'Smoque BBQ', "Lou Malnati's Pizzeria", 'Alinea', "Kuma's Corner - Belmont", 'Little Goat Diner', "Bavette's Bar & Boeuf", 'Cafe Ba-Ba-Reeba!', "Portillo's Hot Dogs", 'Quartino Ristorante', "Pequod's Pizzeria", 'Crisp', "Joe's Seafood, Prime Steak & Stone Crab", 'Xoco', "Molly's Cupcakes", 'Millennium Park']


Now that we have completed the "hello world" of working with the Yelp API, we are ready to really fly! The rest of the exercise will have a bit less direction since there are a variety of ways to retrieve the requested information, but you should have all the component knowledge at this point to work with the API. Yelp, being a fairly general platform, actually has many more businesses than just restaurants, but by using the flexibility of the API we can ask it to only return the restaurants.

## Parameterization and Pagination

Before we can get any reviews on restaurants, we need to actually get the metadata on ALL of the restaurants in Chicago. Notice above that while Yelp told us that there are ~8500, the response contained far fewer actual `Business` objects. This is due to pagination, which is a safeguard against returning __TOO__ much data in a single request (what would happen if there were 100,000 restaurants?). It can be used in conjunction with _rate limiting_ as a way to throttle and protect access to Yelp data.

> As a thought exercise, consider: If an API has 1,000,000 records, but only returns 10 records per page and limits you to 5 requests per second... how long will it take to acquire ALL of the records contained in the API?
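
One way to work through the arithmetic of the thought exercise is sketched below. It assumes the rate limit can be sustained continuously and ignores network latency, so the result is a lower bound.

```python
import math

total_records = 1_000_000
records_per_page = 10
requests_per_second = 5

# Each request returns one page, so count the pages first.
num_requests = math.ceil(total_records / records_per_page)
seconds = num_requests / requests_per_second
hours = seconds / 3600
print(num_requests, seconds, round(hours, 2))
```

That is 100,000 requests taking 20,000 seconds, or roughly five and a half hours.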

One of the ways that APIs are an improvement over plain web scraping is the ability to make __parameterized__ requests. Just like the Python functions you have been writing have arguments (or parameters) that allow you to customize their behavior/actions (an output) without having to rewrite the function entirely, we can parameterize the queries we make to the Yelp API to filter the results it returns.

## Q2: Acquire all of the restaurants in Chicago (on Yelp)

Again using the [API documentation](https://www.yelp.com/developers/documentation/v3/business_search) for the `search` endpoint, fill in the following function to retrieve all of the _Restaurants_ (using categories) for a given query. Again you should use your `read_api_key()` function outside of the `all_restaurants()` stub to read the API Key used for the requests. You will need to account for __pagination__ and __[rate limiting](https://www.yelp.com/developers/faq)__ to:

1. Retrieve all of the Business objects (# of business objects should equal `total` in the response). Paginate by querying 20 restaurants each request.
2. Pause slightly (at least 200 milliseconds) between subsequent requests so as to not overwhelm the API (and get blocked).  
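
The two requirements above can be sketched independently of Yelp. In this minimal example `fetch_page` is a hypothetical stand-in for one paginated API request (it serves made-up records from memory), and `fetch_all` shows the offset-and-pause loop around it.

```python
import time

# Stand-in for an API: 45 fake records served in pages (hypothetical data).
FAKE_RECORDS = [{"id": i} for i in range(45)]

def fetch_page(offset, limit):
    """Hypothetical stand-in for one paginated API request."""
    return {"total": len(FAKE_RECORDS),
            "businesses": FAKE_RECORDS[offset:offset + limit]}

def fetch_all(limit=20, pause=0.2):
    """Collect every record by advancing the offset one page at a time."""
    results = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        results.extend(page["businesses"])
        offset += limit
        # Stop on an empty page or once the offset has passed the reported total.
        if not page["businesses"] or offset >= page["total"]:
            break
        time.sleep(pause)  # wait between requests to respect rate limits
    return results

records = fetch_all()
print(len(records))
```

Checking for an empty page as well as the total keeps the loop from spinning if the reported total is larger than what the API will actually return.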

As always with API access, make sure you follow all of the [API's policies](https://www.yelp.com/developers/api_terms) and use the API responsibly and respectfully.

** DO NOT MAKE TOO MANY REQUESTS TOO QUICKLY OR YOUR KEY MAY BE BLOCKED **

In [8]:
def all_restaurants(api_key, query):
    """
    Retrieve ALL the restaurants on Yelp for a given query.

    Args:
        api_key (string): Yelp API Key
        query (string): Search term (sent as the `location` parameter)

    Returns:
        results (list): list of dicts representing each restaurant
    """
    url = 'https://api.yelp.com/v3/businesses/search'
    header = {'Authorization': 'Bearer %s' % api_key}
    limit = 20  # page size requested by the assignment
    offset = 0
    results = []

    while True:
        url_params = {'categories': 'restaurants', 'location': query,
                      'offset': offset, 'limit': limit}
        response = requests.get(url, headers=header, params=url_params)
        resp_dict = response.json()
        businesses = resp_dict.get('businesses', [])
        results.extend(businesses)

        offset += limit
        # stop on an empty page (or error response) or once every record is fetched
        if not businesses or offset >= resp_dict.get('total', 0):
            break
        time.sleep(0.2)  # pause at least 200 ms between requests

    return results
    
    

You can test your function with an individual neighborhood in Chicago (for example, Greektown). Chicago itself has a lot of restaurants... meaning it will take a lot of time to download them all.

In [9]:
#api_key = read_api_key('api_key.txt')
data = all_restaurants(api_key, 'Greektown, Chicago, IL')
#print(data)
print(len(data))
# 101
print(list(map(lambda x:x['name'], data)))
# ['Greek Islands Restaurant', 'Meli Cafe & Juice Bar', 'Artopolis', 'WJ Noodles', 'Athena Greek Restaurant', ...]

102
['Greek Islands Restaurant', 'Meli Cafe & Juice Bar', 'Artopolis', 'WJ Noodles', 'Athena Greek Restaurant', 'Zeus Restaurant', 'Santorini', 'Green Street Smoked Meats', 'Mr Greek Gyros', "Philly's Best", 'Primos Chicago Pizza Pasta', 'J.P. Graziano Grocery', 'Monteverde', '9 Muses', 'Sizzling Pot King', 'Sepia', 'Spectrum Bar and Grill', 'Green Street Local', 'High Five Ramen', 'Square Roots Kitchen', "Lou Mitchell's Restaurant & Bakery", "Nando's PERi-PERi", 'The Allis', 'Taco Burrito King', 'Dine', 'Jubilee Juice & Grill', 'RM Champagne', "Formento's", 'M2 Cafe', 'Loop Juice', 'Chicken & Farm Shop', 'La Sardine', "Blaze Fast-Fire'd Pizza", 'Parlor Pizza Bar', 'H Mart - Chicago', 'The Madison Bar and Kitchen', "Bombacigno's J & C Inn", 'Blackwood BBQ', 'Morgan Street Cafe', 'El Che Steakhouse & Bar', 'Yolk West Loop', 'DrinkHaus Supper Club', "Vero's Caffe & Gelato", "Giordano's", 'Ciao! Cafe & Wine Lounge', 'Umami Burger - West Loop', 'Sushi Pink', "Nonna's Pizza & Sandwiches", "

Now that we have the metadata on all of the restaurants in Greektown (or at least the ones listed on Yelp), we can retrieve the reviews and ratings. The Yelp API gives us aggregate information on ratings but it doesn't give us the review text or individual users' ratings for a restaurant. For that we need to turn to web scraping, but to find out what pages to scrape we first need to parse our JSON from the API to extract the URLs of the restaurants.

In general, it is a best practice to separate the act of __downloading__ data and __parsing__ data. This ensures that your data processing pipeline is modular and extensible (and autogradable ;). This decoupling also solves the problem of expensive downloading but cheap parsing (in terms of computation and time).
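
A minimal sketch of that separation: one function only saves the raw response to disk, and another only parses what was saved, so the parsing step can be rerun or changed without downloading again. The function names, file name, and the tiny response are all made up for illustration.

```python
import json
import os
import tempfile

def save_raw(response_text, path):
    """Downloading step: persist the raw response exactly as received."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(response_text)

def parse_names(path):
    """Parsing step: read the saved response and extract business names."""
    with open(path, "r", encoding="utf-8") as f:
        return [b["name"] for b in json.load(f)["businesses"]]

# A tiny made-up response standing in for a real API download.
raw = '{"total": 1, "businesses": [{"name": "Example Cafe"}]}'
path = os.path.join(tempfile.mkdtemp(), "page_0.json")
save_raw(raw, path)
names = parse_names(path)
print(names)
```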

## Q 2.5: Parse the API Responses and Extract the URLs

Because we want to separate the __downloading__ from the __parsing__, fill in the following function to parse the URLs pointing to the restaurants on `yelp.com`. As input your function should expect a string of [properly formatted JSON](http://www.json.org/) (which is similar to __BUT__ not the same as a Python dictionary) and as output should return a Python list of strings. Hint: print your `data` to see the JSON-formatted information you have. The input JSON will be structured as follows (same as the [sample](https://www.yelp.com/developers/documentation/v3/business_search) on the Yelp API page):

```json
{
  "total": 8228,
  "businesses": [
    {
      "rating": 4,
      "price": "$",
      "phone": "+14152520800",
      "id": "four-barrel-coffee-san-francisco",
      "is_closed": false,
      "categories": [
        {
          "alias": "coffee",
          "title": "Coffee & Tea"
        }
      ],
      "review_count": 1738,
      "name": "Four Barrel Coffee",
      "url": "https://www.yelp.com/biz/four-barrel-coffee-san-francisco",
      "coordinates": {
        "latitude": 37.7670169511878,
        "longitude": -122.42184275
      },
      "image_url": "http://s3-media2.fl.yelpcdn.com/bphoto/MmgtASP3l_t4tPCL1iAsCg/o.jpg",
      "location": {
        "city": "San Francisco",
        "country": "US",
        "address2": "",
        "address3": "",
        "state": "CA",
        "address1": "375 Valencia St",
        "zip_code": "94103"
      },
      "distance": 1604.23,
      "transactions": ["pickup", "delivery"]
    }
  ],
  "region": {
    "center": {
      "latitude": 37.767413217936834,
      "longitude": -122.42820739746094
    }
  }
}
```
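
To see why a JSON string is similar to, but not the same as, a Python dictionary, compare the literals on each side of a round trip through the `json` module. The small JSON string here is a made-up fragment in the style of the sample above.

```python
import json

json_text = '{"is_closed": false, "price": null, "rating": 4.5}'

as_dict = json.loads(json_text)      # JSON string -> Python dict
print(as_dict)                       # {'is_closed': False, 'price': None, 'rating': 4.5}

back_to_json = json.dumps(as_dict)   # Python dict -> JSON string
print(back_to_json)                  # {"is_closed": false, "price": null, "rating": 4.5}
```

JSON's `false` and `null` become Python's `False` and `None`, and the JSON text itself is just a `str` until it is parsed.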

In [10]:
def parse_api_response(data):
    """
    Parse Yelp API results to extract restaurant URLs.
    
    Args:
        data (string): String of properly formatted JSON.

    Returns:
        (list): list of URLs as strings from the input JSON.
    """
    
    #[YOUR CODE HERE]
    url_list =[]
    dict_data = json.loads(data)
    for business in dict_data['businesses']:
        url_list.append(business['url'])
    return url_list

    
    

In [11]:
data = '''{
"total": 8400,
"businesses": [
{
"id": "qjnpkS8yZO8xcyEIy5OU9A",
"alias": "girl-and-the-goat-chicago",
"name": "Girl & the Goat",
"image_url": "https://s3-media1.fl.yelpcdn.com/bphoto/ya6gjD4BPlxe7AKMj_5WsA/o.jpg",
"is_closed": false,
"url": "https://www.yelp.com/biz/girl-and-the-goat-chicago",
"review_count": 7955,
"categories": [
{
"alias": "newamerican",
"title": "American (New)"
},
{
"alias": "bakeries",
"title": "Bakeries"
},
{
"alias": "coffee",
"title": "Coffee & Tea"
}
],
"rating": 4.5,
"coordinates": {
"latitude": 41.884176,
"longitude": -87.6479440725005
},
"transactions": [],
"price": "$$$",
"location": {
"address1": "809 W Randolph St",
"address2": "",
"address3": "",
"city": "Chicago",
"zip_code": "60607",
"country": "US",
"state": "IL",
"display_address": [
"809 W Randolph St",
"Chicago, IL 60607"
]
},
"phone": "+13124926262",
"display_phone": "(312) 492-6262",
"distance": 3396.525042772812
},
{
"id": "cKZNbMvoqJaUe7n6lf6i7w",
"alias": "wildberry-pancakes-and-cafe-chicago-2",
"name": "Wildberry Pancakes and Cafe",
"image_url": "https://s3-media2.fl.yelpcdn.com/bphoto/9ZnC8R_MgeIKWV-5IedwNg/o.jpg",
"is_closed": false,
"url": "https://www.yelp.com/biz/wildberry-pancakes-and-cafe-chicago-2",
"review_count": 5954,
"categories": [
{
"alias": "pancakes",
"title": "Pancakes"
}
],
"rating": 4.5,
"coordinates": {
"latitude": 41.884668,
"longitude": -87.62288
},
"transactions": [
"pickup"
],
"price": "$$",
"location": {
"address1": "130 E Randolph St",
"address2": "",
"address3": "",
"city": "Chicago",
"zip_code": "60601",
"country": "US",
"state": "IL",
"display_address": [
"130 E Randolph St",
"Chicago, IL 60601"
]
},
"phone": "+13129389777",
"display_phone": "(312) 938-9777",
"distance": 5082.221755403031
}
],
"region": {
"center": {
"latitude": 37.767413217936834,
"longitude": -122.42820739746094
}
}
}'''
print(parse_api_response(data))

['https://www.yelp.com/biz/girl-and-the-goat-chicago', 'https://www.yelp.com/biz/wildberry-pancakes-and-cafe-chicago-2']


As we can see, JSON is quite trivial to parse (which is not the case with HTML as we will see in a second) and work with programmatically. This is why it is one of the most ubiquitous data serialization formats (especially for RESTful APIs) and a huge benefit of working with a well defined API if one exists. But APIs do not always exist or provide the data we might need, and as a last resort we can always scrape web pages...

## Working with Web Pages (and HTML)

Think of APIs as similar to accessing an application's database itself (something you can interactively query and receive structured data back). But the results are usually in a somewhat raw form with no formatting or visual representation (like the results from a database query). This is a benefit _AND_ a drawback depending on the end use case. For data science and _programmatic_ analysis this raw form is quite ideal, but for an end user requesting information from a _graphical interface_ (like a web browser) this is very far from ideal since it takes some cognitive overhead to interpret the raw information. And vice versa, if we have HTML it is quite easy for a human to visually interpret it, but to try to perform some type of programmatic analysis we first need to parse the HTML into a more structured form.
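
As a small illustration of turning HTML into structured data, the sketch below uses Python's built-in `html.parser` module directly (not Beautiful Soup, which the lab uses later) to collect `<meta>` attributes into a dictionary. The class name and the HTML fragment are made up for this example.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect the itemprop/content pairs of every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "itemprop" in attrs:
            self.found[attrs["itemprop"]] = attrs.get("content")

# A made-up HTML fragment, for illustration only.
html = '<div><meta itemprop="ratingValue" content="4.0"><p>Great gyros!</p></div>'
collector = MetaCollector()
collector.feed(html)
print(collector.found)
```

Libraries such as Beautiful Soup wrap this kind of event-driven parsing in a tree you can search, which is usually far more convenient.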

> As a general rule of thumb, if the data you need can be accessed or retrieved in a structured form (either from a bulk download or an API), prefer that first. But if the data you want (and need) is not available that way, as in our case, you need to resort to alternative (messier) means.

Going back to the "hello world" example of Q0 with the NYT, we will do something similar to retrieve the HTML of the Yelp site itself (rather than going through the API) programmatically as text.

## Q3: Parse a Yelp restaurant Page

Using `BeautifulSoup`, parse the HTML of a single Yelp restaurant page to extract the reviews in a structured form as well as the URL to the next page of reviews (or `None` if it is the last page). Fill in the following function stub to parse a single page of reviews and return:
* the reviews as a list of structured Python dictionaries
* the URL for the next page of reviews (or `None`).

For each review be sure to structure your Python dictionary as follows (to be graded correctly). The order of the keys doesn't matter, only the keys and the data type of the values:

```python
{
    'user_id': str,
    'rating': float,
    'date': str,  # formatted as 'yyyy-mm-dd'
    'text': str
}

# Example
{
    'user_id': '6789',
    'rating': 4.7,
    'date': '2016-01-23',
    'text': "Wonderful!"
}
```

There can be issues with Beautiful Soup using various parsers. For maximum compatibility (and fewest errors), initialize the library with the default parser from the Python standard library: `BeautifulSoup(markup, "html.parser")`.

Most of the function has been provided to you:

In [12]:
def parse_page(html):
    """
    Parse the reviews on a single page of a restaurant.
    
    Args:
        html (string): String of HTML corresponding to a Yelp restaurant

    Returns:
        tuple(list, string): a tuple of two elements
            first element: list of dictionaries corresponding to the extracted review information
            second element: URL for the next page of reviews (or None if it is the last page)
    """
    soup = BeautifulSoup(html,'html.parser')
    #print(soup.prettify())
    # the <link rel="next"> element (if present) points to the next page of reviews
    next_link = soup.find('link', rel='next')
    url_next = next_link.get('href') if next_link else None

    reviews = soup.find_all('div', itemprop="review")
    
    reviews_list = []
    for r in reviews:
        
        #[YOUR CODE HERE]
        author = r.find("meta",  itemprop="author")["content"]
        rating = r.find("meta", itemprop="ratingValue")["content"]
        date = r.find("meta", itemprop="datePublished")["content"]
        text = r.p.text
        reviews_list.append({'user_id': str(author), 'rating': float(rating), 'date': str(date), 'text': str(text)})
        
    #print(reviews_list)
    return reviews_list, url_next

## Q 3.5: Extract all of the Yelp reviews for a Single Restaurant

So now that we have parsed a single page, and figured out a method to go from one page to the next we are ready to combine these two techniques and actually crawl through web pages! 

Using `requests`, programmatically retrieve __ALL__ of the reviews for a __single__ restaurant (provided as a parameter). Just like the API was paginated, the HTML paginates its reviews (it would be a very long web page to show 300 reviews on a single page) and to get all the reviews you will need to parse and traverse the HTML. As input your function will receive a URL corresponding to a Yelp restaurant. As output return a list of dictionaries (structured the same as question 3) containing the relevant information from the reviews. You can use `parse_page()` here.
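
The crawl pattern itself, "parse a page, then follow its next link until there is none", can be sketched without any network access. Here `get_page` is a hypothetical stand-in for downloading and parsing one page, and the pages are made-up in-memory data.

```python
# Made-up pages keyed by URL; each holds some items and the URL of the next page.
PAGES = {
    "/reviews?start=0": {"items": ["a", "b"], "next": "/reviews?start=2"},
    "/reviews?start=2": {"items": ["c", "d"], "next": "/reviews?start=4"},
    "/reviews?start=4": {"items": ["e"], "next": None},
}

def get_page(url):
    """Hypothetical stand-in for downloading and parsing one page."""
    page = PAGES[url]
    return page["items"], page["next"]

def crawl(start_url):
    """Follow next-page links until there are none left."""
    collected = []
    url = start_url
    while url is not None:
        items, url = get_page(url)
        collected.extend(items)
    return collected

all_items = crawl("/reviews?start=0")
print(all_items)
```

When crawling a real site, a short pause between requests is a sensible addition, for the same reasons given for the API above.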

In [13]:
def extract_reviews(url):
    """
    Retrieve ALL of the reviews for a single restaurant on Yelp.

    Parameters:
        url (string): Yelp URL corresponding to the restaurant of interest.

    Returns:
        reviews (list): list of dictionaries containing extracted review information
    """

    #[YOUR CODE HERE]
    reviews = []
    # follow the "next page" links until parse_page() reports there are none left
    while url is not None:
        resp = requests.get(url)
        page_reviews, url = parse_page(resp.text)
        reviews.extend(page_reviews)
    return reviews

You can test your function with this code:

In [14]:
data = extract_reviews('https://www.yelp.com/biz/the-jibarito-stop-chicago-2')
print(len(data))
# 212
print(data)
# [{'user_id': 'Omar D.', 'rating': 5.0, 'date': '2018-12-23', 'text': 'We ordered a large tray of arroz con gandules...'}, ...]

218
[{'user_id': 'Ajia A.', 'rating': 5.0, 'date': '2019-02-07', 'text': "I can't believe it took me this long to try this place out.  Authentic Puerto Rican food is hard to come by honestly.   However, I don't really think you will be able to get better, unless you are Puerto Rican and cooking that good ish at home.\n\nThe arroz con gandules was flavorful and authentic, and I also had the half steak Jibarito which was yummy.  Also grab all the sauces for extra goodness...ok I also tried the shrimp empanada which was pretty good as well (but if would go with the jibarito sandwiches over an empanada plate per the menu) .    \n\nDefinitely one of my new go-to spots since I live just around the corner!\n"}, {'user_id': 'Omar D.', 'rating': 5.0, 'date': '2018-12-23', 'text': 'We ordered a large tray of arroz con gandules for our annual holiday party and the reviews by our guests where unanimous...best they have had in Chicago. \n\nWe also go there about once per month for a Jibarito. Final

# Submission

You're almost done! 

After executing all commands and completing this notebook, save your lab2.ipynb as a pdf file and upload it to Gradescope under "Homework 1 - lab 2 (written)". Make sure you check that your pdf file includes all parts of your solution. We recommend using the browser (not jupyter) for saving the pdf. For Chrome on a Mac, this is under File->Print...->Open PDF in Preview and when the PDF opens in Preview you can use Save... to save it. This part will be graded based on completion and it constitutes 30% of HW 1.

Next, you need to copy the functions from Questions 1, 2, 2.5, 3, and 3.5 to a file named *lab2.py*. You also need to execute *get_json.py*, which creates five json files that we will use for the tests. In order to get full points for this homework assignment, you need to pass all test cases that we will run against your *lab2.py* (and not the notebook) on Gradescope. To check whether your code runs locally, run the two tests in *tests_sample* from your command line.

Place your files *lab2.py*, *lab2.ipynb*, and the five json files in a zip file and upload the zip file to Gradescope under "Homework 1 - lab 2 (code)" (**do not place your API key file in this zip file and do not submit it; this is your personal developer key**). This part will be graded based on passing the tests on Gradescope and it constitutes 20% of HW 1.

You can submit it as many times as you would like. We will only consider your last submission. If your last submission is after the deadline, the late homework policy applies.