# Yelp API - Lab



## Introduction 

Now that we've seen how the Yelp API works, it's time to put those API and Pandas skills to work in order to do some basic business analysis! Taking things a step further, you'll also independently explore how to perform pagination in order to retrieve a full results set from the Yelp API!

## Objectives

You will be able to:
* Create HTTP requests to get data from the Yelp API
* Parse HTTP responses and save the information in a CSV
* Perform pagination to retrieve troves of data!
* Write Pandas code to answer questions about your data 

## Problem Introduction

For this lab you will analyze the Yelp data for a group of businesses to learn more about an industry. You will choose a type of business (Italian restaurants, nail salons, CrossFit gyms) and a location to analyze. Then you will get data from the Yelp API, save that data in CSV files, and write Pandas code to answer questions about the data. 


### Process:

1. Read through the data questions and the API documentation to determine which pieces of information you need to pull from the Yelp API.

2. Plan out what tables (DataFrames) you will need: one for the businesses and one for the reviews.

3. Create code to:
  - Perform a search of businesses using pagination
  - Parse the API response for specific data points
  - Save the data you pull as a csv

4. Use the functions above in a loop that will paginate over the results to retrieve all of the results. 

5. Create functions to:
  - Retrieve the reviews data of one business
  - Parse the reviews response for specific review data
  - Save the data you pull as a csv

6. Take all of the business IDs from your business search and, using the three Python functions you've created, loop over the IDs to get the reviews for each business and save them in a CSV.

7. Write Pandas code to answer the following questions about your data.


Bonus Steps:  
- Place your helper functions in a package so that your final notebook only has the major steps listed.
- Rewrite your business search functions to be able to take an argument for the type of business you are searching for.
- Add another group of businesses to your files.


 
## Data Questions:

- Which are the 5 most reviewed businesses?
- What is the highest rating received in your data set and how many businesses have that rating?
- What percentage of businesses have a rating greater than or equal to 4.5?
- What percentage of businesses have a rating less than 3?
- What is the average rating of restaurants that have a price label of one dollar sign? Two dollar signs? Three dollar signs? 
- Return the text of the reviews for the most reviewed restaurant. 
- Return the name of the business with the most recent review. 
- Find the highest rated business and return the text of its most recent review. If multiple businesses have the same rating, select the one with the most reviews. 
- Find the lowest rated business and return the text of its most recent review. If multiple businesses have the same rating, select the one with the fewest reviews. 


## Part 1 - Understanding your data and question

Look at the questions and determine what data you will need to store in your database in order to answer them. Start to think about what tables you will want to create and what columns you will have for those tables. 

Look at the API documentation, and determine which fields of the API response match up with the columns you want in your Pandas DataFrames. 


https://www.yelp.com/developers/documentation/v3/get_started

## Part 2 - Create ETL pipeline for the business data from the API

Now that you know what data you need from the API, you want to write code that will execute an API call, parse those results, and then insert the results into the DB.  

It is helpful to break this up into three different functions (*api call, parse results, and insert into DB*) and then write a function/script that pulls the three functions together. 

Let's first do this for the Business endpoint.

In [50]:

import requests 
import pandas as pd 
import json 
import numpy as np

In [143]:
# Write a function to make a call to the yelp API

urlb =  'https://api.yelp.com/v3/businesses/search'
urlr= 'https://api.yelp.com/v3/businesses/JV5oa5-KGdiWnqrKPoxSug/reviews'

client_id = 'zeqH9imxF5PqY_0x0vOpxQ'
api_key = 'l6DUOcw3_ao0vYc7_9xinG-koULX-eLNq8Y8DP0FhvEd6f3w5iSQM7hSJMiezPVyCGGHExcQASSu7A6mWFSybV1wu8T3A9fOL5iMHUdKwrwYA2G4-ii6vktAEZWPXnYx'

In [3]:
headers = {
        'Authorization': 'Bearer {}'.format(api_key),
    }


In [4]:
term = 'Bagels'
location = 'New York City'


url_params = {
                "term": term.replace(' ', '+'),
                "location": location.replace(' ', '+'),
                "limit": 50
            }
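One detail worth checking in the parameters above: `requests` URL-encodes the `params` dictionary for you, so replacing spaces with `+` by hand is probably unnecessary. The sketch below uses the standard library's `urlencode` to show what happens to each version of the string. The variable names `plain` and `pre_replaced` are made up for this illustration.

```python
from urllib.parse import urlencode

# Plain strings: the encoder turns each space into '+' for us.
plain = urlencode({"term": "Bagels", "location": "New York City", "limit": 50})

# Pre-replaced strings: the literal '+' is itself escaped as '%2B'.
pre_replaced = urlencode({"location": "New York City".replace(" ", "+")})

print(plain)
print(pre_replaced)
```

If the search results look off for multi-word terms or locations, passing the raw strings (with spaces) is a reasonable first thing to try.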


In [7]:
def yelp_call(url_params, api_key):
    # your code to make the yelp call
    headers = {
        'Authorization': 'Bearer {}'.format(api_key),
    }
    response = requests.get(urlb, headers=headers, params=url_params)
    return json.loads(response.text)

Bagel_Businesses=yelp_call(url_params,api_key)

In [17]:
Bagel_Businesses['businesses'][2]

{'id': 'JV5oa5-KGdiWnqrKPoxSug',
 'alias': 'absolute-bagels-new-york',
 'name': 'Absolute Bagels',
 'image_url': 'https://s3-media2.fl.yelpcdn.com/bphoto/GjaYFLA8G7IEFiFLKgL_Ng/o.jpg',
 'is_closed': False,
 'url': 'https://www.yelp.com/biz/absolute-bagels-new-york?adjust_creative=zeqH9imxF5PqY_0x0vOpxQ&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=zeqH9imxF5PqY_0x0vOpxQ',
 'review_count': 1391,
 'categories': [{'alias': 'bakeries', 'title': 'Bakeries'},
  {'alias': 'bagels', 'title': 'Bagels'}],
 'rating': 4.5,
 'coordinates': {'latitude': 40.80251, 'longitude': -73.96745},
 'transactions': [],
 'price': '$',
 'location': {'address1': '2788 Broadway',
  'address2': '',
  'address3': '',
  'city': 'New York',
  'zip_code': '10025',
  'country': 'US',
  'state': 'NY',
  'display_address': ['2788 Broadway', 'New York, NY 10025']},
 'phone': '+12129322052',
 'display_phone': '(212) 932-2052',
 'distance': 11028.952385526974}

In [3]:
# write a function to parse the API response 
# so that you can easily insert the data in to the DB

In [75]:
def parse_results(results):
    # your code to parse the result to make them easier to insert into the DB
    parsed_result =[]
    for business in results:
        if 'price' in business:
            biz_list = [business['id'],business['name'],business['alias'],business['rating'],business['review_count'],len(business['price'])]
            parsed_result.append(biz_list)
        else:
            biz_list = [business['id'],business['name'],business['alias'],business['rating'],business['review_count'],np.nan]
            parsed_result.append(biz_list)
            
    return parsed_result

In [33]:
x=parse_results(Bagel_Businesses['businesses'])
pd.DataFrame(x)

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | j1S3NUrkB3BVT49n_e76NQ | Best Bagel & Coffee | best-bagel-and-coffee-new-york | 4.5 | 3305 | \$ |
| 1 | VrCCr45dhN-RRM107iptdg | Russ & Daughters | russ-and-daughters-new-york | 4.5 | 2210 | \$\$ |
| 2 | JV5oa5-KGdiWnqrKPoxSug | Absolute Bagels | absolute-bagels-new-york | 4.5 | 1391 | \$ |
| 3 | c3eMI4_o4dPDDhPV_ibBYQ | Ess-a-Bagel | ess-a-bagel-new-york | 4.0 | 3581 | \$ |
| 4 | oi39VAwo4-KGm_gSkWPCsQ | Tompkins Square Bagels - Avenue A | tompkins-square-bagels-avenue-a-new-york | 4.0 | 1091 | \$ |
| 5 | foO2N-TrdPBO-dFn6M35TA | Brooklyn Bagel & Coffee Company | brooklyn-bagel-and-coffee-company-new-york-8 | 4.5 | 111 | \$ |
| 6 | Gpc2-sPCXlIQUrkfi4bpzw | Court Street Bagels | court-street-bagels-brooklyn | 4.0 | 201 | \$ |
| 7 | Ys9iYSjuFDpZj7B1X07v5g | Liberty Bagels Midtown | liberty-bagels-midtown-new-york-3 | 4.5 | 482 | \$ |
| 8 | U43uuWwSyH95gXsUhejd4w | La Bagel Delight at Dumbo | la-bagel-delight-at-dumbo-brooklyn | 3.5 | 145 | \$ |
| 9 | lQ7H-COT5duZQQ0XqGFPDg | Smith St Bagels | smith-st-bagels-brooklyn | 4.0 | 373 | \$\$ |
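The parsing step can also return dictionaries instead of positional lists, which keeps the column names attached to the values. This is only a sketch of that alternative: `parse_business` is a hypothetical helper, and the sample records are made up to mimic the shape of the API response.

```python
import math

def parse_business(business):
    """Flatten one business record into a dict of the fields we keep."""
    price = business.get("price")  # the key is missing for some businesses
    return {
        "id": business["id"],
        "name": business["name"],
        "alias": business["alias"],
        "rating": business["rating"],
        "review_count": business["review_count"],
        # Store the price as the number of dollar signs, NaN when absent.
        "price": len(price) if price else math.nan,
    }

# Toy records shaped like the API response (values are made up).
sample = [
    {"id": "a1", "name": "Shop A", "alias": "shop-a", "rating": 4.5,
     "review_count": 10, "price": "$$"},
    {"id": "b2", "name": "Shop B", "alias": "shop-b", "rating": 3.0,
     "review_count": 2},
]
parsed = [parse_business(b) for b in sample]
print(parsed)
```

A list of dictionaries like `parsed` can be passed straight to `pd.DataFrame`, which uses the keys as column names.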


In [21]:
Bagel_Businesses['total']

6600
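With a total this large, it helps to work out in advance how many requests the pagination will need. The sketch below assumes the limits described in the reference section of this lab (50 results per request, 1000 results per search); `page_offsets` is a hypothetical helper name.

```python
def page_offsets(total, limit=50, cap=1000):
    """Return the offset for each request needed to cover the results."""
    reachable = min(total, cap)  # the search stops returning results at `cap`
    return list(range(0, reachable, limit))

offsets = page_offsets(6600)
print(len(offsets), offsets[:3], offsets[-1])
```

For a total of 6600 this gives 20 requests, with offsets running from 0 to 950.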

In [119]:
# Write a function to take your parsed data and insert it into CSV
columns = ['id','name','alias','rating','review_count','price']
df = pd.DataFrame(columns=columns) #blank df
df.to_csv('Bagels_Businesses.csv')

In [101]:
def data_save(parsed_results, csv_filename):
    # your code to save the current results with all of the other results. 
    # I would save the data every time you pull 50 results
    # in case something breaks in the process.
    #reads in blank csv
    existing=pd.read_csv(csv_filename,index_col=0)
    #50 at a time DF
    new = pd.DataFrame(parsed_results,columns=columns)
    df = pd.concat([existing,new])
    df.to_csv(csv_filename)
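Reading the whole file back in on every save works, but it gets slower as the file grows. One alternative is to open the file in append mode and write only the new rows. The following is a minimal sketch using the standard library's `csv` module rather than Pandas; `append_rows`, `COLUMNS`, and the demo file name are assumptions made for this example.

```python
import csv
import os
import tempfile

COLUMNS = ["id", "name", "alias", "rating", "review_count", "price"]

def append_rows(rows, csv_filename, columns=COLUMNS):
    """Append rows to a CSV, writing the header only for a new file."""
    is_new = (not os.path.exists(csv_filename)
              or os.path.getsize(csv_filename) == 0)
    with open(csv_filename, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(columns)
        writer.writerows(rows)

# Usage with made-up rows and a temporary file.
path = os.path.join(tempfile.mkdtemp(), "demo_businesses.csv")
append_rows([["a1", "Shop A", "shop-a", 4.5, 10, 2]], path)
append_rows([["b2", "Shop B", "shop-b", 3.0, 2, 1]], path)

with open(path, newline="", encoding="utf-8") as f:
    saved = list(csv.reader(f))
print(saved)
```

Because no index column is written, a file saved this way can be read with `pd.read_csv` without dropping an `Unnamed: 0` column afterwards.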
     
    
    

In [123]:
# Write a script that combines the three functions above into a single process.

# create a variable  to keep track of which result you are in. 
cur = 0

#set up a while loop to go through and grab the result 
while cur < 1000:
    #set the offset parameter to be where you currently are in the results 
    url_params['offset'] = cur
    #make your API call with the new offset number
    results = yelp_call(url_params, api_key)
    
    #after you get your results you can now use your function to parse those results
    parsed_results = parse_results(results['businesses'])
    
    
    # use your function to insert your parsed results into the db
    data_save(parsed_results, 'Bagels_Businesses.csv')
    #increment the counter by 50 to move on to the next results
    cur += 50
    

In [125]:
dfb=pd.read_csv('Bagels_Businesses.csv') # could have passed index_col=0 instead of dropping the column below



dfb = dfb.drop(columns='Unnamed: 0')
dfb.shape

dfb.head()

| | id | name | alias | rating | review_count | price |
|---|---|---|---|---|---|---|
| 0 | VrCCr45dhN-RRM107iptdg | Russ & Daughters | russ-and-daughters-new-york | 4.5 | 2210 | 2.0 |
| 1 | j1S3NUrkB3BVT49n_e76NQ | Best Bagel & Coffee | best-bagel-and-coffee-new-york | 4.5 | 3305 | 1.0 |
| 2 | JV5oa5-KGdiWnqrKPoxSug | Absolute Bagels | absolute-bagels-new-york | 4.5 | 1391 | 1.0 |
| 3 | c3eMI4_o4dPDDhPV_ibBYQ | Ess-a-Bagel | ess-a-bagel-new-york | 4.0 | 3581 | 1.0 |
| 4 | oi39VAwo4-KGm_gSkWPCsQ | Tompkins Square Bagels - Avenue A | tompkins-square-bagels-avenue-a-new-york | 4.0 | 1091 | 1.0 |


## Part 4 -  Create ETL pipeline for the restaurant review data from the API

You've done this for the Businesses, now you need to do this for reviews. You will follow the same process, but your functions will be specific to reviews.

In [202]:
# collect every business id from the businesses DataFrame
business_ids = dfb['id'].tolist()

len(business_ids)

1000

In [167]:
client_id2 = '-uqWDK-AroyJM7jqJ5zP8Q'
api_key2 = 'DyHgCLzELZZrIeI5pbvaXjz97CihKM-_aUVH3Ga8EYeTMLFmV59fm4VEO77af57qXjwGAjxIfYCODheuvhrw67YBnDwVI9kqWHs62sjL5HGzwXESA-HmL7twoZaPXnYx'

In [224]:
# write Pandas code to pull back all of the business ids 
# you will need these ids to pull back the reviews for each restaurant
urlr= 'https://api.yelp.com/v3/businesses/JV5oa5-KGdiWnqrKPoxSug/reviews'

bagel_reviews=[]
def yelp_call_reviews(biz_id,url_params, api_key):
    #for x in business_ids[0:5]:
        
    # your code to make the yelp call
    headers = {
        'Authorization': 'Bearer {}'.format(api_key),
    }
    response = requests.get('https://api.yelp.com/v3/businesses/{}/reviews'.format(biz_id), headers=headers, params=url_params)
    #bagel_reviews.append(json.loads(response.text))
    
    return json.loads(response.text)
        



In [228]:
A = yelp_call_reviews(business_ids[0],url_params, api_key)

A['reviews']

[{'id': 'kzSOeEnWcFuhdhlv9lAmVA',
  'url': 'https://www.yelp.com/biz/russ-and-daughters-new-york?adjust_creative=zeqH9imxF5PqY_0x0vOpxQ&hrid=kzSOeEnWcFuhdhlv9lAmVA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_reviews&utm_source=zeqH9imxF5PqY_0x0vOpxQ',
  'text': 'The minute we flew into New York, we dropped our luggage off and headed off to Russ and Daughters for breakfast. I was truly excited to have my first meal...',
  'rating': 5,
  'time_created': '2020-03-14 21:30:40',
  'user': {'id': 'rPbbZQHAl2C2oqg0mGCujw',
   'profile_url': 'https://www.yelp.com/user_details?userid=rPbbZQHAl2C2oqg0mGCujw',
   'image_url': 'https://s3-media3.fl.yelpcdn.com/photo/KBNji-8Pc5i2iOpEuMYg3w/o.jpg',
   'name': 'Joy A.'}},
 {'id': 'z0goRuyMaIOPF3BDXIzoXQ',
  'url': 'https://www.yelp.com/biz/russ-and-daughters-new-york?adjust_creative=zeqH9imxF5PqY_0x0vOpxQ&hrid=z0goRuyMaIOPF3BDXIzoXQ&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_reviews&utm_source=zeqH9imxF5PqY_0x0vOpxQ',
  'text': "Pret

In [267]:
def parse_results_reviews(biz_id, results):
    # your code to parse the result to make them easier to insert into the DB
    parsed_result =[]
    for review in results:
            review_list = [biz_id,review['text'],review['time_created']]
            parsed_result.append(review_list)
    return parsed_result

In [268]:
parse_results_reviews(business_ids[0], A['reviews'])

[['VrCCr45dhN-RRM107iptdg',
  'The minute we flew into New York, we dropped our luggage off and headed off to Russ and Daughters for breakfast. I was truly excited to have my first meal...',
  '2020-03-14 21:30:40'],
 ['VrCCr45dhN-RRM107iptdg',
  "Pretty tasty stuff here! Loved the pastrami cured salmon and the caviar cream cheese.\n\nLatkes are just ok to me but that's probably because they come in a...",
  '2020-02-27 13:38:29'],
 ['VrCCr45dhN-RRM107iptdg',
  'Great place. The decor is a throwback to when the diner was founded in the early 1900s. Beautifully yet simply decorated with interesting lighting fixtures....',
  '2020-02-25 03:13:45']]

In [274]:
column_reviews = ['id','text','time_created']
dfr = pd.DataFrame(columns=column_reviews) #blank df
dfr.to_csv('bagel_reviews.csv')

In [269]:
def data_save_reviews(parsed_results, csv_filename):
    # your code to save the current results with all of the other results. 
    # I would save the data every time you pull 50 results
    # in case something breaks in the process.
    existing=pd.read_csv(csv_filename,index_col=0)
    new = pd.DataFrame(parsed_results,columns=column_reviews)
    df = pd.concat([existing,new])
    # reset_index returns a new DataFrame, so assign the result
    df = df.reset_index(drop=True)
    df.to_csv(csv_filename)
     
    

In [275]:
# Write a script that combines the three functions above into a single process.

# loop over every business ID and pull its reviews
for biz_id in business_ids:
    # make the API call for this business
    results = yelp_call_reviews(biz_id,url_params, api_key)
    # skip responses that have no 'reviews' key (for example, an error response)
    if 'reviews' in results:
        # parse the reviews for this business
        parsed_results = parse_results_reviews(biz_id, results['reviews'])
        # append the parsed reviews to the CSV
        data_save_reviews(parsed_results, 'bagel_reviews.csv')


In [280]:
dfr=pd.read_csv('bagel_reviews.csv')

dfr=dfr.drop(columns=['Unnamed: 0'])


In [286]:
dfr.head()

reviews_and_rating=pd.merge(dfr,dfb,on='id')

reviews_and_rating.head()

| | id | text | time_created | name | alias | rating | review_count | price |
|---|---|---|---|---|---|---|---|---|
| 0 | VrCCr45dhN-RRM107iptdg | The minute we flew into New York, we dropped o... | 2020-03-14 21:30:40 | Russ & Daughters | russ-and-daughters-new-york | 4.5 | 2210 | 2.0 |
| 1 | VrCCr45dhN-RRM107iptdg | Pretty tasty stuff here! Loved the pastrami cu... | 2020-02-27 13:38:29 | Russ & Daughters | russ-and-daughters-new-york | 4.5 | 2210 | 2.0 |
| 2 | VrCCr45dhN-RRM107iptdg | Great place. The decor is a throwback to when ... | 2020-02-25 03:13:45 | Russ & Daughters | russ-and-daughters-new-york | 4.5 | 2210 | 2.0 |
| 3 | j1S3NUrkB3BVT49n_e76NQ | Like the name implies.... BEST BAGEL & Coffee ... | 2020-03-24 21:10:24 | Best Bagel & Coffee | best-bagel-and-coffee-new-york | 4.5 | 3305 | 1.0 |
| 4 | j1S3NUrkB3BVT49n_e76NQ | Definitely best bagel, however, the coffee is ... | 2020-03-17 07:57:35 | Best Bagel & Coffee | best-bagel-and-coffee-new-york | 4.5 | 3305 | 1.0 |
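When merging reviews onto businesses, it is worth confirming that each review matches exactly one business and checking which businesses ended up with no reviews. The sketch below does this on toy data; the `businesses` and `reviews` frames are made-up stand-ins for `dfb` and `dfr`.

```python
import pandas as pd

# Toy stand-ins for dfb (businesses) and dfr (reviews); values are made up.
businesses = pd.DataFrame({"id": ["a1", "b2", "c3"],
                           "name": ["Shop A", "Shop B", "Shop C"],
                           "rating": [4.5, 3.0, 5.0]})
reviews = pd.DataFrame({"id": ["a1", "a1", "b2"],
                        "text": ["Great", "Good", "Okay"]})

# Each review should match exactly one business; validate raises otherwise.
merged = pd.merge(reviews, businesses, on="id", how="inner",
                  validate="many_to_one")

# Businesses with no reviews do not appear in an inner merge.
missing = set(businesses["id"]) - set(merged["id"])
print(merged)
print(missing)
```

If `validate="many_to_one"` raises on your real data, the businesses file probably contains repeated IDs, and dropping duplicates before merging is one way to handle that.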


In [None]:
# write a function to insert the parsed data into the reviews table

In [None]:
# combine the functions above into a single script  

## Part 5 -  Write Pandas code that will answer the questions posed. 

Now that your data is saved in CSVs, you can answer the questions. 

In [129]:
#1 

dfb.sort_values(by='review_count',ascending=False).head(5)

| | id | name | alias | rating | review_count | price |
|---|---|---|---|---|---|---|
| 398 | V7lXZKBDzScDeGB8JmnzSA | Katz's Delicatessen | katzs-delicatessen-new-york | 4.0 | 12206 | 2.0 |
| 109 | H4jJ7XB3CetIr1pg56CczQ | Levain Bakery | levain-bakery-new-york | 4.5 | 8029 | 2.0 |
| 720 | nU4XBdvxDABXqZ6CnB8Dig | Clinton Street Baking Company | clinton-street-baking-company-new-york-5 | 4.0 | 5064 | 2.0 |
| 216 | WHRHK3S1mQc3PmhwsGRvbw | Bibble & Sip | bibble-and-sip-new-york-2 | 4.5 | 4780 | 1.0 |
| 794 | U5hCNNyJmb7f3dmC1HTzSQ | Junior's Restaurant & Bakery - 45th St. | juniors-restaurant-and-bakery-45th-st-new-york | 4.0 | 4619 | 2.0 |


In [133]:
#2 = 57
dfb.sort_values(by='rating',ascending=False)

dfb.loc[dfb['rating'] == 5.0].shape

(57, 6)
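The cell above hard-codes 5.0 as the top rating. A small variation looks the maximum up first, so the same code still works if no business in the data reaches 5.0. This is a sketch on made-up ratings, not the lab's data.

```python
import pandas as pd

# Made-up ratings standing in for dfb.
ratings = pd.DataFrame({"name": ["A", "B", "C", "D"],
                        "rating": [5.0, 4.5, 5.0, 3.0]})

top_rating = ratings["rating"].max()
# A boolean Series sums to the number of True values.
n_top = int((ratings["rating"] == top_rating).sum())
print(top_rating, n_top)
```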

In [136]:
#3 - 283 have ratings >= 4.5 - 28.30%
dfb.loc[dfb['rating'] >= 4.5].shape

answer= (283/1000)*100

answer 

28.299999999999997

In [139]:
#4 - 5.4%

dfb.loc[dfb['rating'] < 3].shape

answer2=(54/1000)*100

answer2

5.4
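Both percentages above are computed by reading a count off `.shape` and typing it into a formula. An alternative that avoids the manual step is to take the mean of the boolean mask, since `True` counts as 1 and `False` as 0. The sketch below uses a made-up Series of ratings.

```python
import pandas as pd

# Made-up ratings standing in for dfb['rating'].
ratings = pd.Series([5.0, 4.5, 4.0, 2.5, 3.0, 4.5, 1.0, 4.0])

# The mean of a boolean mask is the fraction of rows where it is True.
pct_high = (ratings >= 4.5).mean() * 100
pct_low = (ratings < 3).mean() * 100
print(pct_high, pct_low)
```

This also keeps the denominator correct if the number of rows is not exactly 1000.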

In [183]:
#5
dfb.groupby('price')['rating'].agg('mean')

price
1.0    3.803390
2.0    3.874477
3.0    3.625000
4.0    2.000000
Name: rating, dtype: float64

In [298]:
#6 
#find business with most reviews
#take id of most reviewed business 
#use id to pull from reviews DF
#return text of 3 reviews

top_review = reviews_and_rating.sort_values(by='review_count',ascending=False).head(1)

top_review['text']

1183    One of my favorite places. \nThis was my third...
Name: text, dtype: object
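Note that `head(1)` above returns a single row, so only one review text comes back. To return every stored review for the most reviewed business, one option is to find that business's ID first and then filter the reviews by it. The frames below are made-up stand-ins for `dfb` and `dfr`.

```python
import pandas as pd

# Made-up stand-ins for the businesses and reviews DataFrames.
businesses = pd.DataFrame({"id": ["a1", "b2"],
                           "name": ["Shop A", "Shop B"],
                           "review_count": [120, 45]})
reviews = pd.DataFrame({"id": ["a1", "b2", "a1"],
                        "text": ["First", "Other", "Second"]})

# ID of the business with the largest review_count.
top_id = businesses.loc[businesses["review_count"].idxmax(), "id"]
# All review texts stored for that business.
top_texts = reviews.loc[reviews["id"] == top_id, "text"].tolist()
print(top_id, top_texts)
```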

In [301]:
#7 
#for reviews df, convert date to date/timestamp
#sort for most recent timestamp
#match id in business df

reviews_and_rating.sort_values(by='time_created',ascending=False).head(1)['name']

2127    Long Island Bagel Cafe - Long Beach
Name: name, dtype: object
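Sorting `time_created` as text happens to work here because the timestamps share a fixed `YYYY-MM-DD HH:MM:SS` layout. Converting the column to real datetimes, as the comments in the cell suggest, makes the comparison explicit. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows standing in for the merged reviews DataFrame.
reviews = pd.DataFrame({
    "name": ["Shop A", "Shop B", "Shop A"],
    "time_created": ["2020-03-14 21:30:40",
                     "2020-03-24 21:10:24",
                     "2020-02-27 13:38:29"],
})

# Convert the strings to datetimes, then pick the row with the latest one.
reviews["time_created"] = pd.to_datetime(reviews["time_created"])
latest_name = reviews.loc[reviews["time_created"].idxmax(), "name"]
print(latest_name)
```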

In [319]:
#8

highest_rated=reviews_and_rating.loc[reviews_and_rating['rating'] == 5.0].sort_values(by='review_count',ascending=False)
b=highest_rated.loc[highest_rated['name']=='Lella Alimentari'].sort_values(by='time_created',ascending=False).head(1)
for x in b['text']:
    print(x)

This is by far my favorite neighborhood coffee shop. When I first moved to NYC and didn't have a job, I would go there for hours and drink coffee while...


In [348]:
#pd.options.display.max_colwidth = 2000
pd.set_option('display.width', 1000)

In [349]:
#9

a=reviews_and_rating.sort_values(by='rating').head().sort_values(by= 'time_created',ascending=False).head(1)

a['text']

965    Terrible service as well as overpriced. My partner & I (Dept of Sanitation workers) ordered sandwiches DURING the Covid-19 pandemic no less & after a 10...
Name: text, dtype: object
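The last two questions include a tie-break rule (most reviews for the highest rated, fewest reviews for the lowest rated). One way to apply it without typing a business name by hand is to sort on several columns at once. The following is a sketch on made-up data; `latest_review_text` is a hypothetical helper, and it assumes `time_created` has already been converted to datetimes.

```python
import pandas as pd

# Made-up merged data: one row per review, with business columns repeated.
merged = pd.DataFrame({
    "name": ["A", "A", "B", "C", "C", "D"],
    "rating": [5.0, 5.0, 5.0, 1.0, 1.0, 1.0],
    "review_count": [10, 10, 40, 7, 7, 30],
    "time_created": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-02-01",
                                    "2020-01-15", "2020-02-15", "2020-03-20"]),
    "text": ["A old", "A new", "B only", "C old", "C new", "D only"],
})

def latest_review_text(df, highest=True):
    """Return the newest review text of the best (or worst) rated business.

    highest=True: top rating, ties broken by most reviews.
    highest=False: bottom rating, ties broken by fewest reviews.
    """
    ordered = df.sort_values(
        by=["rating", "review_count", "time_created"],
        ascending=[not highest, not highest, False],
    )
    return ordered.iloc[0]["text"]

best_text = latest_review_text(merged, highest=True)
worst_text = latest_review_text(merged, highest=False)
print(best_text, worst_text)
```

If two businesses tie on both rating and review count, this sketch simply takes the newest review among them, so you may want an extra tie-break for that case.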

# Extra Reference help

###  Pagination

Returning to the Yelp API, the [documentation](https://www.yelp.com/developers/documentation/v3/business_search) also provides details regarding the API limits. These often include the number of requests a user is allowed to make within a specified time limit and the maximum number of results that can be returned. In this case, each request returns at most 50 results and defaults to 20. Furthermore, any search is limited to a total of 1000 results. To retrieve all 1000 of these results, we have to page through them piece by piece, retrieving 50 at a time. Processes such as these are often referred to as pagination.

Now that you have an initial response, you can examine the contents of the JSON response. For example, you might start with `response.json().keys()`. Here, you'll see a key for `'total'`, which tells you the full number of matching results given your query parameters. Write a loop (or ideally a function) which then makes successive API calls using the `offset` parameter to retrieve all of the results (or the first 1000 for a particularly large result set) for the original query. As you do this, be mindful of how you store the data. 
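The loop described above can be prototyped without touching the network by passing the page-fetching step in as a function. In this sketch, `fetch_all` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for a real API call that returns a dictionary with `'businesses'` and `'total'` keys.

```python
def fetch_all(fetch_page, limit=50, cap=1000):
    """Collect results page by page until the total or the cap is reached.

    `fetch_page(offset, limit)` is a hypothetical callable returning a dict
    with 'businesses' and 'total' keys, like the search response.
    """
    collected = []
    offset = 0
    total = None  # unknown until the first response arrives
    while total is None or offset < min(total, cap):
        page = fetch_page(offset, limit)
        total = page["total"]
        if not page["businesses"]:
            break  # nothing more to read, stop early
        collected.extend(page["businesses"])
        offset += limit
    return collected

# Fake fetcher standing in for the real API call (no network needed).
FAKE_DATA = [{"id": i} for i in range(120)]

def fake_fetch(offset, limit):
    return {"businesses": FAKE_DATA[offset:offset + limit],
            "total": len(FAKE_DATA)}

all_results = fetch_all(fake_fetch)
print(len(all_results))
```

Once the loop behaves as expected on fake data, the fake fetcher can be swapped for a small wrapper around your own API call.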

**Note: be mindful of the API rate limits. You can only make 5000 requests per day, and a loop can easily send requests too quickly. Start prototyping small before running a loop that could be faulty. You can also use `time.sleep(n)` to add delays. For more details see https://www.yelp.com/developers/documentation/v3/rate_limiting.**
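A simple way to add the delays mentioned in the note is to pause between consecutive calls. The sketch below is one possible shape for that: `call_with_pause` is a hypothetical helper, the half-second pause is an arbitrary choice, and the `sleep` argument exists only so the demonstration can record the pauses instead of actually waiting.

```python
import time

def call_with_pause(func, args_list, pause=0.5, sleep=time.sleep):
    """Call func once per argument, pausing between consecutive calls."""
    results = []
    for i, arg in enumerate(args_list):
        if i > 0:
            sleep(pause)  # wait before every call except the first
        results.append(func(arg))
    return results

# Demonstration with a stand-in function and a recorded (not real) sleep.
pauses = []
squares = call_with_pause(lambda x: x * x, [1, 2, 3],
                          pause=0.5, sleep=pauses.append)
print(squares, pauses)
```

In real use you would leave `sleep` at its default and pass your own request function and list of IDs or offsets.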

***Below is sample code that you can use to help you deal with the pagination parameter and bring all of the functions together.***


***Also, something might cause your code to break while it is running. You don't want to repeatedly re-pull the same data when this happens, so you should save the data as you call and parse it, not after you have all of the data.***


In [None]:
# create a variable to keep track of which result you are on
cur = 0
# num is the total number of matches reported by the API ('total' key)
num = yelp_call(url_params, api_key)['total']

# set up a while loop to go through and grab the results
while cur < num and cur < 1000:
    # set the offset parameter to be where you currently are in the results
    url_params['offset'] = cur
    # make your API call with the new offset number
    results = yelp_call(url_params, api_key)

    # after you get your results, use your function to parse them
    parsed_results = parse_results(results['businesses'])

    # use your function to save your parsed results
    data_save(parsed_results, 'Bagels_Businesses.csv')
    # increment the counter by 50 to move on to the next results
    cur += 50