# Getting Your Data From Yelp!

To make sure you are on track to complete the project, you will complete this workbook first. Below are the steps you need to take to make sure you have your data from Yelp and are ready to analyze it. Your cohort lead will review this workbook with you the Wednesday before your project is due.

# Part 1 - Understanding your data and question

You will be pulling data from the Yelp API to complete your analysis. The API, however, provides a lot of information that will not be pertinent to your analysis. You will pull data from the API and parse it to keep only the data you need. To help you identify that information, look at the API documentation and understand what data the API will provide.

Identify which data fields you will want to keep for your analysis. 

https://www.yelp.com/developers/documentation/v3/get_started
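
One way to record that decision is as an explicit list of field names that later code applies to every business record. The sketch below is only illustrative: `FIELDS_TO_KEEP`, `keep_fields`, and the sample record are hypothetical names and made-up data, not part of the Yelp API.

```python
# Hypothetical list of the fields chosen after reading the documentation.
FIELDS_TO_KEEP = ["id", "name", "rating", "review_count", "price"]


def keep_fields(business: dict, fields=FIELDS_TO_KEEP) -> dict:
    """Return a copy of `business` holding only the chosen fields.

    Missing fields are stored as None so every record has the same keys.
    """
    return {field: business.get(field) for field in fields}


# Made-up record for illustration; a real response carries many more keys.
sample = {"id": "abc123", "name": "Example Hotel", "rating": 4.5,
          "review_count": 12, "phone": "+000000000"}
trimmed = keep_fields(sample)
print(trimmed)
```

Keeping the list in one place means the parsing code and any later database schema can both be derived from it.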

# Part 2 - Create ETL pipeline for the business data from the API

## Details

Now that you know what data you need from the API, you want to write code that will execute an API call, parse those results and then insert the results into the DB.  

It is helpful to break this up into three different functions (*API call, parse results, and insert into DB*) and then write a function/script that pulls the three functions together.
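
The three-function structure described above can be sketched as a small driver that receives each step as a callable. Everything here is a stand-in chosen for illustration: `run_pipeline`, `fake_fetch`, `simple_parse`, and `list_insert` are hypothetical names, the "API" returns made-up data, and the "database" is a plain list.

```python
from typing import Callable, Iterable


def run_pipeline(fetch: Callable[[], dict],
                 parse: Callable[[dict], Iterable[dict]],
                 insert: Callable[[Iterable[dict]], int]) -> int:
    """Call the API, parse the response, store the rows; return rows stored."""
    raw = fetch()
    rows = parse(raw)
    return insert(rows)


# Stand-ins so the sketch runs without network access or a database.
def fake_fetch() -> dict:
    return {"businesses": [{"id": "a1", "name": "Example One"},
                           {"id": "b2", "name": "Example Two"}]}


def simple_parse(raw: dict) -> list:
    return [{"id": b["id"], "name": b["name"]} for b in raw.get("businesses", [])]


stored_rows = []


def list_insert(rows) -> int:
    rows = list(rows)
    stored_rows.extend(rows)
    return len(rows)


count = run_pipeline(fake_fetch, simple_parse, list_insert)
print(count, stored_rows)
```

Because each step is passed in, any one of them can be replaced (for example, swapping the list for a real database insert) without touching the other two.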

Let's first do this for the Business endpoint.

## Request

### Imports and Setup

In [35]:
import requests
import pandas as pd
import json
import csv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [36]:
with open('.secrets.json') as f:
    keys = json.load(f)

client_id = keys['Client_ID']
yelp_key = keys['api_key']

### ƒ: yelp_request

 - Params: search term (e.g. "wineries"); location; `yelp_key` variable (from Imports and Setup); and a `verbose` flag that controls whether response details are printed

In [37]:
# def yelp_request(term, location, yelp_key, verbose=True):
#     '''Adapted from Yelp API Lab: https://github.com/BenJMcCarty/dsc-yelp-api-lab/tree/solution'''
    
#     url = 'https://api.yelp.com/v3/businesses/search'

#     headers = {
#             'Authorization': 'Bearer {}'.format(yelp_key),
#         }

#     url_params = {
#                     'term': term.replace(' ', '+'),
#                     'location': location.replace(' ', '+'),
#                     'limit': 50
#                 }
#     response = requests.get(url, headers=headers, params=url_params)
    
#     if verbose == True:
#         print(response)
#         print(type(response.text))
#         print(response.text[:1000])
        
#     return response.json()

In [38]:
## Adapted from code generated by ChatGPT - revised original function.

def yelp_request(term: str, location: str, yelp_key: str, verbose: bool = True) -> dict:
    """
    Make a request to the Yelp API to search for businesses.

    Parameters:
    - term (str): The search term to query.
    - location (str): The location to search within.
    - yelp_key (str): Your Yelp API key.
    - verbose (bool): Whether to print verbose output. Default is True.

    Returns:
    - dict: A dictionary containing the JSON response from the Yelp API.
    """

    url = 'https://api.yelp.com/v3/businesses/search'

    headers = {
        'Authorization': f'Bearer {yelp_key}',
    }

    # requests URL-encodes parameter values itself, so pass them unmodified
    url_params = {
        'term': term,
        'location': location,
        'limit': 50
    }

    try:
        response = requests.get(url, headers=headers, params=url_params)
        response.raise_for_status()  # Raise an exception for 4XX or 5XX status codes
        if verbose:
            print(response)
            print(type(response.text))
            print(response.text[:1000])
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return {}
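
When a dictionary is passed as `params`, the query string is built with standard URL encoding. The sketch below uses only the standard library's `urlencode` as an approximation of that step (it is not a call into `requests` itself) to show how spaces and literal `+` characters are treated.

```python
from urllib.parse import urlencode

# Spaces are encoded automatically when the query string is built.
plain = urlencode({"term": "wine bar", "location": "San Diego"})
print(plain)

# A "+" that is already in the value is treated as a literal plus sign
# and escaped as %2B, so the server receives "wine+bar", not "wine bar".
pre_replaced = urlencode({"term": "wine+bar"})
print(pre_replaced)
```

In other words, search terms and locations can be passed exactly as typed.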


### Sending the request


- Uncomment the lines in the next cell to send the request

In [39]:
# response = yelp_request('winery','Southern California', yelp_key)
# response.keys()

### Identifying and Exploring Keys

In [40]:
# # Identify keys

# print(response.keys())

#### Exploring the "Businesses" Key

In [41]:
# response['businesses']

In [42]:
# # Show first item w/in list of businesses

# response['businesses'][0]

In [43]:
# response['businesses'][0]['categories'][0]['alias']

In [44]:
# response['businesses'][0]['categories'][0]['title']

#### Exploring the "Total" Key

In [45]:
# response['total']

# # How many businesses are there in total for my request?

#### Exploring the "Region" Key

In [46]:
# response['region']

# # From which geographical area will my results come?

## Parse

### ƒ: parse_data

In [47]:
# def parse_data(list_of_data):
#     '''Adapted from Tyrell's code'''  

#     # Create empty list to store results
    
#     parsed_data = []
    
#     # Loop through each business in the list of businesses
#     # Add specific k:v pairs to a dictionary
    
#     for business in list_of_data:
#         if 'price' not in business:
#             business['price'] = np.nan
            
#             # Verify that the "price" key is in the selected business dict
            
#         details = {'name': business['name'],
#                      'location': ' '.join(business['location']['display_address']),
#                      'Business ID': business['id'],
#                      'alias': business['categories'][0]['alias'],
#                      'title': business['categories'][0]['title'],
#                      'rating': business['rating'],
#                      'review_count': business['review_count'],
#                      'price': business['price'],
#                      'latitude': business['coordinates']['latitude'],
#                      'longitude': business['coordinates']['longitude']
#                     }
#         # Add the new dictionary to the previous list
        
#         parsed_data.append(details)
    
#     # Create a DataFrame from the resulting list
    
#     df_parsed_data = pd.DataFrame(parsed_data)

    
#     return df_parsed_data

In [48]:
## Adapted from code generated by ChatGPT - revised original function.

def parse_data(list_of_data: list) -> pd.DataFrame:
    """
    Parse a list of business data into a DataFrame.

    Parameters:
    - list_of_data (list): A list containing dictionaries of business data.

    Returns:
    - pd.DataFrame: A DataFrame containing parsed business data.
    """

    parsed_data = []

    for business in list_of_data:
        # Handle missing or nested keys gracefully
        price = business.get('price', np.nan)

        # Use the first listed category, if the business has any
        if 'categories' in business and business['categories']:
            category = business['categories'][0]
            alias = category.get('alias', '')
            title = category.get('title', '')
        else:
            alias = ''
            title = ''

        details = {
            'name': business.get('name', ''),
            'location': ' '.join(business['location']['display_address']) if 'location' in business else '',
            'Business ID': business.get('id', ''),
            'alias': alias,
            'title': title,
            'rating': business.get('rating', np.nan),
            'review_count': business.get('review_count', np.nan),
            'price': price,
            'latitude': business['coordinates']['latitude'] if 'coordinates' in business else np.nan,
            'longitude': business['coordinates']['longitude'] if 'coordinates' in business else np.nan
        }
        parsed_data.append(details)

    df_parsed_data = pd.DataFrame(parsed_data)

    return df_parsed_data
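
The nested lookups in `parse_data` (for example `business['coordinates']['latitude']`) can also be written with a small helper that walks a chain of keys and falls back to a default. This is a sketch only: `dig` is a hypothetical helper, not part of pandas or the Yelp API, and the sample record is made up.

```python
def dig(record: dict, *keys, default=None):
    """Follow `keys` through nested dicts, returning `default` if any step is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


business = {"name": "Example Hotel", "coordinates": {"latitude": 38.7}}
lat = dig(business, "coordinates", "latitude")
lon = dig(business, "coordinates", "longitude", default=float("nan"))
city = dig(business, "location", "city", default="")
print(lat, lon, city)
```

Unlike checking only for the outer key, this also covers the case where `coordinates` exists but one of its inner keys does not.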


## Updating Requests for Pagination

### ƒ: yelp_request_offset

In [49]:
# def yelp_request_offset(term, location, yelp_key, offset=0, verbose=False):
#     '''Adapted from Yelp API Lab: https://github.com/BenJMcCarty/dsc-yelp-api-lab/tree/solution'''
    
#     url = 'https://api.yelp.com/v3/businesses/search'

#     headers = {
#             'Authorization': 'Bearer {}'.format(yelp_key),
#         }

#     url_params = {
#                     'term': term.replace(' ', '+'),
#                     'location': location.replace(' ', '+'),
#                     'limit': 50,
#                     'offset': offset
#                         }
    
#     response = requests.get(url, headers=headers, params=url_params)
    
#     if verbose == True:
#         print(response)
#         print(type(response.text))
#         print(response.text[:1000])
        
#     return response.json()

In [50]:
## Adapted from code generated by ChatGPT - revised original function.

def yelp_request_offset(term: str, location: str, yelp_key: str, offset: int = 0, verbose: bool = False) -> dict:
    """
    Make a request to the Yelp API to search for businesses with an offset.

    Parameters:
    - term (str): The search term to query.
    - location (str): The location to search within.
    - yelp_key (str): Your Yelp API key.
    - offset (int): The offset for paginating results. Default is 0.
    - verbose (bool): Whether to print verbose output. Default is False.

    Returns:
    - dict: A dictionary containing the JSON response from the Yelp API.
    """

    url = 'https://api.yelp.com/v3/businesses/search'

    headers = {
        'Authorization': f'Bearer {yelp_key}',
    }

    # requests URL-encodes parameter values itself, so pass them unmodified
    url_params = {
        'term': term,
        'location': location,
        'limit': 50,
        'offset': offset
    }

    try:
        response = requests.get(url, headers=headers, params=url_params)
        response.raise_for_status()  # Raise an exception for 4XX or 5XX status codes
        if verbose:
            print(response)
            print(type(response.text))
            print(response.text[:1000])
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return {}

#### Test 1

In [51]:
# test1 = yelp_request_offset('winery', 'San Diego', yelp_key)
# test1

In [52]:
# test1.keys()

In [53]:
# test1['total']

In [54]:
# test1['businesses'][0]

# ƒ: GET BUSINESSES (ALL)

In [55]:
# def get_full_data(term, location, yelp_key, file_name = 'data/wineries_raw.csv'):
#     '''Requests all results from Yelp API; saves as a .csv; and returns a DataFrame.'''

#     # Create a .csv to store results
#     blank_df = pd.DataFrame()
#     blank_df.to_csv(file_name)
    
#     # Process first request to Yelp API and calculate number of pages 
#     results = yelp_request_offset(term, location, yelp_key, offset=0, 
#                                   verbose=False)
#     num_pages = results['total']//50+1
    
#     # Print out confirmation feedback
#     print(f'For {term} and {location}: ')
#     print(f"    Total number of results: {results['total']}.")
#     print(f'    Total number of pages: {num_pages}.')
    
#     # Create offset for additional results
#     offset = 0

#     # Retrieves remaining pages
#     for num in range(num_pages-1):
#         try:
#             # Process API request
#             results = yelp_request_offset(term, location, yelp_key,
#                                           offset=offset, verbose=False)
            
#             # From results, take values from "Businesses" key and save
#             parsed_results = parse_data(results['businesses'])
          
#             # Save resulting DF to .csv from top
#             parsed_results.to_csv(file_name, mode='a', index = False)
            
#             # Increase offset to move to next "page" of data
#             offset += 50
            
#         except:
#             # If error, print where the error happens
#             print(f'Error on page {num}.')
#             # Then save the results so far to the .csv
#             parsed_results.to_csv(file_name, mode='a', index = False)


#     return parsed_results

In [56]:
## Adapted from code generated by ChatGPT - revised original function.

def get_full_data(term: str, location: str, yelp_key: str, file_name: str = 'data/wineries_raw.csv') -> pd.DataFrame:
    """
    Requests all results from Yelp API, saves as a .csv, and returns a DataFrame.

    Parameters:
    - term (str): The search term to query.
    - location (str): The location to search within.
    - yelp_key (str): Your Yelp API key.
    - file_name (str): The filename for the CSV file to save the data. Default is 'data/wineries_raw.csv'.

    Returns:
    - pd.DataFrame: The parsed business data from the last page retrieved (the .csv holds every page).
    """

    try:
        # Create the file if it does not exist yet
        with open(file_name, 'x'):
            pass
    except FileExistsError:
        # An existing file is kept, and new rows are appended to it
        print(f"File '{file_name}' already exists.")

    # Process first request to Yelp API and calculate number of pages 
    results = yelp_request_offset(term, location, yelp_key, offset=0, verbose=False)
    num_pages = (results.get('total', 0) // 50) + 1
    
    # Print out confirmation feedback
    print(f'For {term} and {location}: ')
    print(f"    Total number of results: {results.get('total', 0)}.")
    print(f'    Total number of pages: {num_pages}.')

    # Track the offset of the page being requested
    offset = 0

    # Fallback return value in case the first page fails
    parsed_results = pd.DataFrame()

    # Retrieve every page, starting again from offset 0
    for num in range(num_pages):
        try:
            # Process API request
            results = yelp_request_offset(term, location, yelp_key, offset=offset, verbose=False)
            
            # From results, take values from "businesses" key and parse them
            parsed_results = parse_data(results.get('businesses', []))
          
            # Append this page to the .csv; only the first page writes the header row
            with open(file_name, 'a', newline='', encoding='utf-8') as file:
                parsed_results.to_csv(file, index=False, header=(num == 0))
            
            # Increase offset to move to next "page" of data
            offset += 50
            
        except (KeyError, IndexError) as e:
            # If error, print where the error happens
            print(f'Error on page {num}: {e}')
            # Then stop requesting further pages
            break

    return parsed_results

In [57]:
results = get_full_data('hotel', 'Lisbon', yelp_key, file_name = './data/Hotel_Reviews_Lisbon.csv')
results.head()

File './data/Hotel_Reviews_Lisbon.csv' already exists.
For hotel and Lisbon: 
    Total number of results: 269.
    Total number of pages: 6.


Unnamed: 0,name,location,Business ID,alias,title,rating,review_count,price,latitude,longitude
0,Czar Lisbon Hotel,"Av. Almirante Reis, 103 1150-020 Lisbon Portugal",RZHSMuDwO1cQiN9YdOgxUA,hotels,Hotels,4.5,2,,38.729335,-9.134757
1,Largo da Sé Guest House,"Calçada do Correio Velho, nº3 Lisbon Portugal",bakV-SwtQW3svi8YtRxsRw,hotels,Hotels,5.0,1,,38.710529,-9.13417
2,Pensão Rossio,"R. dos Sapateiros, 173 2.° Esq. 1100-577 Lisbo...",CPU8jOLO5qVX1zrep6OCOw,hotels,Hotels,5.0,1,,38.711969,-9.138702
3,Happy @ Santos,Calçada marques de Abrantes 97 1200-719 Lisbon...,EWIVrs6yBFZO0V6tK-F4Jw,hotels,Hotels,3.0,1,,38.707792,-9.154771
4,Shiado Hostel,"R. Anchieta, 5 1200-023 Lisbon Portugal",XXhSVTPVdKXwJPcEG0LEsw,hotels,Hotels,4.0,4,€€,38.709833,-9.141074


# Cleaning Data

# ƒ: clean_data

Identifying, Filtering for, and Saving Top 2 Aliases

In [19]:
# def sort_by_aliases(raw_data = 'data/wineries_raw.csv'):

#     # Read in businesses
#     df1 = pd.read_csv(raw_data, header = 0)

#     # Create new DF filtering alias and title columns
#     df1_alias = df1.loc[:,['alias', 'title']]

#     # Identify top 2 aliases 
#     df1_alias_count = df1_alias.groupby('alias').count().sort_values(['title'],\
#                                                             ascending=False)[:2]

#     # Note: initially tried top 3, but it returned distributors, not wineries

#     df1_alias_count.reset_index(inplace=True)
    
#     print("Top two aliases: ")
#     print(df1_alias_count)

#     # display them as a list
#     aliases_top_2 = df1_alias_count['alias'].tolist()

#     # Selecting rows based on condition and saving

#     df2 = df1[df1['alias'].isin(aliases_top_2)]

#     df2.to_csv('data/wineries_filtered_alias.csv', index = False)
#     print("Saved to 'data/wineries_filtered_alias.csv'")
    
#     return df2

In [15]:
def clean_data(existing_dataframe, raw_data_path, cleaned_data_path):
    '''- Requires data from either an existing dataframe or an existing .csv file
    - Takes raw business data from the Yelp API and keeps only the rows whose
    alias is one of the two most common aliases in the data.
    '''

    # Use the dataframe if one is given; otherwise read the businesses from the .csv
    if existing_dataframe is not None:
        df1 = existing_dataframe
    else:
        # header=1 takes the column names from the second line of the file
        df1 = pd.read_csv(raw_data_path, header = 1)

    # Identify the two most common aliases
    alias_index = df1['alias'].value_counts()[:2].index
    print(alias_index)
    
    # Filter rows based on condition, then reset the index
    df2 = df1[df1['alias'].isin(alias_index)].reset_index(drop=True)
    
    # Save results
    df2.to_csv(cleaned_data_path,index = False)
       
    print(f"Saved to {cleaned_data_path}")
    
    return df2

In [16]:
raw_data_path='./data/Hotel_Reviews_Lisbon.csv'
cleaned_data_path='./data/Hotel_Reviews_Lisbon_Cleaned.csv'

df_filtered = clean_data(existing_dataframe = None,
                         raw_data_path = raw_data_path,
                         cleaned_data_path = cleaned_data_path)
df_filtered

Index(['hotels', 'guesthouses'], dtype='object', name='alias')
Saved to ./data/Hotel_Reviews_Lisbon_Cleaned.csv


Unnamed: 0,name,location,Business ID,alias,title,rating,review_count,price,latitude,longitude
0,Hotel Avenida Palace,"R. 1º de Dezembro, 123 1200-359 Lisbon Portugal",q-Px_TUI_zLr93xn-1-Ypw,hotels,Hotels,4.6,23,€€€,38.714837,-9.141188
1,Brown's Central,"R. da Assunção, 75 1100-042 Lisbon Portugal",yvbBx06xhwEyvXITlGji5g,hotels,Hotels,4.6,16,€€,38.7115402,-9.1383696
2,Evolution E.Hotel,"Praça Duque de Saldanha, 4 1050-094 Lisbon Por...",fmoWlWBqkVeyFpTtcE09RQ,hotels,Hotels,4.6,13,€€,38.733519900691,-9.14409564100038
3,Sheraton Lisboa Hotel & Spa,"Rua Latino Coelho, 1 1069-025 Lisbon Portugal",9wTTno6WG7cHlXDDZ00EQw,hotels,Hotels,4.2,53,€€€,38.7318001,-9.1470003
4,Ritz Four Seasons Hotel,"R. Rodrigo da Fonseca, 88 1099-039 Lisbon Port...",DPkYK5RwCkAULtLJp7pOmA,hotels,Hotels,4.6,24,€€€€,38.7255172,-9.1552052
...,...,...,...,...,...,...,...,...,...,...
229,Hotel As Lisboa,"Av. Almirante Reis, 188 1000-055 Lisbon Portugal",XieXK380bk1DgOMzVJrhbQ,hotels,Hotels,2.0,1,€€,38.737595634743,-9.13362236004946
230,Palácio do Governador,"R. Bartolomeu Dias, 117 1400-030 Lisbon Portugal",_ZBCoK0xNKw3NkQ21XArqg,hotels,Hotels,5.0,1,€€€€,38.6950405,-9.2144062
231,Turim Alameda,Av. Rovisco Pais 34 1000-268 Lisbon Portugal,e65ge0YP1I8FRzRSPvA4xQ,hotels,Hotels,4.0,1,,38.7354399,-9.13763
232,Aquapura Douro Valley,Quinta Da Vale De Abraoo Lisbon Portugal,BExuGekqd19zjQccm4u-Cw,hotels,Hotels,4.0,1,,38.7111511,-9.1503601
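
The top-two alias filter used by `clean_data` can be sketched without pandas to make the logic explicit: count the aliases, keep the two most common, then keep only the rows that carry one of them. The rows below are made up for illustration.

```python
from collections import Counter

# Made-up rows for illustration only.
rows = [
    {"name": "A", "alias": "hotels"},
    {"name": "B", "alias": "hotels"},
    {"name": "C", "alias": "guesthouses"},
    {"name": "D", "alias": "hostels"},
    {"name": "E", "alias": "guesthouses"},
    {"name": "F", "alias": "hotels"},
]

alias_counts = Counter(row["alias"] for row in rows)
top_two = [alias for alias, _ in alias_counts.most_common(2)]
filtered = [row for row in rows if row["alias"] in top_two]
print(top_two, len(filtered))
```

It is worth checking the printed aliases before trusting the filter, since the second most common alias may not be one you want to keep.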


# Getting Classy - Creating Class to Handle Full Process

---

Code initially generated by ChatGPT and adapted further.

---

In [58]:
class YelpReviewManager:
    def __init__(self, yelp_key):
        """Initialize with Yelp API key."""
        self._yelp_key = yelp_key
        self._base_url = 'https://api.yelp.com/v3/businesses/search'
        self.raw_data = None
        self.transformed_data = None

    def _make_api_request(self, term, location, offset=0):
        """Make a private Yelp API request."""
        headers = {'Authorization': f'Bearer {self._yelp_key}'}
        params = {'term': term, 'location': location, 'limit': 50, 'offset': offset}
        response = requests.get(self._base_url, headers=headers, params=params)
        return response.json()

    def fetch_and_save_raw_data(self, term, location, file_name, verbose=False):
        """Fetch raw data based on term and location, and save it."""
        businesses = []
        offset = 0
        while True:
            response = self._make_api_request(term, location, offset)
            batch = response.get('businesses', [])
            if not batch or 'total' in response and len(businesses) >= response['total']:
                break
            businesses.extend(batch)
            offset += 50
            if verbose:
                print(f"Fetched {len(batch)} businesses, total: {len(businesses)}")
        self.raw_data = pd.DataFrame(businesses)
        self.raw_data.to_csv(file_name, index=False)
        print(f"Raw data saved to {file_name}.")

    def transform_data(self, top_n=2):
        """Transform the raw data by filtering based on top n aliases."""
        if self.raw_data is None:
            raise ValueError("Raw data is not fetched yet.")
        value_counts = self.raw_data['categories'].apply(lambda x: x[0]['alias'] if x else None).value_counts()
        top_aliases = value_counts.nlargest(top_n).index.tolist()
        self.transformed_data = self.raw_data[self.raw_data['categories'].apply(lambda x: x[0]['alias'] if x else None).isin(top_aliases)]

    def save_transformed_data(self, file_name):
        """Save the transformed data to a CSV file."""
        if self.transformed_data is None:
            raise ValueError("Data has not been transformed yet.")
        self.transformed_data.to_csv(file_name, index=False)
        print(f"Transformed data saved to {file_name}.")

In [59]:
# Initialize the YelpReviewManager with your Yelp API key
yelp_review_manager = YelpReviewManager(yelp_key=yelp_key)

# Fetch and save raw data for hotels in Lisbon, Portugal
yelp_review_manager.fetch_and_save_raw_data(term='hotels', location='Lisbon, Portugal', file_name='lisbon_hotels_raw.csv', verbose=True)

# Perform transformation on the fetched data
yelp_review_manager.transform_data(top_n=2)

# Save the transformed data
yelp_review_manager.save_transformed_data(file_name='lisbon_hotels_transformed.csv')

Fetched 50 businesses, total: 50
Fetched 50 businesses, total: 100
Fetched 50 businesses, total: 150
Fetched 50 businesses, total: 200
Fetched 50 businesses, total: 250
Fetched 18 businesses, total: 268
Raw data saved to lisbon_hotels_raw.csv.
Transformed data saved to lisbon_hotels_transformed.csv.


# ƒ: convert_price

In [21]:
def convert_price(dataframe, filepath):
    ''' - Requires a dataframe whose 'price' column elements are NaN or a run of
    currency symbols (e.g. $, $$, $$$ or €, €€, €€€, €€€€).
    - Takes a pre-existing dataframe and adds a column storing the number of symbols as an integer (0 for NaN).'''
    
    # Count the currency symbols so that both $ and € labels convert; missing prices become 0.
    dataframe['price_converted'] = dataframe['price'].map(lambda p: len(p) if isinstance(p, str) else 0)
    
    
    # Saves results to new file
    dataframe.to_csv(filepath,index = False)
    
    return dataframe

In [22]:
convert_price(dataframe = df_filtered, filepath = './data/Hotel_Reviews_Lisbon_Cleaned.csv')

Unnamed: 0,name,location,Business ID,alias,title,rating,review_count,price,latitude,longitude,price_converted
0,Hotel Avenida Palace,"R. 1º de Dezembro, 123 1200-359 Lisbon Portugal",q-Px_TUI_zLr93xn-1-Ypw,hotels,Hotels,4.6,23,€€€,38.714837,-9.141188,3
1,Brown's Central,"R. da Assunção, 75 1100-042 Lisbon Portugal",yvbBx06xhwEyvXITlGji5g,hotels,Hotels,4.6,16,€€,38.7115402,-9.1383696,2
2,Evolution E.Hotel,"Praça Duque de Saldanha, 4 1050-094 Lisbon Por...",fmoWlWBqkVeyFpTtcE09RQ,hotels,Hotels,4.6,13,€€,38.733519900691,-9.14409564100038,2
3,Sheraton Lisboa Hotel & Spa,"Rua Latino Coelho, 1 1069-025 Lisbon Portugal",9wTTno6WG7cHlXDDZ00EQw,hotels,Hotels,4.2,53,€€€,38.7318001,-9.1470003,3
4,Ritz Four Seasons Hotel,"R. Rodrigo da Fonseca, 88 1099-039 Lisbon Port...",DPkYK5RwCkAULtLJp7pOmA,hotels,Hotels,4.6,24,€€€€,38.7255172,-9.1552052,4
...,...,...,...,...,...,...,...,...,...,...,...
229,Hotel As Lisboa,"Av. Almirante Reis, 188 1000-055 Lisbon Portugal",XieXK380bk1DgOMzVJrhbQ,hotels,Hotels,2.0,1,€€,38.737595634743,-9.13362236004946,2
230,Palácio do Governador,"R. Bartolomeu Dias, 117 1400-030 Lisbon Portugal",_ZBCoK0xNKw3NkQ21XArqg,hotels,Hotels,5.0,1,€€€€,38.6950405,-9.2144062,4
231,Turim Alameda,Av. Rovisco Pais 34 1000-268 Lisbon Portugal,e65ge0YP1I8FRzRSPvA4xQ,hotels,Hotels,4.0,1,,38.7354399,-9.13763,0
232,Aquapura Douro Valley,Quinta Da Vale De Abraoo Lisbon Portugal,BExuGekqd19zjQccm4u-Cw,hotels,Hotels,4.0,1,,38.7111511,-9.1503601,0


# Part 3 -  Create ETL pipeline for the restaurant review data from the API

You've done this for the businesses; now you need to do it for reviews. You will follow the same process, but your functions will be specific to reviews. Above you have a model of the functions you will need to write and how to pull them together in one script. For this part, follow the process below.

## Getting Business IDs

- To pull the reviews, you will need the business IDs, so your first step is to get all of the business IDs from your businesses .csv.

### Open file and slice ID

1. Open data/wineries.csv
2. Slice out the 'name' and 'id' columns for each row

In [24]:
df_saved = pd.read_csv("./data/Hotel_Reviews_Lisbon_Cleaned.csv")
df_saved.reset_index(drop=True, inplace=True)

# Slice out IDs, then save them to a list

df_saved_id = df_saved['Business ID'].to_list()
       
len(df_saved_id)

234
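
If the raw file has been appended to across several runs, the same business ID can appear more than once, and each repeat costs an extra API call later. A minimal sketch of reading the ID column and dropping repeats, using only the standard library and made-up rows in place of the real file:

```python
import csv
import io

# Stand-in for the cleaned .csv; the rows are made up.
csv_text = """name,Business ID
Example One,id-1
Example Two,id-2
Example One,id-1
"""

reader = csv.DictReader(io.StringIO(csv_text))
# dict.fromkeys keeps the first occurrence of each ID and preserves order.
business_ids = list(dict.fromkeys(row["Business ID"] for row in reader))
print(business_ids)
```

With pandas, calling `drop_duplicates()` on the ID column before `to_list()` has the same effect.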

## Requesting Reviews

- Write a function that takes a business id and makes a call to the API for reviews.


### ƒ: get_reviews

In [25]:
def get_reviews(business_ID, yelp_key, verbose=False):
    '''Adapted from Yelp API Lab: https://github.com/BenJMcCarty/dsc-yelp-api-lab/tree/solution'''
    
    url = 'https://api.yelp.com/v3/businesses/'+ business_ID + '/reviews'

    headers = {
            'Authorization': 'Bearer {}'.format(yelp_key),
        }

    response = requests.get(url, headers=headers)

    if verbose == True:
        print(response)
        print(type(response.text))
        print(response.text[:1000])


    return response.json()

# ƒ: Parse Reviews for final GET

In [26]:
def parse_reviews(review):
    '''Adapted from Tyrell's code'''  

    
    # Pull the specific k:v pairs out of a single review dictionary
    details = {
        'Reviewer Name': review['user']['name'],
        'Review Rating': review['rating'],
        'Review Text': review['text'],
        'Time Created': review['time_created']
        }


    # Create a one-row DataFrame from the resulting dictionary
    
    df_parsed_reviews = pd.DataFrame([details])
   
    return df_parsed_reviews

# ƒ: GET REVIEWS (ALL)

In [27]:
def get_all_reviews(list_of_biz_ids, yelp_key, file_name = 'data/hotel_reviews_raw.csv'):
    '''Requests all review results for given business IDs from Yelp API; \
    saves as a .csv; and returns a DataFrame.'''
    
    # Create a starter empty DataFrame and save to .csv to store data.    
    blank_df = pd.DataFrame(columns= ['Reviewer Name', 'Review Rating', 
                                      'Review Text', 'Time Created', 
                                      'Business ID'])
    blank_df.to_csv(file_name, index = False)
        
    for i in list_of_biz_ids:
        try:
            
            # Process API request for 3 reviews per business:
            raw_reviews = get_reviews(i, yelp_key)

            for review in raw_reviews['reviews']:
                
                # Parse one review from the "reviews" key
                parsed_reviews = parse_reviews(review) 

                parsed_reviews['Business ID'] = i
                
                # Append the parsed review to the .csv
                parsed_reviews.to_csv(file_name, mode='a', index = False,
                                      header = False)

        except (KeyError, TypeError, ValueError,
                requests.exceptions.RequestException) as e:
            # Reviews parsed so far are already saved, so report the
            # business that failed and move on to the next one
            print(f'Error on business {i}: {e}')

    # Read back everything that was saved
    return pd.read_csv(file_name)

#### Test GET REVIEWS

In [28]:
test_all_funct = get_all_reviews(df_saved_id, yelp_key)
test_all_funct

Unnamed: 0,Reviewer Name,Review Rating,Review Text,Time Created,Business ID
0,Tiffany H.,5,This is a luxury high class hotel located in t...,2023-12-16 10:34:05,q-Px_TUI_zLr93xn-1-Ypw
1,Andrew F.,4,"Charming, old-school, centrally located Lisbon...",2023-10-26 06:26:43,q-Px_TUI_zLr93xn-1-Ypw
2,Sheila R.,4,My husband and I stayed here recently for the ...,2022-08-14 12:15:56,q-Px_TUI_zLr93xn-1-Ypw
3,Alli C.,5,"Wow! This hotel is just breathtaking, and serv...",2023-07-18 16:49:06,yvbBx06xhwEyvXITlGji5g
4,Clive H.,5,"What a fantastic location, right in the heart ...",2022-07-05 09:10:56,yvbBx06xhwEyvXITlGji5g
...,...,...,...,...,...
493,Moumita B.,5,Right in the middle of Lisbon and in the most ...,2015-09-09 19:00:58,Ydmp0bv8rpmHsuBcSAo_tA
494,Stefano Z.,2,Except for being very close to my destination ...,2015-12-04 15:35:31,XieXK380bk1DgOMzVJrhbQ
495,Ana W.,5,"Beautiful historic building, with elegant and ...",2019-09-07 18:42:17,_ZBCoK0xNKw3NkQ21XArqg
496,Matt B.,4,"I stayed at Turim Alameda hotel, because it wa...",2016-05-12 13:35:11,e65ge0YP1I8FRzRSPvA4xQ


# Joining DFs

In [29]:
# df_details = pd.read_csv("data/wineries_price_converted.csv")
# df_reviews = pd.read_csv("data/reviews_raw.csv")

In [30]:
# df_details.info()

In [31]:
# df_reviews.info()

In [32]:
# df_merged = df_details.merge(df_reviews, how='outer', on='Business ID')
# df_merged

In [33]:
# df_merged.info()
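
Joining on `Business ID` is a one-to-many match: each business row is repeated once per review, and a business with no reviews is kept with the review fields missing. The plain-Python sketch below, with made-up rows, shows that behaviour for the business side of the join only; a full outer merge would additionally keep reviews whose ID matches no business.

```python
from collections import defaultdict

# Made-up rows for illustration only.
businesses = [{"Business ID": "id-1", "name": "Example One"},
              {"Business ID": "id-2", "name": "Example Two"}]
reviews = [{"Business ID": "id-1", "Review Rating": 5},
           {"Business ID": "id-1", "Review Rating": 3}]

# Group the reviews by business so each lookup is a single dict access.
reviews_by_id = defaultdict(list)
for review in reviews:
    reviews_by_id[review["Business ID"]].append(review)

merged = []
for biz in businesses:
    # A business without reviews still produces one row, with no review fields.
    matches = reviews_by_id.get(biz["Business ID"]) or [{}]
    for review in matches:
        merged.append({**biz, **review})
print(len(merged))
```

Comparing the row count before and after a merge is a quick check that the join key matched as expected.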

----

# Part 4 - Using Python and pandas, write code to answer the questions below.

**Reviews**

Which are the 5 most reviewed businesses in your dataset?

In [34]:
# df_saved.sort_values('review_count', ascending=False).head()[['name','review_count']]

What is the highest rating received in your data set and how many businesses have that rating?

In [35]:
# max_rating = df_saved['rating'].max()

# df_saved[df_saved['rating']== max_rating].shape

In [36]:
# sns.__version__

In [37]:
# df_saved

# Converting Price to Int

In [38]:
# df_saved['price'].value_counts(dropna=False)

In [39]:
# df_saved['price_converted'] = df_saved['price'].map({np.nan:0, '$':1, '$$':2, '$$$':3})

In [40]:
# sns.countplot(x= 'price_converted', data=df_saved);

In [41]:
# sns.countplot(x= 'rating',data=df_saved);

>- `hue` parameter - seaborn

What percentage of businesses have a rating greater than or equal to 4.5?

In [42]:
# df_total_high = df_saved[df_saved['rating'] >= 4.5].shape[0]
# df_total_high / df_saved.shape[0]

What percentage of businesses have a rating less than 3?

In [43]:
# df_total_low = df_saved[df_saved['rating'] < 3].shape[0]
# df_total_low

In [44]:
# df_total_low / df_saved.shape[0]

---

**Pricing**

What percentage of your businesses have a price label of one dollar sign? Two dollar signs? Three dollar signs? No dollar signs?

In [45]:
# df_total_zero = df_saved[df_saved['price_converted'] == 0].shape[0]
# print(df_total_zero / df_saved.shape[0])
# df_total_one = df_saved[df_saved['price_converted'] == 1].shape[0]
# print(df_total_one / df_saved.shape[0])
# df_total_two = df_saved[df_saved['price_converted'] == 2].shape[0]
# print(df_total_two / df_saved.shape[0])
# df_total_three = df_saved[df_saved['price_converted'] == 3].shape[0]
# print(df_total_three / df_saved.shape[0])
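
The four separate calculations above all follow one pattern: count each price level and divide by the number of businesses. A minimal sketch of that pattern in plain Python, using made-up values rather than the real column:

```python
from collections import Counter

# Made-up converted prices (0 means no price label).
price_converted = [0, 2, 2, 3, 1, 0, 2, 4]

counts = Counter(price_converted)
total = len(price_converted)
shares = {level: counts[level] / total for level in sorted(counts)}
print(shares)
```

In pandas, `df_saved['price_converted'].value_counts(normalize=True)` produces the same shares in one call.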

---

**Returning Reviews**

Return the text of the reviews for the most reviewed business. 

In [46]:
# df_saved.keys()

In [47]:
# max_reviewed = df_saved['review_count'].max()
# most_reviewed_id = df_saved[df_saved['review_count'] == max_reviewed]['Business ID']
# most_reviewed_id
# # test_all_funct[test_all_funct['Business ID'].isin(most_reviewed_id)]['Review Text']


In [48]:
# test_all_funct.keys()

Find the highest rated business and return the text of its most recent review. If multiple businesses have the same rating, select the business with the most reviews.

In [49]:
# top_b = df_saved[df_saved['rating'] == max_rating]
# max_counts = df_saved['review_count'].max()
# top_num_reviews_top_b = top_b['review_count'].max()
# top_b[top_b['review_count'] == top_num_reviews_top_b]
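
The tie-breaking rule in this question can be expressed as a single comparison on a tuple: rating first, then review count. The sketch below uses made-up rows and a hypothetical `name` field to link reviews to businesses (the notebook's data links them by `Business ID`).

```python
# Made-up rows for illustration only.
businesses = [
    {"name": "A", "rating": 5.0, "review_count": 1},
    {"name": "B", "rating": 5.0, "review_count": 7},
    {"name": "C", "rating": 4.5, "review_count": 90},
]
reviews = [
    {"name": "B", "text": "older review", "time_created": "2021-03-01 10:00:00"},
    {"name": "B", "text": "newer review", "time_created": "2023-06-15 09:30:00"},
]

# Tuples compare element by element: rating first, review_count breaks ties.
top = max(businesses, key=lambda b: (b["rating"], b["review_count"]))

# Timestamps in "YYYY-MM-DD HH:MM:SS" form sort correctly as plain strings.
latest = max((r for r in reviews if r["name"] == top["name"]),
             key=lambda r: r["time_created"])
print(top["name"], latest["text"])
```

For the lowest rated business with the fewest reviews, the same key works with `min` in place of `max`.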

Find the lowest rated business and return the text of its most recent review. If multiple businesses have the same rating, select the business with the fewest reviews.

In [50]:
# min_b = df_saved['rating'].min()
# min_b = df_saved[df_saved['rating'] == min_b]
# min_counts = df_saved['review_count'].min()
# min_num_reviews_min_b = min_b['review_count'].min()
# min_b[min_b['review_count'] == min_num_reviews_min_b]

# Reference help

###  Pagination

Returning to the Yelp API, the [documentation](https://www.yelp.com/developers/documentation/v3/business_search) also gives details about the API limits. These often include the number of requests a user is allowed to make within a specified time window and the maximum number of results that can be returned. In this case, we are told that each request returns a maximum of 50 results and defaults to 20. Furthermore, any search is limited to a total of 1000 results. To retrieve all 1000 of these results, we would have to page through them piece by piece, retrieving 50 at a time. Processes such as these are often referred to as pagination.

Now that you have an initial response, you can examine the contents of the JSON container. For example, you might start with ```response.json().keys()```. Here, you'll see a key for `'total'`, which tells you the full number of matching results given your query parameters. Write a loop (or ideally a function) which then makes successive API calls using the offset parameter to retrieve all of the results (or the first 1000 for a particularly large result set) for the original query. As you do this, be mindful of how you store the data.
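
The offsets needed for those successive calls follow directly from `'total'`, the page size, and the overall result cap. A minimal sketch, assuming a page size of 50 and a cap of 1000 as described above; `page_offsets` is a hypothetical helper name.

```python
def page_offsets(total: int, limit: int = 50, cap: int = 1000):
    """Yield the offset of each page needed to cover `total` results.

    `cap` is the most results the search will return overall.
    """
    reachable = min(total, cap)
    yield from range(0, reachable, limit)


offsets = list(page_offsets(269))
print(offsets)
```

Using `range` with a step avoids the off-by-one that `total // limit + 1` introduces when the total is an exact multiple of the page size.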

**Note: be mindful of the API rate limits. You can only make 5000 requests per day, and your code can send requests too quickly. Start prototyping small before running a loop that could be faulty. You can also use time.sleep(n) to add delays. For more details see https://www.yelp.com/developers/documentation/v3/rate_limiting.**
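
One way to add those delays without scattering `time.sleep` calls through the loop is a small generator that pauses between items. This is a sketch: `throttled` is a hypothetical helper, and the half-second default is an arbitrary choice, not a value taken from the Yelp documentation.

```python
import time


def throttled(items, delay_seconds: float = 0.5, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)
        yield item


# Usage: swap the print for the real request.
for business_id in throttled(["id-1", "id-2", "id-3"], delay_seconds=0.01):
    print("would request reviews for", business_id)
```

The `sleep` argument exists only so the pause can be replaced in tests; in normal use it can be left at its default.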

***Below is sample code that you can use to help you deal with the pagination parameter and bring all of the functions together.***


***Also, something might cause your code to break while it is running. You don't want to re-pull the same data every time this happens, so you should insert the data into the database as you call and parse it, not after you have all of the data.***
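
A minimal sketch of inserting page by page, assuming a SQLite database and a hypothetical `businesses` table keyed on the business ID. Committing after every page means a crash loses at most the page in flight, and `INSERT OR IGNORE` lets a restarted run re-send rows it already stored without creating duplicates. The table layout and sample rows are made up for illustration.

```python
import sqlite3


def insert_page(conn: sqlite3.Connection, rows: list) -> None:
    """Store one page of parsed rows and commit immediately."""
    conn.executemany(
        "INSERT OR IGNORE INTO businesses (id, name, rating) VALUES (?, ?, ?)",
        [(r["id"], r["name"], r["rating"]) for r in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
conn.execute(
    "CREATE TABLE IF NOT EXISTS businesses (id TEXT PRIMARY KEY, name TEXT, rating REAL)"
)

page_1 = [{"id": "a1", "name": "Example One", "rating": 4.5}]
page_2 = [{"id": "a1", "name": "Example One", "rating": 4.5},  # repeat after a restart
          {"id": "b2", "name": "Example Two", "rating": 3.0}]

for page in (page_1, page_2):
    insert_page(conn, page)

row_count = conn.execute("SELECT COUNT(*) FROM businesses").fetchone()[0]
print(row_count)
```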


In [51]:
# # create a variable  to keep track of which result you are in. 
# cur = 0

# #set up a while loop to go through and grab the result 
# while cur < num and cur < 1000:
#     #set the offset parameter to be where you currently are in the results 
#     url_params['offset'] = cur
#     #make your API call with the new offset number
#     results = yelp_call(url_params, api_key)
    
#     #after you get your results you can now use your function to parse those results
#     parsed_results = parse_results(results)
    
#     # use your function to insert your parsed results into the db
#     db_insert(parsed_results)
#     #increment the counter by 50 (the page size) to move on to the next results
#     cur += 50