# Part 1 - Extracting and Saving Data from Yelp API

## Objective

- In this CodeAlong, we will work with the Yelp API.
- You will use the Yelp API to search your hometown for a cuisine type of your choice.
- Next class, we will use Plotly Express with the Mapbox API to map the results.
    
    

## Tools You Will Use
- Part 1:
    - Yelp API:
        - Getting Started: https://www.yelp.com/developers/documentation/v3/get_started
    - `YelpAPI` Python package: https://github.com/gfairchild/yelpapi
- Part 2:
    - Plotly Express: https://plotly.com/python/getting-started/
        - With Mapbox API: https://www.mapbox.com/
        - `px.scatter_mapbox` [Documentation](https://plotly.com/python/scattermapbox/)




### Applying Code From
- Efficient API Calls Lesson Link: https://login.codingdojo.com/m/376/12529/88078

In [1]:
# Install the yelpapi package before importing it
!pip install yelpapi --quiet

In [2]:
# Standard Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Additional Imports
import os, json, math, time
from yelpapi import YelpAPI
from tqdm.notebook import tqdm_notebook

## 1. Registering for Required APIs


- Yelp: https://www.yelp.com/developers/documentation/v3/get_started


> Check the official API documentation to see which search arguments the business search endpoint accepts: https://www.yelp.com/developers/documentation/v3/business_search

### Load Credentials and Create Yelp API Object

In [3]:
# Path to the API credentials file
relative_path = os.path.join('.secret', 'yelp_api.json')

In [4]:
# Load the credentials and instantiate the YelpAPI object
with open(relative_path) as file:
    yelp_credentials = json.load(file)
    
yelp_api = YelpAPI(yelp_credentials['api-key'], timeout_s=5.0)
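A missing file or a misspelled key is the usual failure at this step. The sketch below wraps the loading in a small helper that fails with a descriptive message. `load_api_key` is a hypothetical name (it is not part of the `yelpapi` package), and the `'api-key'` default only mirrors the key used in the cell above.

```python
import json
import os
import tempfile


def load_api_key(path, key="api-key"):
    """Return the value stored under `key` in the JSON file at `path`.

    Hypothetical helper: raises a descriptive error when the file or
    the key is missing.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Credentials file not found: {path}")
    with open(path) as file:
        credentials = json.load(file)
    if key not in credentials:
        raise KeyError(f"'{key}' missing from {path}; found keys: {list(credentials)}")
    return credentials[key]


# Demonstration with a throwaway file and a placeholder value (not a real key)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "yelp_api.json")
    with open(demo_path, "w") as file:
        json.dump({"api-key": "PLACEHOLDER"}, file)
    demo_key = load_api_key(demo_path)

print(demo_key)
```

In this notebook the helper could be used as `YelpAPI(load_api_key(relative_path), timeout_s=5.0)`.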

In [5]:
import os

# Print the current working directory
print("Current working directory:", os.getcwd())

# List the contents of the '.secret' directory
try:
    print("Contents of .secret directory:", os.listdir('.secret'))
except FileNotFoundError:
    print(".secret directory not found.")

# Attempt to open the file
try:
    with open('.secret/yelp_api.json') as f:
        # Your file reading code here
        pass
except FileNotFoundError:
    print("The file '.secret/yelp_api.json' was not found.")

Current working directory: D:\My Documents\GitHub\data-enrichment-wk14-activity-mapping-yelp-api-results
Contents of .secret directory: ['yelp_api.json']


In [7]:
import os

# Rename the file from 'yelp_api.json.txt' to 'yelp_api.json'
# os.rename('.secret/yelp_api.json.txt', '.secret/yelp_api.json')

### Define Search Terms and File Paths

In [10]:
# Define API call parameters and output file path
LOCATION = 'Greenville, SC'
TERM = 'Sushi'
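# Note: the leading slash makes this an absolute path (the root of the current
# drive on Windows), unlike the relative 'Data' folder used for the final files.
JSON_FILE = '/Data/results_SC_Sushi.json'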

# Display the file path where data will be saved
print(f'Data will be saved to: {JSON_FILE}')

Data will be saved to: /Data/results_SC_Sushi.json


### Check if the JSON File Exists and Create It if It Doesn't

In [11]:
# Check if JSON_FILE exists and create it if it doesn't
if not os.path.isfile(JSON_FILE):
    
    # Create the directory if it doesn't exist
    os.makedirs(os.path.dirname(JSON_FILE), exist_ok=True)
    
    # Inform user and save an empty list to file
    print(f'[i] {JSON_FILE} not found. Saving empty list to file.')
    with open(JSON_FILE, 'w') as file:
        json.dump([], file)
else:
    # Inform user if the file already exists
    print(f'[i] {JSON_FILE} already exists.')

[i] /Data/results_SC_Sushi.json already exists.
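The same check can be packaged as a reusable function. This is a minimal sketch, and `ensure_json_file` is a hypothetical name. It also guards one edge case: `os.path.dirname` returns an empty string for a bare filename, and `os.makedirs('')` raises an error, so the folder is only created when the path actually contains one.

```python
import json
import os
import tempfile


def ensure_json_file(path):
    """Create `path` containing an empty list if it does not exist yet.

    Returns True if the file was created, False if it was already there.
    """
    if os.path.isfile(path):
        return False
    folder = os.path.dirname(path)
    # Only create a folder when the path actually includes one
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w") as file:
        json.dump([], file)
    return True


# Demonstration in a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, "Data", "results_demo.json")
    created_first = ensure_json_file(demo_file)   # file is created
    created_second = ensure_json_file(demo_file)  # file already exists
    with open(demo_file) as file:
        demo_contents = json.load(file)

print(created_first, created_second, demo_contents)
```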


### Load the JSON File and Account for Previous Results

In [12]:
# Load previous results and set offset based on the number of results
with open(JSON_FILE, 'r') as file:
    previous_results = json.load(file)

n_results = len(previous_results)

print(f'- {n_results} previous results found.')

- 0 previous results found.


### Make the First API Call to Get the First Page of Data

- We will use this first result to check:
    - How many total results are there?
    - Where is the actual data we want to save?
    - How many results do we get at a time?


In [13]:
# use our yelp_api variable's search_query method to perform our API call
results = yelp_api.search_query(location = LOCATION,
                                term = TERM,
                                offset = n_results)
results.keys()

dict_keys(['businesses', 'total', 'region'])

In [14]:
## How many results total?
total_results = results['total']
total_results

110

- Where is the actual data we want to save?

In [15]:
business_data = results['businesses']

# specify the filename where you want to save the data
json_file_path = JSON_FILE

# save the business data to a JSON file
with open(json_file_path, 'w') as file:
    json.dump(business_data, file, indent = 4)

In [16]:
## How many did we get the details for?
results_per_page = len(business_data)
print(f'number of results retrieved per page {results_per_page}')

number of results retrieved per page 20


- Calculate how many pages of results are needed to cover `total_results`.

In [17]:
# Use math.ceil to round up for the total number of pages of results.
n_pages = math.ceil(total_results / results_per_page)
print(f'Total number of pages: {n_pages}')

Total number of pages: 6
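As a worked check of the arithmetic, using the values reported above (110 total results, 20 per page): the division gives 5.5, so rounding up yields 6 pages, and the last page holds only the leftover results. The variable names `exact` and `remaining` are illustrative.

```python
import math

total_results = 110    # total reported by the API above
results_per_page = 20  # number of businesses in the first page

exact = total_results / results_per_page  # 5.5 pages, not a whole number
n_pages = math.ceil(exact)                # round up so the partial page is included
remaining = total_results - results_per_page * (n_pages - 1)  # size of the last page

print(exact, n_pages, remaining)
```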


In [None]:
results = yelp_api.search_query(location=LOCATION, term=TERM, offset=n_results)

total_results = results['total']
business_data = results['businesses']

with open(JSON_FILE, 'w') as file:
    json.dump(business_data, file, indent=4)

results_per_page = len(business_data)

# Check if there are any results per page to avoid division by zero
if results_per_page > 0:
    n_pages = math.ceil(total_results / results_per_page)
else:
    n_pages = 0  # No pages if there are no results

print(f'Number of results retrieved per page: {results_per_page}')
print(f'Total number of pages: {n_pages}')

# Additional handling for when there are no business results
if n_pages == 0:
    print("No business data found for the given search parameters.")

In [None]:
# `n_pages` was calculated above. The first page is already saved to JSON_FILE,
# so only the remaining n_pages - 1 pages need to be requested.
for i in tqdm_notebook(range(1, n_pages)):
    try:
        time.sleep(0.2)  # Short delay to respect API rate limits
        
        # Load existing results to append new data
        with open(JSON_FILE, 'r') as file:
            previous_results = json.load(file)

        # Fetch new results using the current length of previous_results as the offset
        new_results = yelp_api.search_query(location=LOCATION, term=TERM, offset=len(previous_results))

        # Append and save the updated results
        updated_results = previous_results + new_results['businesses']
        with open(JSON_FILE, 'w') as file:
            json.dump(updated_results, file)

    except Exception as e:
        if 'Too Many Requests for url' in str(e):
            print('Rate limit exceeded. Stopping data collection.')
            break  # Exit loop if rate limit is exceeded
        else:
            print(f'An error occurred: {e}')
            continue  # Continue to next iteration in case of other errors
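The offset-based pagination used in the loop above can be tried without network access or an API key. In the sketch below, `fake_search` is a hypothetical stand-in for an API call that mimics the `'total'` and `'businesses'` keys seen in the real response, and `collect_all` is an illustrative helper rather than part of `yelpapi`. It stops on an empty page or once the reported total is reached, and it caps the number of calls as a safety net.

```python
def fake_search(offset, page_size=20, total=110):
    """Stand-in for an API call: returns one page of fake businesses."""
    ids = range(offset, min(offset + page_size, total))
    return {"total": total, "businesses": [{"id": f"biz-{i}"} for i in ids]}


def collect_all(search, max_calls=100):
    """Request pages until the reported total is reached or a page is empty."""
    collected = []
    for _ in range(max_calls):  # hard cap so a misbehaving API cannot loop forever
        page = search(offset=len(collected))  # offset = number of results so far
        businesses = page["businesses"]
        if not businesses:
            break  # nothing new came back, so stop
        collected.extend(businesses)
        if len(collected) >= page["total"]:
            break  # everything the API reported has been collected
    return collected


all_businesses = collect_all(fake_search)
print(len(all_businesses))
```

Using the number of collected results as the next offset means an interrupted run can resume where it stopped, which is the same idea as reloading `JSON_FILE` on each iteration.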

## Open the Final JSON File with Pandas

In [None]:
# Load the final JSON file into a DataFrame
df = pd.read_json(JSON_FILE)

# Display the first and last few rows of the DataFrame
display(df.head(), df.tail())

# Check for duplicate entries based on the 'id' column
duplicate_count = df.duplicated(subset='id').sum()
print('\n')
print(f'Number of duplicate IDs: {duplicate_count}')
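If the duplicate count is not zero, the repeated entries can be dropped while keeping the first occurrence of each ID. The sketch below shows the idea on a plain list of dictionaries, with made-up records and a hypothetical helper name `drop_duplicate_ids`. On the DataFrame itself, `df.drop_duplicates(subset='id')` does the same job.

```python
def drop_duplicate_ids(records, key="id"):
    """Return the records with repeated `key` values removed, keeping the first."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


# Made-up records for illustration only
demo_records = [
    {"id": "a", "name": "First"},
    {"id": "b", "name": "Second"},
    {"id": "a", "name": "First (repeat)"},
]
unique_records = drop_duplicate_ids(demo_records)
print([record["name"] for record in unique_records])
```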

In [None]:
# Specify directory and base filename
directory = 'Data'
filename = 'final_results_SC_Sushi.csv.gz'  # Include .csv.gz extension here
path = os.path.join(directory, filename)

# Ensure that the 'Data' directory exists
os.makedirs(directory, exist_ok=True)

# Save DataFrame as a compressed CSV file (to save space)
df.to_csv(path, compression='gzip', index=False)

In [None]:
# Path for the final JSON file
json_file = 'Data/final_results_SC_Sushi.json'

# Save the DataFrame as line-delimited JSON (one record per line)
df.to_json(json_file, orient='records', lines=True)

In [None]:
# Derive the .csv.gz path from the JSON path by replacing the file extension
# (this is the same file that was written in the earlier cell)
csv_gz_file = json_file.replace('.json', '.csv.gz')

# Save the DataFrame as a compressed CSV without the index
df.to_csv(csv_gz_file, compression='gzip', index=False)

## Bonus: Compare File Sizes with `os.path.getsize`

In [None]:
# Compare File Sizes to demonstrate the efficiency of compression
if os.path.exists(json_file) and os.path.exists(csv_gz_file):
    size_json = os.path.getsize(json_file)
    size_csv_gz = os.path.getsize(csv_gz_file)

    print(f'JSON FILE: {size_json:,} Bytes')
    print(f'CSV.GZ FILE: {size_csv_gz:,} Bytes')

    # Calculate and display the compression ratio if the .csv.gz file is not empty
    if size_csv_gz > 0:
        compression_ratio = size_json / size_csv_gz
        print(f'The JSON file is {compression_ratio:.2f} times larger than the csv.gz file.')
    else:
        print("CSV.GZ file size is 0, cannot compare sizes.")
else:
    print("One or both files do not exist, check file paths.")
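The effect of compression can also be seen with the standard library alone. The sketch below writes the same made-up payload twice, once as plain JSON and once through `gzip`, and compares the sizes with `os.path.getsize`. The records are illustrative and deliberately repetitive, so the ratio here says nothing about the ratio for real Yelp results.

```python
import gzip
import json
import os
import tempfile

# Made-up, repetitive records for illustration only
demo_rows = [{"id": f"biz-{i}", "city": "Greenville", "state": "SC"} for i in range(500)]
payload = json.dumps(demo_rows).encode("utf-8")

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "demo.json")
    gz_path = os.path.join(tmp_dir, "demo.json.gz")

    # Write the identical bytes uncompressed and gzip-compressed
    with open(plain_path, "wb") as file:
        file.write(payload)
    with gzip.open(gz_path, "wb") as file:
        file.write(payload)

    size_plain = os.path.getsize(plain_path)
    size_gz = os.path.getsize(gz_path)

print(f"{size_plain:,} bytes plain vs {size_gz:,} bytes gzipped")
```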

## Next Class: Processing the Results and Mapping 