# YELP Fusion API

* [Fusion Authentication page](https://docs.developer.yelp.com/docs/fusion-authentication)  
* [Yelp API docs](https://docs.developer.yelp.com/docs/getting-started)


##### I have went to the hyperlink above to research the process to retrieve API Key

1. Access site
2. Sign Up for Yelp wbsite (Sign in if you have an existing account)
3. Click 'Create App'
4. Fill Required Information and select submit
5. Record and save you API key in a .py file

## **API Usage**

## Daily API limit: *500*   **(Reviews: 3 per buisness)**

In [1]:
## I may have to web scrape for a list of business ids to return reviews

### Maybe do some realtime data pipeline to store data into a database?  

In [28]:
import requests
from env import YELP_ID, YELP_API_KEY, yelp_locale_url

headers = {
    "accept": "application/json",
    "Authorization": F"Bearer {YELP_API_KEY}"
}

response = requests.get(yelp_locale_url, headers=headers)

In [29]:
response

<Response [200]>

## Checked API credentials and endpoint connection and retrieval of data; now will look for alternate pagination method per the yelp fuision api parameters

In [None]:
r = response.json()

In [33]:
r['businesses'][1]

{'id': 'WG639VkTjmK5dzydd1BBJA',
 'alias': 'rubirosa-new-york-2',
 'name': 'Rubirosa',
 'image_url': 'https://s3-media3.fl.yelpcdn.com/bphoto/l0Phrnhhj78RFiDhLIOUyQ/o.jpg',
 'is_closed': False,
 'url': 'https://www.yelp.com/biz/rubirosa-new-york-2?adjust_creative=_cn4uqd0Qxmkxg3N98CKKQ&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=_cn4uqd0Qxmkxg3N98CKKQ',
 'review_count': 3192,
 'categories': [{'alias': 'italian', 'title': 'Italian'},
  {'alias': 'pizza', 'title': 'Pizza'}],
 'rating': 4.5,
 'coordinates': {'latitude': 40.722766, 'longitude': -73.996233},
 'transactions': ['pickup'],
 'price': '$$',
 'location': {'address1': '235 Mulberry St',
  'address2': '',
  'address3': '',
  'city': 'New York',
  'zip_code': '10012',
  'country': 'US',
  'state': 'NY',
  'display_address': ['235 Mulberry St', 'New York, NY 10012']},
 'phone': '+12129650500',
 'display_phone': '(212) 965-0500',
 'distance': 1922.0346803084792}

In [4]:
# # Make the GET request
# response = requests.get(url, headers=headers)

# # Check the response status
# if response.status_code == 200:
#     data = response.json()
#     # You can now work with the data from the response
#     # For example, you can access the reviews with data['reviews']
# else:
#     print(f"Request failed with status code {response.status_code}")

## Referencing the Fusion Yelp API below is the best method to paginate the data to first retrieve the business_ids in order for us to search and retrieve reviews for those specified businesses.

In [35]:
# required imports to implement paginatiom
import requests
import pandas as pd
import time
# env.py imports
from env import YELP_ID, YELP_API_KEY, yelp_locale_url

# API credentials and endpoint
api_key = YELP_API_KEY
api_endpoint = 'https://api.yelp.com/v3/businesses/search'

# Pagination parameters per the API businesses reference page
limit = 50  # Number of results per page
offset = 0  # Start with the first page

all_businesses = []  # List to store all business data

# Create the request URL with the 'limit' and 'offset' parameters
url = f'{api_endpoint}?location=New+York&limit={limit}&offset={offset}'

# While loop to make calls and retrieve data in a managable way by chunks
while True:
    headers = {
        'Authorization': f'Bearer {YELP_API_KEY}'
    }

    response = requests.get(url, headers=headers)

    # reponse code check
    if response.status_code != 200:
        print(f"Failed to retrieve data. Status code: {response.status_code}")
        break

    businesses = response.json().get('businesses', [])

    if not businesses:
        # No more results to fetch
        break

    # add data to all_businesses list
    all_businesses.extend(businesses)
    offset += limit  # Move to the next page

    # Sleep for 60 seconds to respect QPS rate limiting
    time.sleep(60)

# Now 'all_businesses' contains all the retrieved business data

# Convert the data into a DataFrame
df = pd.DataFrame(all_businesses)

Failed to retrieve data. Status code: 503


## Looking at the yelp business dataframe and ensure good data is being used to cross reference

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29200 entries, 0 to 29199
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             29200 non-null  object 
 1   alias          29200 non-null  object 
 2   name           29200 non-null  object 
 3   image_url      29200 non-null  object 
 4   is_closed      29200 non-null  bool   
 5   url            29200 non-null  object 
 6   review_count   29200 non-null  int64  
 7   categories     29200 non-null  object 
 8   rating         29200 non-null  float64
 9   coordinates    29200 non-null  object 
 10  transactions   29200 non-null  object 
 11  price          28873 non-null  object 
 12  location       29200 non-null  object 
 13  phone          29200 non-null  object 
 14  display_phone  29200 non-null  object 
 15  distance       29200 non-null  float64
dtypes: bool(1), float64(2), int64(1), object(12)
memory usage: 3.4+ MB


## After checking it, decided to cache the data into a csv file for easier retrieval; may visit if time permits to pipeline into a database with live data being used as it is updated.

In [41]:
df.to_csv('yelp_businesseso.csv', index=False)

In [42]:
df.head()

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance
0,veq1Bl1DW3UWMekZJUsG1Q,gramercy-tavern-new-york,Gramercy Tavern,https://s3-media2.fl.yelpcdn.com/bphoto/f14WAm...,False,https://www.yelp.com/biz/gramercy-tavern-new-y...,3403,"[{'alias': 'newamerican', 'title': 'American (...",4.5,"{'latitude': 40.73844, 'longitude': -73.98825}",[delivery],$$$$,"{'address1': '42 E 20th St', 'address2': '', '...",12124770777,(212) 477-0777,3695.639928
1,ysqgdbSrezXgVwER2kQWKA,julianas-brooklyn-3,Juliana's,https://s3-media2.fl.yelpcdn.com/bphoto/od36nF...,False,https://www.yelp.com/biz/julianas-brooklyn-3?a...,2700,"[{'alias': 'pizza', 'title': 'Pizza'}]",4.5,"{'latitude': 40.70274718768062, 'longitude': -...",[delivery],$$,"{'address1': '19 Old Fulton St', 'address2': '...",17185966700,(718) 596-6700,318.876261
2,nRO136GRieGtxz18uD61DA,eleven-madison-park-new-york,Eleven Madison Park,https://s3-media1.fl.yelpcdn.com/bphoto/s_H7gm...,False,https://www.yelp.com/biz/eleven-madison-park-n...,2451,"[{'alias': 'newamerican', 'title': 'American (...",4.5,"{'latitude': 40.7416907417333, 'longitude': -7...",[],$$$$,"{'address1': '11 Madison Ave', 'address2': '',...",12128890905,(212) 889-0905,4062.92957
3,h37t9rA06Sr4EetJjKrfzw,don-angie-new-york,Don Angie,https://s3-media2.fl.yelpcdn.com/bphoto/onJX6_...,False,https://www.yelp.com/biz/don-angie-new-york?ad...,785,"[{'alias': 'italian', 'title': 'Italian'}, {'a...",4.5,"{'latitude': 40.73778, 'longitude': -74.00197}",[delivery],$$$,"{'address1': '103 Greenwich Ave', 'address2': ...",12128898884,(212) 889-8884,3646.541688
4,O1fUmxt3kbV-rnyjBtzAfw,thep-thai-restaurant-new-york-5,THEP Thai Restaurant,https://s3-media3.fl.yelpcdn.com/bphoto/WymEpZ...,False,https://www.yelp.com/biz/thep-thai-restaurant-...,2790,"[{'alias': 'thai', 'title': 'Thai'}, {'alias':...",4.5,"{'latitude': 40.77078, 'longitude': -73.95727}","[pickup, delivery]",$$,"{'address1': '1439 2nd Ave', 'address2': '', '...",12128999995,(212) 899-9995,7893.531804


 ## I loaded the ny data that nick prepped to strategize what can be done here to match the dataframe here to have some usable business_id's to retrieve the reviews for respective businesses. This may involve a library outside of our knowledge scope. 

### - I'm thinking spacy or fuzzwuzzy that uses accuracy ratio based on a matching algorithm. but I think i'm just going to use the python NLTK library

In [50]:
ny_df = pd.read_csv('ny.csv')

In [49]:
ny_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207929 entries, 0 to 207928
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   camis                  207929 non-null  int64  
 1   dba                    207421 non-null  object 
 2   boro                   207929 non-null  object 
 3   building               207578 non-null  object 
 4   street                 207923 non-null  object 
 5   zipcode                205249 non-null  float64
 6   phone                  207922 non-null  object 
 7   inspection_date        207929 non-null  object 
 8   critical_flag          207929 non-null  object 
 9   record_date            207929 non-null  object 
 10  latitude               207672 non-null  float64
 11  longitude              207672 non-null  float64
 12  community_board        204682 non-null  float64
 13  council_district       204678 non-null  float64
 14  census_tract           204678 non-nu

In [46]:
ny_df.head()

Unnamed: 0,camis,dba,boro,building,street,zipcode,phone,inspection_date,critical_flag,record_date,...,bbl,nta,cuisine_description,action,violation_code,violation_description,score,grade,grade_date,inspection_type
0,50106756,UNGARO COAL FIRED PIZZA CAFE,Staten Island,1298,FOREST AVENUE,10302.0,6464690930,1900-01-01T00:00:00.000,Not Applicable,2023-10-26T06:00:14.000,...,5003870000.0,SI07,,,,,,,,
1,50105716,STELLA'S,Brooklyn,559,5 AVENUE,11215.0,4155703174,1900-01-01T00:00:00.000,Not Applicable,2023-10-26T06:00:14.000,...,3010480000.0,BK37,,,,,,,,
2,41168748,DUNKIN,Bronx,880,GARRISON AVENUE,10474.0,7188614171,2022-03-30T00:00:00.000,Not Critical,2023-10-26T06:00:11.000,...,2027390000.0,BX27,Donuts,Violations were cited in the following area(s).,10J,Hand wash sign not posted,13.0,A,2022-03-30T00:00:00.000,Cycle Inspection / Initial Inspection
3,50131566,EXTACY LOUNGE,Queens,7701,JAMAICA AVE,11421.0,3478752367,1900-01-01T00:00:00.000,Not Applicable,2023-10-26T06:00:14.000,...,4088410000.0,QN53,,,,,,,,
4,50128764,RUNNING KIDS,Brooklyn,856,64 STREET,11220.0,7188338856,1900-01-01T00:00:00.000,Not Applicable,2023-10-26T06:00:14.000,...,3057420000.0,BK34,,,,,,,,


## Matching ny businesses with yelp ids for review retrieval using NLTK/Regex

In [None]:
# decided to go with NLTK versus the pretrained models 
import nltk
from nltk.tokenize import word_tokenize
from nltk.metrics import jaccard_distance
from nltk.corpus import stopwords

# Load NLTK stopwords
# nltk.download('stopwords')
# nltk.download('punkt')

# load dataframes
yelp_df = pd.read_csv('yelp_businesses.csv')
ny_df = pd.read_csv('ny.csv')

# Preprocess the 'name' and 'dba' columns (lowercased the string values and used strip for the white spaces
yelp_df['name'] = yelp_df['name'].str.lower().str.strip()
ny_df['dba'] = ny_df['dba'].str.lower().str.strip()

# setup an empty dictionary to store matched IDs with 'dba' as keys and Yelp IDs as values
matched_ids = {}

# defined a set of stopwords
stop_words = set(stopwords.words('english'))

# for loop to iterate through businesses in the Yelp dataframe
for index, yelp_row in yelp_df.iterrows():
    yelp_name = yelp_row['name']
    yelp_tokens = set(word_tokenize(yelp_name))
    
    best_match_score = float('inf')  # Initialize with a high value
    best_match_id = None

    # second for loop to iterate through businesses in the NY Open Data dataframe
    for ny_index, ny_row in ny_df.iterrows():
        ny_dba = ny_row['dba']
        
        try:
            ny_tokens = set(word_tokenize(ny_dba))
        except TypeError:
            continue  # Skip entries that cannot be tokenized
        
        # Calculate Jaccard similarity and remove stopwords
        jaccard_sim = 1 - jaccard_distance(yelp_tokens - stop_words, ny_tokens - stop_words)

        # Update the best match if the similarity score is higher
        if jaccard_sim < best_match_score:
            best_match_score = jaccard_sim
            best_match_id = yelp_row['id']

    # Check if the best match meets your similarity threshold
    if best_match_score <= 0.2:  # You can adjust the threshold based on your data
        matched_ids[yelp_name] = best_match_id

[nltk_data] Downloading package stopwords to C:\Users\Marc
[nltk_data]     Aradillas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Marc
[nltk_data]     Aradillas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
