# YELP Fusion API

* [Fusion Authentication page](https://docs.developer.yelp.com/docs/fusion-authentication)  
* [Yelp API docs](https://docs.developer.yelp.com/docs/getting-started)


##### I followed the links above to research how to retrieve an API key:

1. Access the site
2. Sign up for the Yelp website (sign in if you have an existing account)
3. Click 'Create App'
4. Fill in the required information and select Submit
5. Record and save your API key in a .py file
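
Step 5 can be sketched roughly as below. This is only an illustration: the notebook imports `YELP_API_KEY` from a local `env.py`, while the environment-variable fallback, the placeholder key, and the helper name `build_headers` are assumptions added here.

```python
import os


def build_headers(api_key):
    """Return request headers that carry the key as a bearer token."""
    if not api_key:
        raise ValueError("API key is empty; check env.py or the environment")
    return {
        "accept": "application/json",
        "Authorization": f"Bearer {api_key}",
    }


# Hypothetical fallback: read the key from an environment variable so it never
# has to be committed. The default value is a placeholder, not a real key.
api_key = os.environ.get("YELP_API_KEY", "demo-key-not-real")
headers = build_headers(api_key)
```

Whichever way the key is stored, keeping `env.py` out of version control (for example through `.gitignore`) avoids leaking it.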

## **API Usage**

## Daily API limit: *500*   **(Reviews: 3 per business)**
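
To get a feel for what these limits mean, here is a small back-of-the-envelope sketch. It assumes one reviews call per business and uses the two numbers recorded above; the function name `review_budget` and the example of 1,200 businesses are made up for illustration.

```python
import math

DAILY_CALL_LIMIT = 500      # daily limit recorded in this notebook
REVIEWS_PER_BUSINESS = 3    # reviews returned per business, per this notebook


def review_budget(n_businesses, daily_limit=DAILY_CALL_LIMIT):
    """Estimate days needed and reviews collected at one call per business."""
    days = math.ceil(n_businesses / daily_limit)
    max_reviews = n_businesses * REVIEWS_PER_BUSINESS
    return days, max_reviews


# Hypothetical example: 1,200 businesses to look up
days, max_reviews = review_budget(1200)
print(f"{days} day(s) of quota for at most {max_reviews} reviews")
```

Any calls spent on the business search itself would come out of the same daily budget, so the real number of days is probably a bit higher.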

In [1]:
## I may have to web scrape for a list of business ids to return reviews

### Maybe build a real-time data pipeline to store the data in a database?

In [28]:
import requests
from env import YELP_ID, YELP_API_KEY, yelp_locale_url

headers = {
    "accept": "application/json",
    "Authorization": F"Bearer {YELP_API_KEY}"
}

response = requests.get(yelp_locale_url, headers=headers)

In [29]:
response

<Response [200]>

## Checked the API credentials, the endpoint connection, and data retrieval; next I will look for a pagination method based on the Yelp Fusion API parameters

In [None]:
r = response.json()

In [33]:
r['businesses'][1]

{'id': 'WG639VkTjmK5dzydd1BBJA',
 'alias': 'rubirosa-new-york-2',
 'name': 'Rubirosa',
 'image_url': 'https://s3-media3.fl.yelpcdn.com/bphoto/l0Phrnhhj78RFiDhLIOUyQ/o.jpg',
 'is_closed': False,
 'url': 'https://www.yelp.com/biz/rubirosa-new-york-2?adjust_creative=_cn4uqd0Qxmkxg3N98CKKQ&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=_cn4uqd0Qxmkxg3N98CKKQ',
 'review_count': 3192,
 'categories': [{'alias': 'italian', 'title': 'Italian'},
  {'alias': 'pizza', 'title': 'Pizza'}],
 'rating': 4.5,
 'coordinates': {'latitude': 40.722766, 'longitude': -73.996233},
 'transactions': ['pickup'],
 'price': '$$',
 'location': {'address1': '235 Mulberry St',
  'address2': '',
  'address3': '',
  'city': 'New York',
  'zip_code': '10012',
  'country': 'US',
  'state': 'NY',
  'display_address': ['235 Mulberry St', 'New York, NY 10012']},
 'phone': '+12129650500',
 'display_phone': '(212) 965-0500',
 'distance': 1922.0346803084792}

In [4]:
# # Make the GET request
# response = requests.get(url, headers=headers)

# # Check the response status
# if response.status_code == 200:
#     data = response.json()
#     # You can now work with the data from the response
#     # For example, you can access the reviews with data['reviews']
# else:
#     print(f"Request failed with status code {response.status_code}")

## Per the Yelp Fusion API reference, the approach below paginates the search results to first retrieve the business ids, which we can then use to search for and retrieve reviews for those businesses.

In [35]:
# required imports to implement pagination
import requests
import pandas as pd
import time
# env.py imports
from env import YELP_ID, YELP_API_KEY, yelp_locale_url

# API credentials and endpoint
api_key = YELP_API_KEY
api_endpoint = 'https://api.yelp.com/v3/businesses/search'

# Pagination parameters per the API businesses reference page
limit = 50  # Number of results per page
offset = 0  # Start with the first page

all_businesses = []  # List to store all business data

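# Create the request URL with the 'limit' and 'offset' parameters
# NOTE: the URL is built once, before the loop, so the updated 'offset'
# below never reaches the request and every call asks for the same page
url = f'{api_endpoint}?location=New+York&limit={limit}&offset={offset}'

# While loop to make calls and retrieve data in manageable chunks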
while True:
    headers = {
        'Authorization': f'Bearer {YELP_API_KEY}'
    }

    response = requests.get(url, headers=headers)

    # response code check
    if response.status_code != 200:
        print(f"Failed to retrieve data. Status code: {response.status_code}")
        break

    businesses = response.json().get('businesses', [])

    if not businesses:
        # No more results to fetch
        break

    # add data to all_businesses list
    all_businesses.extend(businesses)
    offset += limit  # Move to the next page

    # Sleep for 60 seconds to respect QPS rate limiting
    time.sleep(60)

# Now 'all_businesses' contains all the retrieved business data

# Convert the data into a DataFrame
df = pd.DataFrame(all_businesses)

Failed to retrieve data. Status code: 503
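
A 503 is usually a temporary server-side failure, so one option is to retry a few times with growing pauses instead of stopping the whole run. The sketch below is an assumption about how that could look, not what this notebook ran; `get_with_retry` and its parameters are hypothetical names, and the fetch is passed in as a callable so the helper works with any HTTP client.

```python
import time


def get_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-5xx status or retries run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute (for example a lambda wrapping requests.get).
    """
    response = fetch()
    for attempt in range(retries):
        if response.status_code < 500:
            break
        sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
        response = fetch()
    return response
```

With `requests`, the call inside the loop could become `response = get_with_retry(lambda: requests.get(url, headers=headers))`.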


## Looking at the Yelp business dataframe to make sure good data is being used for cross-referencing

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29200 entries, 0 to 29199
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             29200 non-null  object 
 1   alias          29200 non-null  object 
 2   name           29200 non-null  object 
 3   image_url      29200 non-null  object 
 4   is_closed      29200 non-null  bool   
 5   url            29200 non-null  object 
 6   review_count   29200 non-null  int64  
 7   categories     29200 non-null  object 
 8   rating         29200 non-null  float64
 9   coordinates    29200 non-null  object 
 10  transactions   29200 non-null  object 
 11  price          28873 non-null  object 
 12  location       29200 non-null  object 
 13  phone          29200 non-null  object 
 14  display_phone  29200 non-null  object 
 15  distance       29200 non-null  float64
dtypes: bool(1), float64(2), int64(1), object(12)
memory usage: 3.4+ MB


## After checking it, I decided to cache the data in a CSV file for easier retrieval; if time permits, I may revisit this and pipeline live data into a database as it is updated.

In [None]:
df.to_csv('yelp_businesses.csv', index=False)

In [None]:
df.head(3)

## I loaded the NY data that Nick prepped to work out how to match it against the dataframe here, so that we have usable business ids for retrieving each business's reviews. This may involve a library outside our current knowledge.

### - I'm considering spaCy or FuzzyWuzzy, which uses a similarity ratio from a matching algorithm, but I think I'm just going to use the Python NLTK library.

## Using pretrained NLP models did not work, so I used pandas and my PC's CPU cores to parallelize and speed up the matching process.

### I looked at the NY data again to see how many unique businesses and camis there are.

In [219]:
ny_df = pd.read_csv('ny.csv')

In [220]:
ny_df.head(3)

Unnamed: 0,camis,dba,boro,building,street,zipcode,phone,inspection_date,critical_flag,record_date,...,bbl,nta,cuisine_description,action,violation_code,violation_description,score,grade,grade_date,inspection_type
0,50106756,UNGARO COAL FIRED PIZZA CAFE,Staten Island,1298,FOREST AVENUE,10302.0,6464690930,1900-01-01T00:00:00.000,Not Applicable,2023-10-26T06:00:14.000,...,5003870000.0,SI07,,,,,,,,
1,50105716,STELLA'S,Brooklyn,559,5 AVENUE,11215.0,4155703174,1900-01-01T00:00:00.000,Not Applicable,2023-10-26T06:00:14.000,...,3010480000.0,BK37,,,,,,,,
2,41168748,DUNKIN,Bronx,880,GARRISON AVENUE,10474.0,7188614171,2022-03-30T00:00:00.000,Not Critical,2023-10-26T06:00:11.000,...,2027390000.0,BX27,Donuts,Violations were cited in the following area(s).,10J,Hand wash sign not posted,13.0,A,2022-03-30T00:00:00.000,Cycle Inspection / Initial Inspection


In [228]:
ny_df.latitude

0         40.626371
1         40.665416
2         40.816753
3         40.691730
4         40.632489
            ...    
207924    40.701806
207925    40.861919
207926    40.727022
207927    40.704018
207928     0.000000
Name: latitude, Length: 207929, dtype: float64

In [227]:
ny_df.longitude

0        -74.133111
1        -73.989417
2        -73.892364
3        -73.864648
4        -74.010704
            ...    
207924   -73.808704
207925   -73.843318
207926   -74.007378
207927   -74.012638
207928     0.000000
Name: longitude, Length: 207929, dtype: float64

In [212]:
ny_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207929 entries, 0 to 207928
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   camis                  207929 non-null  int64  
 1   dba                    207421 non-null  object 
 2   boro                   207929 non-null  object 
 3   building               207578 non-null  object 
 4   street                 207923 non-null  object 
 5   zipcode                205249 non-null  float64
 6   phone                  207922 non-null  object 
 7   inspection_date        207929 non-null  object 
 8   critical_flag          207929 non-null  object 
 9   record_date            207929 non-null  object 
 10  latitude               207672 non-null  float64
 11  longitude              207672 non-null  float64
 12  community_board        204682 non-null  float64
 13  council_district       204678 non-null  float64
 14  census_tract           204678 non-nu

In [213]:
unique_business_names = ny_df[['camis', 'dba']].drop_duplicates()

In [221]:
unique_business_names = unique_business_names.dropna(subset=['dba'])

In [225]:
unique_business_names.to_csv('ny_dbas.csv', index=False)

In [226]:
unique_business_names.dba

0         UNGARO COAL FIRED PIZZA CAFE
1                             STELLA'S
2                               DUNKIN
3                        EXTACY LOUNGE
4                         RUNNING KIDS
                      ...             
207591         KENNEDY CHICKEN & PIZZA
207635            LOS PERROS DE CHUCHO
207708                EMPLOYEE FEEDING
207830                    TAPAS ON LEX
207894                         MONARCH
Name: dba, Length: 27727, dtype: object

In [88]:
unique_business_names_list = unique_business_names.values.tolist()

In [150]:
unique_business_names_list[0:3]

[[50106756, 'UNGARO COAL FIRED PIZZA CAFE'],
 [50105716, "STELLA'S"],
 [41168748, 'DUNKIN']]

In [92]:
unique_business_names_count = ny_df[['camis', 'dba']].drop_duplicates().nunique()
unique_business_names_count

camis    28232
dba      22114
dtype: int64

#### I found that there are more unique ids (camis) than dbas, so cross-referencing by dba alone may be a better way to match dbas to Yelp ids in order to get reviews
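
More camis than dbas means some names are shared by several establishments, so a name-only match can be ambiguous. A minimal sketch of how to list those shared names follows; the helper `camis_by_dba` is hypothetical, and the third row is a made-up second location added only to show the effect.

```python
from collections import defaultdict


def camis_by_dba(rows):
    """Map each cleaned dba name to the set of camis ids that use it."""
    groups = defaultdict(set)
    for camis, dba in rows:
        if isinstance(dba, str) and dba.strip():
            groups[dba.strip().lower()].add(camis)
    return dict(groups)


# The first two rows come from the output above; the last is illustrative only
rows = [
    (50106756, "UNGARO COAL FIRED PIZZA CAFE"),
    (41168748, "DUNKIN"),
    (99999999, "Dunkin "),   # hypothetical camis for a second location
]
groups = camis_by_dba(rows)
ambiguous = {name: ids for name, ids in groups.items() if len(ids) > 1}
print(ambiguous)
```

For names that show up here, something beyond the name (phone, zip code, or coordinates, all present in both datasets) would be needed to pick the right establishment.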

## Matching ny businesses with yelp ids for review retrieval using NLTK/Regex (no success)

In [None]:
# # decided to go with NLTK/Regex versus the pretrained models 
# import pandas as pd
# import nltk
# from nltk.tokenize import word_tokenize
# from nltk.metrics import jaccard_distance
# from nltk.corpus import stopwords
# import re
# import unicodedata
# import tensorflow as tf


# # Load NLTK stopwords
# # nltk.download('stopwords')
# # nltk.download('punkt')

# # load dataframes
# yelp_df = pd.read_csv('yelp_businesses.csv')
# ny_df = pd.read_csv('ny.csv')

# # Preprocess the 'name' and 'dba' columns
# def basic_clean(text_data):
#     text_data = text_data.lower()
#     text_data = unicodedata.normalize('NFKD', text_data).encode('ascii', 'ignore').decode('utf-8', 'ignore')
#     text_data = re.sub(r'[^a-z0-9\s]', '', text_data)
#     return text_data

# # Preprocess the 'name' and 'dba' columns (lowercased the string values and used strip for the white spaces
# yelp_df['name'] = yelp_df['name'].str.lower().str.strip()
# ny_df['dba'] = ny_df['dba'].str.lower().str.strip()

# # setup an empty dictionary to store matched IDs with 'dba' as keys and Yelp IDs as values
# matched_ids = {}

# # defined a set of stopwords
# stop_words = set(stopwords.words('english'))

# # for loop to iterate through businesses in the Yelp dataframe
# for index, yelp_row in yelp_df.iterrows():
#     yelp_name = yelp_row['name']
#     yelp_tokens = set(word_tokenize(yelp_name))
    
#     best_match_score = float('-inf')  # Initialize with a low value
#     best_match_dba = None

#     # second for loop to iterate through businesses in the NY Open Data dataframe
#     for ny_index, ny_row in ny_df.iterrows():
#         ny_dba = ny_row['dba']

#         try:
#             ny_tokens = set(word_tokenize(ny_dba))
#         except TypeError:
#             continue  # Skip entries that cannot be tokenized

#         # Remove stopwords, then calculate Jaccard similarity
#         jaccard_sim = 1 - jaccard_distance(yelp_tokens - stop_words, ny_tokens - stop_words)

#         # Update the best match if the similarity score is higher
#         if jaccard_sim > best_match_score:
#             best_match_score = jaccard_sim
#             best_match_dba = ny_dba

#     # Keep the best match only if it meets the similarity threshold
#     if best_match_score >= 0.8:  # You can adjust the threshold based on your data
#         matched_ids[best_match_dba] = yelp_row['id']
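
For reference, the token-overlap idea from the commented-out cell can be sketched in a much smaller, standard-library-only form. This is not the NLTK code above: it swaps in a plain regex tokenizer and skips stopword removal, and the function names are made up. The point it illustrates is that Jaccard similarity is "higher is better", so the best match is the candidate with the largest score at or above the threshold.

```python
import re


def tokens(name):
    """Lowercase a name and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard_similarity(a, b):
    """Return the size of the token intersection over the token union."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # avoid dividing by zero when both names are empty
    return len(ta & tb) / len(ta | tb)


def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if below the threshold."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = jaccard_similarity(name, candidate)
        if score > best_score:   # higher similarity is better
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

Comparing every Yelp name against every NY row is still quadratic, which is probably why the nested `iterrows()` loops took so long; running this against the deduplicated dba list rather than all 207,929 rows would cut the work considerably.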

In [1]:
import sys
sys.executable

'C:\\tools\\Anaconda3\\python.exe'

In [2]:
!python -m spacy info

spaCy version    3.7.2                         
Location         C:\Users\Marc Aradillas\AppData\Local\Programs\Python\Python311\Lib\site-packages\spacy
Platform         Windows-10-10.0.23570-SP0     
Python version   3.11.4                        
Pipelines        en_core_web_sm (3.7.0)        



In [3]:
!python -m spacy validate



| Loading compatibility table...
[+] Loaded compatibility table
[i] spaCy installation: C:\Users\Marc Aradillas\AppData\Local\Programs\Python\Python311\Lib\site-packages\spacy

NAME             SPACY            VERSION
en_core_web_sm   >=3.7.0,<3.8.0   3.7.0   [+]



In [4]:
!python --version


Python 3.11.4


In [7]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)

In [6]:
import spacy
spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Finally got spaCy to be recognized!

In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")

## NLTK took too long, so I am now going to use spaCy
## spaCy also took a very long time (no success)

In [None]:
# import pandas as pd
# import nltk
# from nltk.metrics import jaccard_distance
# import spacy
# import re

# # Load the spaCy model
# import spacy

# # Load the spaCy model by its name
# nlp = spacy.load('en_core_web_sm')

# # Load dataframes
# yelp_df = pd.read_csv('yelp_businesses.csv')
# ny_df = pd.read_csv('ny.csv')

# # Preprocess the 'name' and 'dba' columns
# def basic_clean(text_data):
#     text_data = text_data.lower()
#     text_data = re.sub(r'[^a-z0-9\s]', '', text_data)
#     return text_data

# # # Preprocess the 'name' and 'dba' columns (lowercased the string values and used strip for the white spaces)
# yelp_df['name'] = yelp_df['name'].str.lower().str.strip()
# ny_df['dba'] = ny_df['dba'].str.lower().str.strip()

# # Set up an empty dictionary to store matched IDs with 'dba' as keys and Yelp IDs as values
# matched_ids = {}

# # For text preprocessing and similarity calculation
# for index, yelp_row in yelp_df.iterrows():
#     yelp_name = yelp_row['name']
#     yelp_doc = nlp(basic_clean(yelp_name))

#     best_match_score = float('-inf')  # Initialize with a low value
#     best_match_dba = None

#     ny_df = ny_df[ny_df['dba'].notnull() & ny_df['dba'].apply(lambda x: isinstance(x, str))]


#     # Second loop to iterate through businesses in the NY Open Data dataframe
#     for ny_index, ny_row in ny_df.iterrows():
#         ny_dba = ny_row['dba']
#         ny_doc = nlp(basic_clean(ny_dba))

#         # Calculate Jaccard similarity (Jaccard similarity is not part of spaCy, so we still use NLTK for that)
#         jaccard_sim = 1 - jaccard_distance(set([token.text for token in yelp_doc]),
#                                            set([token.text for token in ny_doc]))

#         # Update the best match if the similarity score is higher
#         if jaccard_sim > best_match_score:
#             best_match_score = jaccard_sim
#             best_match_dba = ny_dba

#     # Keep the best match only if it meets the similarity threshold
#     if best_match_score >= 0.8:  # You can adjust the threshold based on your data
#         matched_ids[best_match_dba] = yelp_row['id']

## I am now going to use my PC's 12 cores to divvy up the work and also implement FuzzyWuzzy, a Python library for string matching
### *** FuzzyWuzzy was ultimately unnecessary. ***

In [2]:
pip install fuzzywuzzy

Defaulting to user installation because normal site-packages is not writeable
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
Note: you may need to restart the kernel to use updated packages.




In [3]:
!python -m fuzzywuzzy validate

C:\Users\Marc Aradillas\AppData\Local\Programs\Python\Python311\python.exe: No module named fuzzywuzzy


In [None]:
# C:\Users\Marc Aradillas\AppData\Roaming\Python\Python310\site-packages\fuzzywuzzy\fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
#   warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
pip install python-Levenshtein
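
Since the warning above mentions `SequenceMatcher`, it is worth noting that the standard library's `difflib` offers the same kind of ratio-based matching without an extra install. The sketch below is an alternative I did not run in this notebook; `similarity`, `closest_name`, and the 0.85 cutoff are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a, b):
    """Return a 0-1 similarity ratio between two lowercased names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closest_name(name, candidates, cutoff=0.85):
    """Return the closest candidate at or above cutoff, or None."""
    lowered = {c.lower(): c for c in candidates}
    hits = get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


print(closest_name("jacobs pickles", ["JACOB'S PICKLES", "MAISON PICKLE"]))
```

This tolerates small spelling differences such as a missing apostrophe, which an exact string comparison would miss.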

### Decided to read back what I had cached and investigate my Yelp data

In [154]:
import pandas as pd
yelp_df = pd.read_csv('yelp_businesses.csv')

In [155]:
yelp_df.head(1)

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance
0,veq1Bl1DW3UWMekZJUsG1Q,gramercy-tavern-new-york,Gramercy Tavern,https://s3-media2.fl.yelpcdn.com/bphoto/f14WAm...,False,https://www.yelp.com/biz/gramercy-tavern-new-y...,3403,"[{'alias': 'newamerican', 'title': 'American (...",4.5,"{'latitude': 40.73844, 'longitude': -73.98825}",['delivery'],$$$$,"{'address1': '42 E 20th St', 'address2': '', '...",12124770000.0,(212) 477-0777,3695.639928
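
After the round trip through CSV, nested columns such as `coordinates`, `categories`, and `location` come back as strings rather than dicts and lists. A minimal sketch of turning them back into Python objects is shown below, using `ast.literal_eval`; the helper name `parse_cell` is hypothetical.

```python
import ast


def parse_cell(cell):
    """Turn a stringified Python literal from the CSV back into an object."""
    if not isinstance(cell, str):
        return cell          # already parsed, or a missing value
    try:
        return ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return cell          # leave anything unparseable untouched


coords = parse_cell("{'latitude': 40.73844, 'longitude': -73.98825}")
transactions = parse_cell("['delivery']")
print(coords["latitude"], transactions)
```

It could be applied per column, for example `yelp_df['coordinates'] = yelp_df['coordinates'].apply(parse_cell)`. The `phone` column also changed on reload (it shows up as a float above); passing `dtype={'phone': str}` to `pd.read_csv` should keep it as text.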


## Finding unique ids and names from the Yelp data

In [199]:
# Preprocess the 'name' column
yelp_df['name'] = yelp_df['name'].str.lower().str.strip()

In [200]:
yelp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29200 entries, 0 to 29199
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             29200 non-null  object 
 1   alias          29200 non-null  object 
 2   name           29200 non-null  object 
 3   image_url      29200 non-null  object 
 4   is_closed      29200 non-null  bool   
 5   url            29200 non-null  object 
 6   review_count   29200 non-null  int64  
 7   categories     29200 non-null  object 
 8   rating         29200 non-null  float64
 9   coordinates    29200 non-null  object 
 10  transactions   29200 non-null  object 
 11  price          28873 non-null  object 
 12  location       29200 non-null  object 
 13  phone          28266 non-null  float64
 14  display_phone  28266 non-null  object 
 15  distance       29200 non-null  float64
dtypes: bool(1), float64(3), int64(1), object(11)
memory usage: 3.4+ MB


In [204]:
unique_businesses = yelp_df.drop_duplicates(subset='id')
unique_businesses

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance
0,veq1Bl1DW3UWMekZJUsG1Q,gramercy-tavern-new-york,gramercy tavern,https://s3-media2.fl.yelpcdn.com/bphoto/f14WAm...,False,https://www.yelp.com/biz/gramercy-tavern-new-y...,3403,"[{'alias': 'newamerican', 'title': 'American (...",4.5,"{'latitude': 40.73844, 'longitude': -73.98825}",['delivery'],$$$$,"{'address1': '42 E 20th St', 'address2': '', '...",1.212477e+10,(212) 477-0777,3695.639928
1,ysqgdbSrezXgVwER2kQWKA,julianas-brooklyn-3,gramercy tavern,https://s3-media2.fl.yelpcdn.com/bphoto/od36nF...,False,https://www.yelp.com/biz/julianas-brooklyn-3?a...,2700,"[{'alias': 'pizza', 'title': 'Pizza'}]",4.5,"{'latitude': 40.70274718768062, 'longitude': -...",['delivery'],$$,"{'address1': '19 Old Fulton St', 'address2': '...",1.718597e+10,(718) 596-6700,318.876261
2,nRO136GRieGtxz18uD61DA,eleven-madison-park-new-york,gramercy tavern,https://s3-media1.fl.yelpcdn.com/bphoto/s_H7gm...,False,https://www.yelp.com/biz/eleven-madison-park-n...,2451,"[{'alias': 'newamerican', 'title': 'American (...",4.5,"{'latitude': 40.7416907417333, 'longitude': -7...",[],$$$$,"{'address1': '11 Madison Ave', 'address2': '',...",1.212889e+10,(212) 889-0905,4062.929570
3,h37t9rA06Sr4EetJjKrfzw,don-angie-new-york,gramercy tavern,https://s3-media2.fl.yelpcdn.com/bphoto/onJX6_...,False,https://www.yelp.com/biz/don-angie-new-york?ad...,785,"[{'alias': 'italian', 'title': 'Italian'}, {'a...",4.5,"{'latitude': 40.73778, 'longitude': -74.00197}",['delivery'],$$$,"{'address1': '103 Greenwich Ave', 'address2': ...",1.212890e+10,(212) 889-8884,3646.541688
4,O1fUmxt3kbV-rnyjBtzAfw,thep-thai-restaurant-new-york-5,gramercy tavern,https://s3-media3.fl.yelpcdn.com/bphoto/WymEpZ...,False,https://www.yelp.com/biz/thep-thai-restaurant-...,2790,"[{'alias': 'thai', 'title': 'Thai'}, {'alias':...",4.5,"{'latitude': 40.77078, 'longitude': -73.95727}","['pickup', 'delivery']",$$,"{'address1': '1439 2nd Ave', 'address2': '', '...",1.212900e+10,(212) 899-9995,7893.531804
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17343,0IFDnYf3bhqxJR6hVrG7Gw,top-thai-vintage-new-york-3,gramercy tavern,https://s3-media3.fl.yelpcdn.com/bphoto/-ZoEVV...,False,https://www.yelp.com/biz/top-thai-vintage-new-...,1126,"[{'alias': 'thai', 'title': 'Thai'}, {'alias':...",4.5,"{'latitude': 40.729907419973344, 'longitude': ...","['delivery', 'restaurant_reservation', 'pickup']",$$$,"{'address1': '55 Carmine St', 'address2': None...",1.646609e+10,(646) 609-2272,2835.775712
17496,pZWhRtCJvTuwWoavaiCQrA,c-as-in-charlie-new-york,gramercy tavern,https://s3-media1.fl.yelpcdn.com/bphoto/s_IHLE...,False,https://www.yelp.com/biz/c-as-in-charlie-new-y...,191,"[{'alias': 'korean', 'title': 'Korean'}, {'ali...",4.5,"{'latitude': 40.72545, 'longitude': -73.99269}",[],,"{'address1': '5 Bleecker St', 'address2': None...",,,2223.191015
17498,lWOkeS-wV4no8qqA9OwwEg,doughnut-plant-new-york-6,gramercy tavern,https://s3-media1.fl.yelpcdn.com/bphoto/hLWKXs...,False,https://www.yelp.com/biz/doughnut-plant-new-yo...,3444,"[{'alias': 'donuts', 'title': 'Donuts'}, {'ali...",4.5,"{'latitude': 40.716337, 'longitude': -73.988577}","['pickup', 'delivery']",$$,"{'address1': '379 Grand St', 'address2': '', '...",1.212505e+10,(212) 505-3700,1295.265121
17547,Ys9iYSjuFDpZj7B1X07v5g,liberty-bagels-midtown-new-york-3,gramercy tavern,https://s3-media1.fl.yelpcdn.com/bphoto/o7Lmy_...,False,https://www.yelp.com/biz/liberty-bagels-midtow...,1444,"[{'alias': 'breakfast_brunch', 'title': 'Break...",4.5,"{'latitude': 40.75255, 'longitude': -73.99249}","['delivery', 'pickup']",$,"{'address1': '260 W 35th St', 'address2': '', ...",1.212279e+10,(212) 279-1124,5222.450101


In [167]:
unique_business_id = yelp_df[['id', 'name']].drop_duplicates()

In [170]:
unique_business_id_list = unique_business_id.values.tolist()

In [171]:
unique_business_id_list[0:3]

[['veq1Bl1DW3UWMekZJUsG1Q', 'gramercy tavern'],
 ['ysqgdbSrezXgVwER2kQWKA', "juliana's"],
 ['nRO136GRieGtxz18uD61DA', 'eleven madison park']]

In [172]:
unique_business_id_count = yelp_df[['id', 'name']].drop_duplicates().nunique()
unique_business_id_count

id      222
name    218
dtype: int64

### Found only 218 unique business names, which means I retrieved many duplicates. I have to make another API call, this time hopefully without returning duplicate businesses, so that I have more ids to match against the dba column of the New York Open Data.
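
A likely reason for the duplicates is that the request URL in the first pagination cell is built once, before the loop, so every call asks for offset 0. Below is a sketch of paging with a fresh offset per request while keeping only unseen ids. The fetch function is passed in so the logic can be tried without network access; `collect_unique` is a hypothetical name and `max_results=1000` is an assumed safety cap, not a value taken from the Yelp docs.

```python
def collect_unique(fetch_page, limit=50, max_results=1000):
    """Page through results with a fresh offset per call, keeping unique ids.

    fetch_page(offset, limit) is any callable returning a list of business
    dicts; max_results is an assumed safety cap, not a documented value.
    """
    seen, unique = set(), []
    for offset in range(0, max_results, limit):
        page = fetch_page(offset, limit)
        if not page:
            break                      # no more results
        for business in page:
            if business["id"] not in seen:
                seen.add(business["id"])
                unique.append(business)
    return unique
```

With `requests`, `fetch_page` could wrap `requests.get(api_endpoint, headers=headers, params={"location": "New York", "limit": limit, "offset": offset})` and return `.json().get("businesses", [])`, which also avoids building the query string by hand.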

## I'm going to just create unique instances and generate a list using parallelization

In [94]:
import pandas as pd
import concurrent.futures

# Load dataframes
yelp_df = pd.read_csv('yelp_businesses.csv')
ny_df = pd.read_csv('ny.csv')

# Preprocess the 'name' and 'dba' columns
yelp_df['name'] = yelp_df['name'].str.lower().str.strip()
ny_df['dba'] = ny_df['dba'].str.lower().str.strip()

# Step 1: Create a set of unique business DBAs from the NY DataFrame
unique_ny_dbas = set(ny_df['dba'])

# Step 2: Initialize a dictionary to store matching Yelp IDs
matching_yelp_ids = {}

# Create a function to find the first matching ID for parallel processing
def find_matching_id(yelp_name, yelp_id, ny_df):
    matching_id = None
    for ny_row in ny_df['dba']:
        if yelp_name == ny_row:
            matching_id = yelp_id
            break  # Stop searching after the first match
    return yelp_name, matching_id

# Use ThreadPoolExecutor with 12 worker threads
with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    # Submit tasks for each yelp name
    future_to_name = {executor.submit(find_matching_id, yelp_name, yelp_id, ny_df): (yelp_name, yelp_id) for yelp_name, yelp_id in zip(yelp_df['name'], yelp_df['id'])}
    
    # Retrieve results as they complete
    for future in concurrent.futures.as_completed(future_to_name):
        yelp_name, matching_id = future.result()
        if matching_id:
            matching_yelp_ids[yelp_name] = matching_id

# Now, matching_yelp_ids contains Yelp IDs that match unique business DBAs from the NY DataFrame
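
Because this is an exact-name comparison, the same result can probably be reached without threads at all by checking each Yelp name against a set of dbas. Threads in Python also do not speed up pure-Python loops like the one above, since only one thread runs Python bytecode at a time. The sketch below is an alternative under those assumptions; `match_by_name` is a hypothetical helper, and unlike the cell above it keeps the first id seen for each name.

```python
def match_by_name(yelp_pairs, ny_dbas):
    """Return {name: yelp_id} for Yelp names that exactly match an NY dba."""
    ny_names = {d.strip().lower() for d in ny_dbas if isinstance(d, str)}
    matches = {}
    for name, yelp_id in yelp_pairs:
        key = name.strip().lower()
        if key in ny_names:                    # O(1) set lookup, no scan
            matches.setdefault(key, yelp_id)   # keep the first id per name
    return matches
```

It could be called as `match_by_name(zip(yelp_df['name'], yelp_df['id']), ny_df['dba'])`; the `isinstance` check skips missing dbas, which pandas stores as float `NaN`.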


In [111]:
# 'matching_yelp_ids' is a dictionary with business names as keys and the matching Yelp ID (a string) as values
data = {'business_name': list(matching_yelp_ids.keys()), 'yelp_ids': list(matching_yelp_ids.values())}

# Create a DataFrame from the data
matched_df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
matched_df.to_csv('matched_businesses.csv', index=False)

In [187]:
matched_df.head(2)

Unnamed: 0,business_name,yelp_ids
0,valerie,zRXMvxUX_rOliKZPpkWi_g
1,crown shy,_0WFGjXuenlBixOlbSlGeQ


In [175]:
matched_df.business_name.to_list()

['valerie',
 'crown shy',
 'ocean prime',
 'mokyo',
 'tomi jazz',
 'soothr',
 'chili',
 'hanoi house',
 'otis',
 'dutch freds',
 "jacob's pickles",
 'ci siamo',
 'her name is han',
 'maison pickle',
 'manhatta',
 'double chicken please',
 'per se',
 'wayla',
 "junior's restaurant",
 'duane park',
 'gotham restaurant',
 'daniel',
 'buddakan',
 'thai villa',
 'eleven madison park',
 'thai diner',
 'momofuku ko',
 'benemon',
 'jua',
 'tara rose',
 'very fresh noodles',
 'jungsik',
 'da gennaro',
 'la contenta',
 'up thai',
 'le bernardin',
 'atera',
 'club a steakhouse',
 'miss ada',
 'bea',
 'fish cheeks',
 'cafe mogador',
 '5ive spice',
 'ippudo ny',
 'gelso & grand',
 'nai',
 "l'artusi",
 "jack's wife freda",
 'marc forgione',
 "katz's delicatessen",
 'piccola cucina osteria',
 'the modern',
 'wayan',
 'don angie',
 'secchu yokota',
 'lilia',
 'jean-georges',
 'joju',
 'le coucou',
 'los tacos no.1',
 'nerai',
 'mission ceviche',
 'city vineyard',
 'bua thai ramen & robata grill',
 'mi

In [183]:
matched_df.business_name.value_counts().sum()

128

  ## ***ONLY 128 MATCHED THE NY DATA***

## I had a lot of duplicates in my Yelp data, so I need to add a break and store only unique ids when calling the API. Hopefully this way I can retrieve 29,200 unique business ids and match more businesses from the NY dataset by name.

## Attempting to use the parallelization technique with all 12 of my CPU cores here.

In [None]:
# required imports to implement pagination and parallelization
import requests
import pandas as pd
import time
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from env import YELP_ID, YELP_API_KEY, yelp_locale_url

# API credentials and endpoint
api_key = YELP_API_KEY
api_endpoint = 'https://api.yelp.com/v3/businesses/search'

# Pagination parameters per the API businesses reference page
limit = 50  # Number of results per page
offset = 0  # Start with the first page

unique_business_ids = set()  # Set to store unique business IDs
all_businesses = []  # List to store all business data

# Define a function to fetch data for a specific page
def fetch_page(offset):
    url = f'{api_endpoint}?location=New+York&limit={limit}&offset={offset}'
    headers = {
        'Authorization': f'Bearer {YELP_API_KEY}'
    }

    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        return response.json().get('businesses', [])
    else:
        return []

# Fetch pages of results through the thread pool
# NOTE: only one page is submitted per iteration, so requests still run one at a time
with ThreadPoolExecutor(max_workers=12) as executor:  # Adjust max_workers as needed
    more_results = True
    while more_results:
        future_to_offset = {executor.submit(fetch_page, offset): offset}
        offset += limit

        # Wait for the tasks to complete
        for future in concurrent.futures.as_completed(future_to_offset):
            businesses = future.result()

            if not businesses:
                # No more results (or a failed request): leave the while loop too
                more_results = False
                break

            for business in businesses:
                business_id = business.get('id')
                if business_id not in unique_business_ids:
                    unique_business_ids.add(business_id)
                    all_businesses.append(business)

        # Sleep for 60 seconds between requests to respect QPS rate limiting
        if more_results:
            time.sleep(60)

# Now 'all_businesses' contains unique business data

# Convert the data into a DataFrame
df = pd.DataFrame(all_businesses)

In [None]:
df.info()

In [None]:
df.to_csv('yelp_businesses.csv', index=False)

## The next step is to make another API call once my limit resets and attempt to retrieve all reviews and append them to the dataframe
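
A rough sketch of that next step is below. It assumes the reviews endpoint lives at `/v3/businesses/{id}/reviews` and returns a `reviews` list, which is my reading of the Fusion docs and has not been run here. `collect_reviews`, `get_json`, and `max_calls` are hypothetical names; the JSON fetcher is passed in so the logic can be checked offline.

```python
def reviews_url(business_id):
    """Build the reviews endpoint URL for one business id."""
    return f"https://api.yelp.com/v3/businesses/{business_id}/reviews"


def collect_reviews(business_ids, get_json, max_calls=500):
    """Fetch reviews for each id, stopping at the daily call budget.

    get_json(url) is any callable returning the decoded JSON body as a
    dict; it is injected so the sketch can run without network access.
    """
    rows = []
    for business_id in list(business_ids)[:max_calls]:
        payload = get_json(reviews_url(business_id))
        for review in payload.get("reviews", []):
            rows.append({"business_id": business_id, **review})
    return rows
```

With `requests`, `get_json` could be `lambda url: requests.get(url, headers=headers).json()`, and `pd.DataFrame(collect_reviews(matched_df['yelp_ids'], get_json))` would give one row per review, ready to join back on the business id.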