# Gathering Data

---

Data was gathered from the [Yelp Fusion API](https://www.yelp.com/developers/documentation/v3/business_search). The API allows a maximum of 5,000 calls per day, but we did not need nearly that many. Each call returns at most 50 businesses, so the data is collected over repeated calls with an increasing offset.

In [None]:
# importing libraries to read .json and store the data into a dataframe

import requests
import time
import pandas as pd
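
The request loops in this notebook need a Yelp API key. One way to avoid pasting the key into a cell is to read it from an environment variable. The sketch below assumes the key has been exported under the name `YELP_API_KEY`; that variable name and both helper functions are hypothetical and not part of the original notebook.

```python
import os


def load_api_key(env_var="YELP_API_KEY"):
    """Return the API key stored in an environment variable.

    The default name YELP_API_KEY is an assumption made for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


def auth_headers(api_key):
    # Bearer-token Authorization header, as used by the request loops.
    return {"Authorization": f"Bearer {api_key}"}
```

With these in place, a cell could build its headers with `auth_headers(load_api_key())` instead of defining the key inline.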

## Setting Up the API Loop to Retrieve Business Data for New York City

We define the boroughs below as the locations for the API to search. Because the API caps each search at 1,000 results, we issue a separate series of calls per borough (20 calls of 50 results each), for up to 5,000 data points in total. This approach does not guarantee that every business we gather is unique: the same business can appear under more than one search, so we later drop duplicates using the business ID as the key.

In [None]:
cities = ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']
data = []
count = 1

for city in cities:
    for i in range(20):
        URL = 'https://api.yelp.com/v3/businesses/search'
        API_KEY = 'YOUR_API_KEY'  # placeholder: substitute your own Yelp Fusion API key
        params = {'location': city,
                  'limit': 50 ,
                  'offset': 50 * i}
        headers = {'Authorization': 'bearer %s' % API_KEY}
        resp = requests.get(url=URL, params = params, headers = headers)
        businesses = resp.json()['businesses']
        # stop paging this borough once a page comes back empty
        if len(businesses) == 0:
            break
        data.extend(businesses)
        print(f'Pulling {count} times.')
        count += 1
        time.sleep(.5)
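
The paging pattern used in these loops (fixed page size, increasing offset, stop on an empty page) can be factored into a small helper. The following is a minimal sketch, not the notebook's actual code: `collect_pages` and `fetch_page` are hypothetical names, and `fake_fetch` stands in for the real API so the logic can be run without a key. In the notebook, `fetch_page` would wrap the `requests.get` call.

```python
def collect_pages(fetch_page, page_size=50, max_results=1000):
    """Collect results page by page until a page comes back empty.

    fetch_page(offset, limit) is a hypothetical callable that returns a
    list of business dicts for one request.
    """
    results = []
    # Offsets 0, 50, ..., 950: at most 20 requests per search.
    for offset in range(0, max_results, page_size):
        page = fetch_page(offset, page_size)
        if not page:  # check the page just fetched, not the running total
            break
        results.extend(page)
    return results


# Stand-in for the API: pretends the search has only 120 matches.
def fake_fetch(offset, limit):
    matches = [{"id": f"biz-{n}"} for n in range(120)]
    return matches[offset:offset + limit]


businesses = collect_pages(fake_fetch)
print(len(businesses))  # 120
```

Checking the page that was just returned matters: once the running list holds anything at all, a test on its total length can never trigger the early exit.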

## Additional Gathering

We found that the data returned by the borough searches was not sufficient for our analysis. By searching on zip codes as well, we expanded our dataset within the Greater New York area by a sizable margin. The cell below searches by zip code using the same method, pulling multiple pages for each zip code.

In [None]:
# creating a list of zip codes for the loop to cycle through
zip_code_list = [
 '10464',
 '10304',
 '10028',
 '11221',
 '11369',
 '11104',
 '10065',
 '11374',
 '10310',
 '11206',
 '10039',
 '10473',
 '11434',
 '10468',
 '11233',
 '10040',
 '10075',
 '11418',
 '10455',
 '11220',
 '10302',
 '11210',
 '10460',
 '11370',
 '11378',
 '11209',
 '11102',
 '10037',
 '11367',
 '11232',
 '11213',
 '10452',
 '10471',
 '11357',
 '10307',
 '10456',
 '11436',
 '11203',
 '10801',
 '10453',
 '11365',
 '11356',
 '10303',
 '10466',
 '11234',
 '10020',
 '11228',
 '10004',
 '11207',
 '11433',
 '11432',
 '11415',
 '10038',
 '11366',
 '10457',
 '11239',
 '10710',
 '11364',
 '10128',
 '10459',
 '10470',
 '11379',
 '10474',
 '10803',
 '11430',
 '10007',
 '11414',
 '11417',
 '11421',
 '11109',
 '10701',
 '10550',
 '11212',
 '11208',
 '10006',
 '11362',
 '10111',
 '10708',
 '10281',
 '10168',
 '11236',
 '10169',
 '11423',
 '10704',
 '11360',
 '10005',
 '11412',
 '10528',
 '10120',
 '10176',
 '10706',
 '10118',
 '10552',
 '10103',
 '10112',
 '11241',
 '10158',
 '11416',
 '10080',
 '11003',
 '11413',
 '10121',
 '10583',
 '10543',
 '10705',
 '10178',
 '10154',
 '10311',
 '10271',
 '10177',
 '10553']

count = 1
for zip_code in zip_code_list:
    for i in range(5):
        URL = 'https://api.yelp.com/v3/businesses/search'
        API_KEY = 'YOUR_API_KEY'  # placeholder: substitute your own Yelp Fusion API key
        params = {'location': zip_code,
                  'limit': 50,
                  'offset': 50 * i}
        headers = {'Authorization': 'bearer %s' % API_KEY}
        resp = requests.get(url=URL, params = params, headers = headers)
        businesses = resp.json()['businesses']
        if len(businesses) == 0:
            break
        data.extend(businesses)
        print(f'Pulling {count} times.')
        count += 1
        time.sleep(.5)

In addition to zip codes, we also used arbitrary street addresses around the city as search locations. We used the list of addresses below, one per borough, and requested up to 1,000 results near each address. Duplicates are still a concern, so they are removed with the same method.

In [None]:
count = 1
addresses = ['129 Elmwood Ave, Brooklyn, NY 11230',
           '10 E 21st St, New York, NY 10010', 
           '1 E 161 St, The Bronx, NY 10451', 
           '123-01 Roosevelt Ave, Queens, NY 11368',
           '2800 Victory Blvd, Staten Island, NY 10314']

for address in addresses:
    for i in range(20):
        URL = 'https://api.yelp.com/v3/businesses/search'
        API_KEY = 'YOUR_API_KEY'  # placeholder: substitute your own Yelp Fusion API key
        params = {'location': address,
                  'limit': 50,
                  'offset': 50 * i}
        headers = {'Authorization': 'bearer %s' % API_KEY}
        resp = requests.get(url=URL, params = params, headers = headers)
        businesses = resp.json()['businesses']
        if len(businesses) == 0:
            break
        data.extend(businesses)
        print(f'Pulling {count} times.')
        count += 1
        time.sleep(.5)
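
As a sanity check against the 5,000-call daily limit, the worst-case number of requests across the three strategies can be tallied directly from the loop sizes. The figures below are read off the cells above (5 boroughs, 111 zip codes, 5 addresses); the variable names are only illustrative, and the real count can be lower because each loop stops early on an empty page.

```python
# Worst-case number of requests per strategy: (locations, pages per location).
# 111 is the number of entries in zip_code_list.
plan = {
    "boroughs": (5, 20),
    "zip codes": (111, 5),
    "addresses": (5, 20),
}
DAILY_LIMIT = 5000    # daily call limit quoted at the top of this notebook
PAUSE_SECONDS = 0.5   # time.sleep between requests

total_calls = sum(locations * pages for locations, pages in plan.values())
print(total_calls)                 # 755
print(total_calls <= DAILY_LIMIT)  # True

# Lower bound on run time, counting only the pauses between requests.
print(total_calls * PAUSE_SECONDS / 60, "minutes")
```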

## Storing the Data

After gathering the data, we store it in a DataFrame and export the DataFrame to a .csv file. The analysis continues in another notebook.

In [None]:
nyc = pd.DataFrame(data)
nyc.drop_duplicates(subset='id', inplace = True)

We check the shape of the data to see how many unique businesses remain after discarding all duplicates.

In [None]:
nyc.shape
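
To see how much the three search strategies overlapped, the raw list of records can also be summarized before duplicates are dropped. This is a small sketch in plain Python; `duplicate_summary` is a hypothetical helper, and the sample records are made up for illustration.

```python
from collections import Counter


def duplicate_summary(records, key="id"):
    """Count total records, unique keys, and keys seen more than once."""
    counts = Counter(record[key] for record in records)
    repeated = sum(1 for n in counts.values() if n > 1)
    return {"total": len(records), "unique": len(counts), "repeated": repeated}


# Toy records standing in for the businesses returned by the API.
sample = [{"id": "a"}, {"id": "b"}, {"id": "a"}, {"id": "c"}, {"id": "a"}]
print(duplicate_summary(sample))
# {'total': 5, 'unique': 3, 'repeated': 1}
```

Calling `duplicate_summary(data)` on the gathered list should report a `unique` count equal to the number of rows left after `drop_duplicates(subset='id')`.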

We export the remaining data to a comma-delimited file. The file is labeled raw because no cleaning has been performed yet; exploratory data analysis continues in the main notebook.

In [None]:
nyc.to_csv('./data/nyc_raw.csv')
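
`to_csv` does not create missing directories, so the export fails if `./data` does not exist yet. A minimal guard, assuming the same relative path as the export cell, is to create the folder before exporting:

```python
from pathlib import Path

# Same relative path as in the export cell.
output_path = Path("./data/nyc_raw.csv")

# Create ./data if it is missing; do nothing if it already exists.
output_path.parent.mkdir(parents=True, exist_ok=True)
print(output_path.parent.is_dir())  # True
```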