# Yelp Scraper Part I

Development notebook for business URI search and lens enhancement engine.

## Database Setup

Rather than attempt to keep information globally shared between modules, a database or a messaging/caching service will be used.  For V0.1, an SQLite implementation is planned.

In [None]:
# Initialize Database
from db import get_db, get_session, init_db

init_db()  # Creates tables/relationships
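As a rough illustration of sharing state through SQLite instead of module-level globals, here is a minimal sketch using only the standard library `sqlite3` module. The `sqlite_session` helper and the `business` table are hypothetical and are not the project's `db` module.

```python
import sqlite3
from contextlib import contextmanager

# Assumption: a real run would use a file path so every module sees the same data.
# An in-memory database is private to its connection.
DB_PATH = ":memory:"


@contextmanager
def sqlite_session(conn):
    """Yield a cursor, commit on success, roll back on error."""
    cur = conn.cursor()
    try:
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()


conn = sqlite3.connect(DB_PATH)

# One "module" writes...
with sqlite_session(conn) as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS business (business_id TEXT PRIMARY KEY, name TEXT)"
    )
    cur.execute("INSERT INTO business VALUES (?, ?)", ("abc123", "Example Cafe"))

# ...and another reads, without either holding the data in a global variable.
with sqlite_session(conn) as cur:
    rows = cur.execute("SELECT business_id, name FROM business").fetchall()
```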

## Yelp Fusion Client

This module uses the yelpapi library to manage connections and requests to Yelp's Fusion API.

In [3]:
from scraper_1_urls import load_environment, get_client

load_environment(from_file=True)
client = get_client()

Environment variables set.


## Lens

Lens is a handler for simple perceptron networks whose goal is to predict (via forward passes) the expected number of businesses within an area.  The Yelp API returns a maximum of 50 businesses per search.  Lens helps tune the jump distance as the scraper adjusts longitude and latitude while searching.

Two approaches will be investigated:

1. Attempt to predict the number of businesses in a given area.

2. Attempt to predict the number of NEW businesses in a given area.

V0.1 will look at the first strategy.
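To make the link between a predicted count and the jump distance concrete, here is a minimal sketch of one possible tuning rule. It is not the Lens implementation. The function name, the clamping bounds, and the assumption that businesses are spread evenly (so the count grows with the square of the hop) are all illustrative.

```python
from math import sqrt

YELP_MAX_RESULTS = 50  # per-search cap stated above


def next_hop(current_hop, predicted_count, limit=YELP_MAX_RESULTS,
             min_hop=0.01, max_hop=1.0):
    """Scale the hop (in degrees) so the predicted count lands near the cap.

    Assumes an even spread of businesses, so the count scales with the
    searched area, i.e. with the square of the hop distance.
    """
    if predicted_count <= 0:
        return max_hop  # nothing expected nearby: take the largest allowed jump
    scaled = current_hop * sqrt(limit / predicted_count)
    return max(min_hop, min(max_hop, scaled))


dense = next_hop(0.1, predicted_count=200)   # too many results expected: shrink
sparse = next_hop(0.1, predicted_count=2)    # few results expected: grow
```

With a prediction four times the cap, the hop halves; with a prediction well under the cap, the hop grows until it hits `max_hop`.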

In [1]:
from lens import fastmap

### Create, train, and update a perceptron

The layer initialization step can be skipped by passing in a new y-vector (in our case, updated data from nearby zones).

> V0.1 will not use update, as some type of copy_on_complete mechanism will be needed, along with more logic to ensure that the quality of y_train and the relations within it don't deviate significantly.  Otherwise, more training may have to be done to 'undo' bad weights.
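For readers unfamiliar with the create/train/update cycle, the following is a minimal pure-Python sketch of a single sigmoid perceptron. It does not reproduce `fastmap`. The function names are hypothetical, and "update" is modeled here by passing existing weights back into `train`, which is an assumption about how skipping initialization might work. The inputs are the test data from the next cell, scaled to [0, 1].

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, weights, bias):
    """Forward pass: weighted sum of the inputs squashed into (0, 1)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def mse(X, y, weights, bias):
    """Mean squared error of the perceptron over all samples."""
    return sum((forward(x, weights, bias) - t) ** 2 for x, t in zip(X, y)) / len(y)


def train(X, y, weights=None, bias=0.0, epochs=2000, lr=0.5, seed=0):
    """Per-sample gradient descent on squared error.

    Passing existing weights skips initialization and continues training.
    """
    if weights is None:
        rng = random.Random(seed)
        weights = [rng.uniform(-0.1, 0.1) for _ in X[0]]
    else:
        weights = list(weights)
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = forward(x, weights, bias)
            delta = (p - t) * p * (1.0 - p)  # error times sigmoid slope
            weights = [w - lr * delta * xi for w, xi in zip(weights, x)]
            bias -= lr * delta
    return weights, bias


X = [[0.0, 0.125, 0.25], [0.375, 0.5, 0.625], [0.75, 0.875, 1.0]]
y = [0.5, 0.65, 1.0]

weights, bias = train(X, y)                      # create and train
predictions = [forward(x, weights, bias) for x in X]

y_new = [0.4, 0.7, 0.9]                          # hypothetical updated targets
weights, bias = train(X, y_new, weights=weights, bias=bias, epochs=200)  # update
```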

In [2]:
# Test Data
import numpy as np
X=np.array([[0,1,2], [3,4,5], [6,7,8]])
y=np.array([0.5,0.65,1])

model_map = fastmap.ModelMap()

In [3]:
model_map.pin_model(X, y, (20,30))

{'geohash': 'set3f8vk6wjr', 'file_location': '/tmp/set3f8vk6wjr.pkl'}

**At this point,** the model is ready for use.  The returned information needs to be stored so predictions can be called when needed.

## Get All Categories by Country

To help limit and organize the search, as well as provide better data to Lens as the search progresses, we will search by category in each direction (scraper path).

Yelp provides a JSON download with all of its categories [here](https://www.yelp.com/developers/documentation/v3/all_category_list/categories.json).

In [1]:
# Load data
import json

with open('categories.json', 'r') as file:
    categories = json.load(file)

In [2]:
len(categories)

1565

In [22]:
# Get all parent categories
def get_countries(category):
    """Return the country whitelist, or ['ALL'] when no whitelist is given."""
    if 'country_whitelist' in category:
        return category['country_whitelist']
    return ['ALL']

def get_parents(category):
    """Return the category's parent aliases as a list (possibly empty)."""
    parent = category['parents']
    if len(parent) == 1:
        return [parent[0]]
    # Zero or several parents: keep only string entries
    return [item for item in parent if isinstance(item, str)]

countries = list(map(get_countries, categories))
parents = list(map(get_parents, categories))
data = {'parent':parents, 'country':countries}

In [23]:
# Build a dataframe with parent data for starters
import pandas as pd

df = pd.DataFrame(data=data)
df.head(5)

Unnamed: 0,parent,country
0,[localservices],[ALL]
1,[italian],[IT]
2,[bars],[CZ]
3,[food],[ALL]
4,[fashion],[ALL]


In [24]:
# Explode each column
df = df.explode('parent')
df = df.explode('country')
display(df.head(), df.describe())

Unnamed: 0,parent,country
0,localservices,ALL
1,italian,IT
2,bars,CZ
3,food,ALL
4,fashion,ALL


Unnamed: 0,parent,country
count,4102,4127
unique,121,34
top,restaurants,ALL
freq,502,1084
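The two successive `explode` calls produce one row per (parent, country) combination, and a category with an empty parent list is kept as a row with a missing parent. That is why the `parent` count above is lower than the `country` count. A minimal pure-Python sketch of the same expansion, using made-up records:

```python
from itertools import product

# Hypothetical records in the same shape as the dataframe rows above
records = [
    {"parent": ["restaurants", "food"], "country": ["US", "CA"]},
    {"parent": ["bars"], "country": ["ALL"]},
    {"parent": [], "country": ["ALL"]},  # top-level category: no parents
]


def explode_pairs(rows):
    """Expand each row into every (parent, country) combination.

    An empty list becomes a single missing value (None), mirroring how
    DataFrame.explode turns an empty list into NaN.
    """
    pairs = []
    for row in rows:
        parents = row["parent"] or [None]
        countries = row["country"] or [None]
        pairs.extend(product(parents, countries))
    return pairs


pairs = explode_pairs(records)
complete = [p for p in pairs if None not in p]  # equivalent of dropna()
```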


In [28]:
# Clean data
df = df.dropna()
df.describe()

Unnamed: 0,parent,country
count,4102,4102
unique,121,34
top,restaurants,ALL
freq,502,1063


In [36]:
# Save Data
from awstools import s3

# Setup bucket link to AWS
bucket = s3.Bucket('yelp-data-shared-labs18')

# Save locally
filename = 'parent_categories_by_country.json'
df.to_json(filename, orient='records')

# Upload to S3 bucket
bucket.save(filename, 'Tables/'+filename)

parent_categories_by_country.json  157156 / 157156.0  (100.00%)

In [3]:
# Get categories table from s3
from awstools import s3
bucket = s3.Bucket('yelp-data-shared-labs18')
bucket.get('Tables/parent_categories_by_country.json', 'parent_cat.json')

# Load json into dataframe and filter
import pandas as pd
cats = pd.read_json('parent_cat.json')
us_cats = set( # Set will maintain only distinct items
    cats.query('country == "US"').parent # Query dataframe and get series
)
us_cats = list(us_cats)  # sets are not subscriptable. cast to list.
us_cats[0:5]

['cannabis_clinics', 'restaurants', 'wholesalers', 'food', 'photographers']

## Scrape Category by (lat, lon)

Now that categories can be searched individually, effectively filtering result streams (and allowing larger hops within a category), we can test the Yelp Fusion API with an example search in the form:

query(category=category, latitude=lat, longitude=lon, limit=limit)

In [2]:
from scraper_1_urls import load_environment, get_client

load_environment(from_file=True)
client = get_client()

Environment variables set.


In [6]:
# Example location (san francisco)
lat = 37.7739
lon = -122.431297
category = 'cannabis_clinics'

search_results = client.search_query(categories=category, latitude=lat, longitude=lon)
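Since each search is capped at 50 businesses, it can help to build and validate the keyword arguments before calling the client. The helper below is a hypothetical sketch, not part of `scraper_1_urls`. It only assembles a dictionary, so it runs without network access.

```python
YELP_MAX_LIMIT = 50  # per-search cap noted in the Lens section


def build_search_params(category, lat, lon, limit=YELP_MAX_LIMIT):
    """Assemble keyword arguments for a category search at a coordinate."""
    if not -90 <= lat <= 90 or not -180 <= lon <= 180:
        raise ValueError("coordinates out of range")
    return {
        "categories": category,
        "latitude": lat,
        "longitude": lon,
        "limit": min(limit, YELP_MAX_LIMIT),  # never ask for more than the cap
    }


params = build_search_params("cannabis_clinics", 37.7739, -122.431297, limit=100)
# With a configured client, the search would then be:
# search_results = client.search_query(**params)
```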

## Data Cleaning and Storage

In situ cleaning and storage in a local database prior to writing to the RDS instance happens in three steps.

1. The data is cleaned and dumped into a pandas dataframe.
2. Each id is checked against the local database of collected businesses for unique ID.
3. Unique elements are written to the database and the dataframe is overwritten by the next search.
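The three steps can be sketched without pandas or a database as follows. The in-memory `table` list and `seen` set stand in for the local database, and the function names and the two-field record are illustrative assumptions.

```python
def clean(raw):
    """Step 1: keep only the fields we store (hypothetical subset)."""
    return [{"business_id": b["id"], "name": b["name"]} for b in raw]


def keep_unseen(records, seen_ids):
    """Step 2: drop records already stored, or repeated within this batch."""
    fresh, batch_ids = [], set()
    for record in records:
        bid = record["business_id"]
        if bid not in seen_ids and bid not in batch_ids:
            fresh.append(record)
            batch_ids.add(bid)
    return fresh


def store(records, seen_ids, table):
    """Step 3: write the new records and remember their IDs."""
    for record in records:
        table.append(record)
        seen_ids.add(record["business_id"])
    return len(records)  # number of unique hits for this search


table, seen = [], set()
batch1 = [{"id": "a", "name": "A"}, {"id": "b", "name": "B"}, {"id": "a", "name": "A"}]
batch2 = [{"id": "b", "name": "B"}, {"id": "c", "name": "C"}]

n1 = store(keep_unseen(clean(batch1), seen), seen, table)
n2 = store(keep_unseen(clean(batch2), seen), seen, table)  # next search replaces the batch
```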

In [13]:
# Dump to dataframe
df = pd.DataFrame(search_results['businesses'])
display(df.head(1), df.columns)

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,location,phone,display_phone,distance
0,mOanZdaJQu4pBKAJShtHrA,foggy-daze-delivery-service-san-francisco,Foggy Daze Delivery Service,https://s3-media4.fl.yelpcdn.com/bphoto/CH_qJY...,False,https://www.yelp.com/biz/foggy-daze-delivery-s...,134,"[{'alias': 'cannabis_clinics', 'title': 'Canna...",5.0,"{'latitude': 37.76468, 'longitude': -122.43193}",[],"{'address1': '2261 Market St', 'address2': 'St...",14152007451,(415) 200-7451,1022.487674


Index(['id', 'alias', 'name', 'image_url', 'is_closed', 'url', 'review_count',
       'categories', 'rating', 'coordinates', 'transactions', 'location',
       'phone', 'display_phone', 'distance'],
      dtype='object')

In [68]:
# Building a cleaning function for searches

def clean_business_search(df: pd.DataFrame) -> pd.DataFrame:
    # Filter and rename columns (filter returns a new frame, so df is untouched)
    temp = df.filter(['id', 'name', 'image_url', 'coordinates',
                      'review_count', 'is_closed', 'url', 'categories',
                      'location', 'rating'])
    temp = temp.rename(columns={'id':'business_id', 'rating': 'stars'})
    
    # change is_closed to is_open (flip bool)
    temp['is_open'] = temp.is_closed.apply(lambda x: not x)
    temp = temp.drop(columns='is_closed')
    
    # parse location to address, city, state, postal_code
    # NOTE: address1 and address2 are joined without a separator, and a
    #       missing address2 (None) becomes the literal string 'None'
    temp['address'] = temp.location.apply(lambda x: x['address1']+str(x['address2']))
    temp['city'] = temp.location.apply(lambda x: x['city'])
    temp['state'] = temp.location.apply(lambda x: x['state'])
    temp['postal_code'] = temp.location.apply(lambda x: x['zip_code'])
    temp = temp.drop(columns='location')
    
    # clean categories down to alias (more similar to parent search)
    temp.categories = temp.categories.apply(lambda x: ','.join([z['alias'] for z in x]))
    
    # parse coordinates
    temp['latitude'] = temp.coordinates.apply(lambda x: x['latitude'])
    temp['longitude'] = temp.coordinates.apply(lambda x: x['longitude'])
    temp = temp.drop(columns='coordinates')
    
    return temp

clean_business_search(df).head(1)

Unnamed: 0,business_id,name,image_url,review_count,url,categories,is_open,address,city,state,postal_code,latitude,longitude
0,mOanZdaJQu4pBKAJShtHrA,Foggy Daze Delivery Service,https://s3-media4.fl.yelpcdn.com/bphoto/CH_qJY...,134,https://www.yelp.com/biz/foggy-daze-delivery-s...,cannabis_clinics,True,2261 Market StSte 289,San Francisco,CA,94114,37.76468,-122.43193


### Test Search Pipeline

Test the search interface

In [2]:
from scraper_1_urls import search

# Example location (san francisco)
lat = 37.7739
lon = -122.431297
category = 'cannabis_clinics'

results = search(latitude=lat, longitude=lon, category=category)

Environment variables set.


### Write to database

Export dataframe to database

In [3]:
from db import get_session
from models import *

with get_session() as session:
    for record in results.to_dict(orient='records'):  # one dict per row
        session.add(Business(**record))
    
    session.commit()
    session.close()

### Filter for unique adds

Inserts must obey the primary key unique constraint.  Filtering also yields the total number of unique hits, which is used for predicting hop distance!

In [17]:
def filter_unique(raw_frame):
    ## Check if id existing, if exists: drop, else keep
    raw_frame['exists'] = raw_frame.business_id.apply(check_exists)
    
    return raw_frame.query('exists == False')

def check_exists(x):
    # Check if business_id already in database
    with get_session() as session:
        exists = session.query(Business.business_id).filter(
            Business.business_id == x
        ).scalar()
        
    # scalar() returns None when no row matches
    return exists is not None

In [18]:
filter_unique(results)

Unnamed: 0,business_id,name,image_url,review_count,url,categories,stars,is_open,address,city,state,postal_code,latitude,longitude,exists


In [19]:
check_exists('somerandomid') # Check false

False
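`check_exists` issues one query per business ID. As a possible alternative, the whole batch could be checked in a single round trip. The sketch below uses the standard library `sqlite3` module and a hypothetical `business` table rather than the project's session and models. A Yelp search returns at most 50 businesses, so the parameter list stays small.

```python
import sqlite3


def existing_ids(conn, candidate_ids):
    """Return the subset of candidate_ids already present in the business table."""
    ids = list(candidate_ids)
    if not ids:
        return set()
    placeholders = ",".join("?" for _ in ids)  # one bound parameter per ID
    rows = conn.execute(
        f"SELECT business_id FROM business WHERE business_id IN ({placeholders})",
        ids,
    ).fetchall()
    return {row[0] for row in rows}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (business_id TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO business VALUES (?)", [("a",), ("b",)])

batch = ["a", "c", "d"]
known = existing_ids(conn, batch)
new_ids = [bid for bid in batch if bid not in known]
```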

### Save Results

Save number of unique results and category to table for training

In [None]:
def write_search_metadata(**kwargs):
    with get_session() as session:
        session.add(SearchResults(**kwargs['record']))
        session.commit()  # commit explicitly, as in the business write above

In [5]:
# Test with scraper class
from app import create_scraper

# Example location (san francisco + .1)
lat = 37.6739
lon = -122.531297
category = 'cannabis_clinics'

scraper = create_scraper(
    city=None,
    radius=100,
    category=category, 
    coordinates=(lat, lon)
)

In [6]:
scraper.search()

## Create ModelMap (Lens)

From a center point and a given radius, create a static perceptron distribution.

> Overlap required.  Search will choose the model with the most data available for training.

#### Create Grid

Create a 2d square matrix of coordinates.

X, Y: square matrix of points where:

> X = Y = 2 * max_radius (so X + Y = 2 * 2 * max_radius)

> num_nodes = ceil(X / point_radius), rounded up to the next odd number

> dist = X / num_nodes

> total_num_nodes = num_nodes^2
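As a worked example of the node count, here are the numbers for `max_radius = 1` and `point_radius = 0.5`, the values used in the test run further down. The bump to an odd count follows the implementation below. That it keeps one node on the center point is an inference, not something the source states.

```python
from math import ceil

max_radius = 1
point_radius = 0.5

X = 2 * max_radius                   # side length of the square: 2
num_nodes = ceil(X / point_radius)   # 2 / 0.5 = 4 nodes per row
if num_nodes % 2 == 0:               # odd count keeps a node on the center
    num_nodes += 1                   # 4 becomes 5
total_num_nodes = num_nodes ** 2     # 5 * 5 = 25 grid points
```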

In [133]:
from math import ceil
import numpy as np


def get_grid_coord(c_lat, c_lon, point_radius, max_radius):
    latitudes = generate_row(center=c_lat, point_radius=point_radius, max_radius=max_radius)
    longitudes = generate_row(center=c_lon, point_radius=point_radius, max_radius=max_radius)
    
    # Build the grid row by row; each point is a (longitude, latitude) tuple
    rows = []
    for latitude in latitudes:
        rows += list(zip(longitudes, [latitude]*len(longitudes)))
        
    return rows


def calc_nodes_per_row(point_radius, max_radius):
    X = 2 * max_radius # X = Y; square matrix
    num_nodes_per_row = int(ceil(X/point_radius)) # no partial nodes, must be odd number
    if num_nodes_per_row % 2 == 0:
        num_nodes_per_row += 1
    return num_nodes_per_row


def calc_distance_between_nodes(num_nodes, max_radius, scale_factor=1):
    return max_radius/(num_nodes-1) * scale_factor


def generate_row(center, point_radius, max_radius):
    num_nodes = calc_nodes_per_row(point_radius=point_radius, max_radius=max_radius)
    
    # Validate that point-radius > dist
    dist = calc_distance_between_nodes(num_nodes=num_nodes, max_radius=max_radius)
    assert point_radius > dist
    
    left = center - max_radius/2
    right = center + max_radius/2
    row = np.linspace(left, right, num_nodes)
    # Validate that longitude_vector same length = num_nodes
    assert len(row) == num_nodes
    return row

# can approach this a couple ways.  One would be to get the two midlines
#    and use the points above and below the center (Y>0, Y<0) to make more lateral rows.

In [135]:
grid = get_grid_coord(
    c_lat = 30,
    c_lon = 31,
    point_radius = 0.5,
    max_radius = 1,
)
display(grid, len(grid))

[(30.5, 29.5),
 (30.75, 29.5),
 (31.0, 29.5),
 (31.25, 29.5),
 (31.5, 29.5),
 (30.5, 29.75),
 (30.75, 29.75),
 (31.0, 29.75),
 (31.25, 29.75),
 (31.5, 29.75),
 (30.5, 30.0),
 (30.75, 30.0),
 (31.0, 30.0),
 (31.25, 30.0),
 (31.5, 30.0),
 (30.5, 30.25),
 (30.75, 30.25),
 (31.0, 30.25),
 (31.25, 30.25),
 (31.5, 30.25),
 (30.5, 30.5),
 (30.75, 30.5),
 (31.0, 30.5),
 (31.25, 30.5),
 (31.5, 30.5)]

25

### Rank best model for prediction

There is some overlap between models.  Starting out, we'll use the model with the most observations, closest to the target.
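One way to express this ranking is a sort with a compound key. The sketch below assumes that observation count takes priority and distance to the target only breaks ties. The dictionary layout and the plain Euclidean distance in degrees are likewise assumptions.

```python
from math import hypot


def rank_models(models, target):
    """Sort candidates: most observations first, nearest center as tie-breaker."""
    t_lat, t_lon = target

    def key(model):
        lat, lon = model["center"]
        return (-model["n_obs"], hypot(lat - t_lat, lon - t_lon))

    return sorted(models, key=key)


# Hypothetical overlapping models around a target point
models = [
    {"name": "far_rich", "center": (38.5, -122.4), "n_obs": 40},
    {"name": "near_rich", "center": (37.9, -122.4), "n_obs": 40},
    {"name": "near_poor", "center": (37.85, -122.4), "n_obs": 3},
]
best = rank_models(models, target=(37.85, -122.4))[0]
```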

### Get data within model limits

Query the database for search metadata within the model's specified radius.

In [136]:
import read_query

In [141]:
read_query.sample_data(
    coordinates = (37.85, -122.4),
    model_radius = 0.8
)

[(37.8739, -122.431297, 'cannabis_clinics', 21),
 (37.8739, -122.531297, 'cannabis_clinics', 0),
 (37.6739, -122.531297, 'cannabis_clinics', 17)]
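The internals of `read_query.sample_data` are not shown here. A minimal sketch of one way such a lookup could work uses a bounding-box filter in degrees. The `search_results` table, its column names, and the box (rather than circle) test are assumptions. The first three rows are the ones returned above, and the fourth is a made-up distant point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search_results "
    "(latitude REAL, longitude REAL, category TEXT, num_unique INTEGER)"
)
conn.executemany(
    "INSERT INTO search_results VALUES (?, ?, ?, ?)",
    [
        (37.8739, -122.431297, "cannabis_clinics", 21),
        (37.8739, -122.531297, "cannabis_clinics", 0),
        (37.6739, -122.531297, "cannabis_clinics", 17),
        (40.7128, -74.0060, "cannabis_clinics", 5),  # hypothetical far-away search
    ],
)


def sample_within(conn, coordinates, model_radius):
    """Return search metadata whose coordinates fall inside the model's box."""
    lat, lon = coordinates
    return conn.execute(
        "SELECT latitude, longitude, category, num_unique FROM search_results "
        "WHERE latitude BETWEEN ? AND ? AND longitude BETWEEN ? AND ?",
        (lat - model_radius, lat + model_radius,
         lon - model_radius, lon + model_radius),
    ).fetchall()


nearby = sample_within(conn, coordinates=(37.85, -122.4), model_radius=0.8)
```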