# Yelp Scraper Part I

Development notebook for business URI search and lens enhancement engine.

## Database Setup

Rather than trying to share state globally between modules, a database or a messaging/caching service will be used. For V0.1, an SQLite implementation is planned.

In [None]:
# Initialize Database
from db import get_db, get_session, init_db

init_db()  # Creates tables/relationships
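
The `db` module itself is not shown in this notebook. As a rough sketch of what an SQLite-backed `init_db` could look like, the following uses Python's standard `sqlite3` module. The function name `init_db_sketch` and the `categories`/`businesses` schema are hypothetical, chosen only for illustration; the project's real tables may differ.

```python
import sqlite3

def init_db_sketch(path=":memory:"):
    """Create illustrative tables; the schema is an assumption, not the project's."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS categories (
            id INTEGER PRIMARY KEY,
            alias TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS businesses (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE NOT NULL,
            category_id INTEGER REFERENCES categories(id)
        );
        """
    )
    conn.commit()
    return conn

conn = init_db_sketch()
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Using `CREATE TABLE IF NOT EXISTS` keeps the call safe to repeat, which suits a notebook where cells are often re-run.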

## Yelp Fusion Client

This module uses the yelpapi library to manage connections and requests to Yelp's Fusion API.

In [3]:
from scraper_1_urls import load_environment, get_client

load_environment(from_file=True)
client = get_client()

Environment variables set.


## Lens

Lens is a handler for simple perceptron networks whose goal is to predict (via forward passes) the expected number of businesses within an area.  The Yelp API returns a maximum of 50 businesses per search.  Lens helps tune the jump distance as the scraper adjusts longitude and latitude while searching.

Two approaches will be investigated:

1. Attempt to predict the number of businesses in a given area.

2. Attempt to predict the number of NEW businesses in a given area.

V0.1 will look at the first strategy.
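
To make the first strategy concrete, here is a minimal sketch of how a predicted business count could drive the jump distance. It assumes businesses are spread uniformly, so the count grows with the square of the distance. The function `next_jump_distance`, its parameters, and the distance bounds are hypothetical placeholders and are not part of Lens.

```python
import math

YELP_MAX_RESULTS = 50  # per-search cap mentioned above

def next_jump_distance(current_distance, predicted_count,
                       target=YELP_MAX_RESULTS,
                       min_distance=50.0, max_distance=40000.0):
    """Scale the jump distance so the expected count lands near `target`.

    Assumes uniform density: count is proportional to distance squared.
    The bounds are illustrative, not values taken from the project.
    """
    if predicted_count <= 0:
        # Nothing expected nearby, so take the largest allowed jump
        return max_distance
    scaled = current_distance * math.sqrt(target / predicted_count)
    return max(min_distance, min(max_distance, scaled))

print(next_jump_distance(1000.0, 200))   # dense area: 500.0
print(next_jump_distance(1000.0, 12.5))  # sparse area: 2000.0
```

A prediction above the cap shrinks the jump so that fewer results are truncated, while a prediction below it widens the jump to save requests.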

In [1]:
from lens import fastmap

### Create, train, and update a perceptron

The layer initialization step can be skipped by passing in a new y-vector (in our case, updated data from nearby zones).
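
The internals of `fastmap` are not shown here. For intuition only, the sketch below builds a single-layer perceptron in NumPy, trains it, and then "updates" it by continuing from the learned weights with a new y-vector instead of re-initializing. The class `TinyPerceptron`, its hyperparameters, and the `y_new` values are assumptions made for illustration, not the library's implementation.

```python
import numpy as np

class TinyPerceptron:
    """Single-layer perceptron with a sigmoid output (illustrative only)."""

    def __init__(self, n_features, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=n_features)
        self.b = 0.0

    def predict(self, X):
        # Forward pass: weighted sum squashed into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def train(self, X, y, epochs=2000, lr=0.05):
        # Plain gradient descent on the mean squared error
        for _ in range(epochs):
            p = self.predict(X)
            grad = (p - y) * p * (1.0 - p)
            self.w -= lr * X.T @ grad / len(y)
            self.b -= lr * grad.mean()
        return self

X = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=float)
y = np.array([0.5, 0.65, 1.0])

# Create and train from freshly initialized weights
model = TinyPerceptron(n_features=3).train(X, y)

# "Update": keep the learned weights and continue with a new y-vector
y_new = np.array([0.45, 0.7, 0.95])  # hypothetical refreshed targets
model.train(X, y_new, epochs=500)
print(model.predict(X))
```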

> V0.1 will not use update, as some type of copy_on_complete mechanism will be needed, along with more logic to ensure that the quality of, and relations within, y_train don't deviate significantly.  Otherwise, more training may have to be done to 'undo' bad weights.

In [2]:
# Test Data
import numpy as np
X=np.array([[0,1,2], [3,4,5], [6,7,8]])
y=np.array([0.5,0.65,1])

model_map = fastmap.ModelMap()

In [3]:
model_map.pin_model(X, y, (20,30))

{'geohash': 'set3f8vk6wjr', 'file_location': '/tmp/set3f8vk6wjr.pkl'}

**At this point,** the model is ready for use.  The returned information needs to be stored so predictions can be called when needed.
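
One simple way to keep the returned information is a small registry that maps each geohash to its model file. The sketch below persists that mapping as JSON. The helper `remember_model` and the registry file name are hypothetical; only the `geohash` and `file_location` keys come from the output above.

```python
import json
import os
import tempfile

def remember_model(registry_path, pin_info):
    """Record a pinned model's file location under its geohash."""
    registry = {}
    if os.path.exists(registry_path):
        with open(registry_path) as fh:
            registry = json.load(fh)
    registry[pin_info["geohash"]] = pin_info["file_location"]
    with open(registry_path, "w") as fh:
        json.dump(registry, fh, indent=2)
    return registry

# Value copied from the pin_model output above
pin_info = {"geohash": "set3f8vk6wjr", "file_location": "/tmp/set3f8vk6wjr.pkl"}

# Temporary location used only for this example
registry_path = os.path.join(tempfile.mkdtemp(), "model_registry.json")
registry = remember_model(registry_path, pin_info)
print(registry["set3f8vk6wjr"])
```

In the project itself the same mapping could instead live in the database described under Database Setup.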

## Get All Categories by Country

To help limit and organize the search, as well as provide better data to Lens as the search progresses, we will search by category in each direction (scraper path).

Yelp provides a JSON download of all its categories [here](https://www.yelp.com/developers/documentation/v3/all_category_list/categories.json).

In [1]:
# Load data
import json

with open('categories.json', 'r') as file:
    categories = json.load(file)

In [2]:
len(categories)

1565

In [22]:
# Get all parent categories
def get_countries(category):
    # Categories without a whitelist are treated as available everywhere
    if 'country_whitelist' in category:
        return category['country_whitelist']
    return ['ALL']

def get_parents(category):
    # Collect the parent aliases; with several parents, keep only string entries
    parent = category['parents']
    parents = []
    if len(parent) > 1:
        for item in parent:
            if isinstance(item, str):
                parents.append(item)
    elif len(parent) == 1:
        parents.append(parent[0])
    return parents

countries = list(map(get_countries, categories))
parents = list(map(get_parents, categories))
data = {'parent':parents, 'country':countries}

In [23]:
# Build a dataframe with parent data for starters
import pandas as pd

df = pd.DataFrame(data=data)
df.head(5)

|   | parent | country |
|---|---|---|
| 0 | [localservices] | [ALL] |
| 1 | [italian] | [IT] |
| 2 | [bars] | [CZ] |
| 3 | [food] | [ALL] |
| 4 | [fashion] | [ALL] |


In [24]:
# Explode each column
df = df.explode('parent')
df = df.explode('country')
display(df.head(), df.describe())

|   | parent | country |
|---|---|---|
| 0 | localservices | ALL |
| 1 | italian | IT |
| 2 | bars | CZ |
| 3 | food | ALL |
| 4 | fashion | ALL |


|   | parent | country |
|---|---|---|
| count | 4102 | 4127 |
| unique | 121 | 34 |
| top | restaurants | ALL |
| freq | 502 | 1084 |
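
The `parent` count is lower than the `country` count, most likely because `explode` turns an empty list into `NaN`, which `describe` does not count. The toy frame below (made-up rows, not Yelp data) is a small sketch of that behavior and of why a `dropna` step is useful afterwards.

```python
import pandas as pd

# Made-up rows: the second category has no parents (empty list)
toy = pd.DataFrame({
    "parent": [["food", "restaurants"], [], ["bars"]],
    "country": [["ALL"], ["ALL"], ["CZ", "PL"]],
})

# One row per (parent, country) pair; the empty list becomes NaN
exploded = toy.explode("parent").explode("country")
print(exploded)

# Dropping NaN removes the parentless category
cleaned = exploded.dropna()
print(len(exploded), len(cleaned))  # 5 4
```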


In [28]:
# Clean data
df = df.dropna()
df.describe()

|   | parent | country |
|---|---|---|
| count | 4102 | 4102 |
| unique | 121 | 34 |
| top | restaurants | ALL |
| freq | 502 | 1063 |


In [36]:
# Save Data
from awstools import s3

# Setup bucket link to AWS
bucket = s3.Bucket('yelp-data-shared-labs18')

# Save locally
filename = 'parent_categories_by_country.json'
df.to_json(filename, orient='records')

# Upload to S3 bucket
bucket.save(filename, 'Tables/'+filename)

parent_categories_by_country.json  157156 / 157156.0  (100.00%)