## Overview

Plotting prospective flyer placement opportunities.


#### Imports

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import json
import inspect
import hashlib
import functools
from pathlib import Path
import urllib.parse as uparse
from collections import Counter
import itertools

import requests
from pprint import pprint
import numpy as np
import pandas as pd
from pandas import json_normalize # top-level import; the pandas.io.json path is deprecated
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

import spacy
import gensim
import gensim.downloader
import inflect # https://github.com/jazzband/inflect
from fuzzywuzzy import fuzz

import googlemaps
import geopy
import geopy.distance as geodist

import folium
import folium.plugins
import plotly.express as px
import plotly.graph_objects as go

from bing_api import Bing
from here_api import Here
from _config import config
px.set_mapbox_access_token(config.MAPBOX_TOKEN)

## Data

The data comes from three main sources, but is best explained in stages.

#### Stage 0. Speculative list compilation
At this stage, a simple list of locations that may have public bulletin boards was compiled. Initially, mention frequency was going to be used to filter for the most likely candidates; however, it quickly became clear that most sites were reposting the same ideas from a [single source](http://www.encounter.org/files/2914/3881/8878/Top_20_places_to_find_bulletin_boards_in_a_typical_town.pdf). As such, the list was filtered down to unique values, and each entry was assigned a subjective rating of query difficulty. From there, non-rigorous experimentation narrowed the list further: each keyword was searched on [Google Maps](https://www.google.com/maps) and the consistency and quality of the results were observed.
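
As a minimal sketch of the mention-frequency idea described above, the snippet below counts how many sources mention each place after normalizing the names. The `sources` dictionary is hypothetical sample data, not the lists that were actually collected.

```python
from collections import Counter

# Hypothetical source lists; the real candidates came from the linked articles
sources = {
    "source_a": ["Libraries", "Coffee Shops", "Laundromats"],
    "source_b": ["libraries", "Coffee Shops", "Grocery Stores"],
    "source_c": ["Libraries ", "Laundromats", "Union Halls"],
}

def normalize(name):
    """Lowercase and trim so that repeated mentions collapse to one key."""
    return name.strip().lower()

# Count how many sources mention each place at least once
mentions = Counter(
    place
    for places in sources.values()
    for place in {normalize(p) for p in places}
)

# Unique candidates, most frequently mentioned first
unique_places = [place for place, _ in mentions.most_common()]
```

When most sources copy from one another, as was the case here, the counts carry little signal and only the de-duplicated list remains useful.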

#### Locations

<details>
<summary>Locations</summary>

1. Grocery Stores
2. Libraries
3. Gyms/Recreational Facilities
4. Churches 
5. Laundromats
6. Coffee Shops
7. Break Rooms/Waiting Rooms/Lunch Rooms
8. Factories
9. Community Centers
10. Union Halls
11. Beauty Salons
12. Bookstores
13. Restaurants/Bars
14. Convenience Stores
15. Smaller Shopping Centers
16. College and University Common Areas
17. Shops near universities and colleges (the more they target college students, the more likely it is that they will have a board or wall of some kind)
18. Music Stores
19. Apartment complexes
20. Pharmacies

National Chains that often have message boards:
* Qdoba
* Panera Bread
* Caribou Coffee
* Barnes & Noble
* Whole Foods
* Pot Belly Sandwich Shops
* Starbucks
* Jimmy Johns
* Hy-Vee


```
0               Grocery Store
1                     Library
2                         Gym
3       Recreational Facility
4                      Church
5                  Laundromat
6                 Coffee Shop
11           Community Center
12                 Union Hall
13               Beauty Salon
14                  Bookstore
15                 Restaurant
16                        Bar
17          Convenience Store
21                Music Store
22          Apartment complex
23                   Pharmacy
24                      Qdoba
25               Panera Bread
26             Caribou Coffee
27             Barnes & Noble
28                 Whole Food
29    Pot Belly Sandwich Shop
30                   Starbuck
31               Jimmy John's
32                     Hy-Vee
33                Post Office
36                  Town Hall
37                Barber Shop
38              Beauty Parlor
39                 Nail Salon
40            Ice Cream Stand
41                Supermarket
42                    College
43                       Mall
45                     Doctor
47                    Dentist
49                Gas Station
51           Auto Repair Shop
52            Day Care Center
53        Chamber of Commerce
54                       Bank
55               Credit Union
56             Hardware Store
57             Fitness Center
```

</details>

Sources: 

http://www.encounter.org/files/2914/3881/8878/Top_20_places_to_find_bulletin_boards_in_a_typical_town.pdf

https://www.psprint.com/resources/ultimate-list-of-places-to-distribute-your-flyers/

https://mikecooney.net/22-great-places-hang-flyers/

https://grab-its.com/tips.php

#### Globals

In [3]:
gmaps = googlemaps.Client(key=config.GMAPS_KEY, timeout=10, retry_over_query_limit=False)

In [4]:
RAD=40000 # 40000m ≈ 25 miles
METERMILE = 1609.344

In [5]:
# Subjective groupings, can be improved by instead grouping on co-occurrence frequency
place_groups = {
    'food':       ['grocery_or_supermarket','bakery','food','supermarket'],
    'dining_out': ['cafe','restaurant','meal_takeaway','meal_delivery'],#new
    'alcohol':    ['bar', 'night_club', 'liquor_store'], #new
    'park_camp':  ['park','campground','amusement_park'],#new
    'academic':   ['library','university','school', 'book_store'],
    'health':     ['gym', 'pharmacy', 'doctor', 'dentist', 'health'],
    'beauty':     ['beauty_salon','hair_care','spa'], #new
    'finance':    ['finance', 'bank', 'atm','accounting'],
    'automotive': ['gas_station', 'car_repair','car_wash','convenience_store','car_dealer'],
    'government': ['local_government_office','post_office','city_hall','courthouse'],
    'services':   ['general_contractor','laundry','real_estate_agency','florist'],
    'stores':     ['electronics_store','hardware_store','department_store','shopping_mall','store'],
    'religious':  ['church','place_of_worship'],#new
    'other':      ['point_of_interest','establishment','tourist_attraction']
}
pg_rev = {x:k for k,v in place_groups.items() for x in v}
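
The comment on `place_groups` suggests grouping on co-occurrence frequency instead of by hand. A rough sketch of the counting step is shown below; `places_types` is hypothetical sample data shaped like the Places API `types` field, and `cooccurrence_counts` is an illustrative helper rather than part of this notebook.

```python
from collections import Counter
from itertools import combinations

# Hypothetical `types` lists, shaped like the Places API `types` field
places_types = [
    ["cafe", "restaurant", "food"],
    ["bakery", "cafe", "food"],
    ["gas_station", "convenience_store"],
    ["convenience_store", "gas_station", "atm"],
]

def cooccurrence_counts(type_lists, ignore=("point_of_interest", "establishment")):
    """Count how often each unordered pair of types appears on the same place."""
    pairs = Counter()
    for types in type_lists:
        # Sort so that (a, b) and (b, a) are counted under the same key
        kept = sorted(set(types) - set(ignore))
        pairs.update(combinations(kept, 2))
    return pairs

pair_counts = cooccurrence_counts(places_types)
```

Pairs with high counts are candidates for sharing a group; turning the counts into actual groups would still need a clustering step or a threshold, which is not shown here.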

In [6]:
catcolico = [
    ('food','red','shopping-basket'),
    ('dining_out','orange','cutlery'),
    ('alcohol','purple','glass'),
    ('park_camp','darkgreen','tree'),
    ('academic','blue','graduation-cap'),
    ('health','lightred','heartbeat'),
    ('beauty','lightblue','scissors'),
    ('finance','green','dollar'),
    ('automotive','pink','car'),
    ('government','beige','gavel'),
    ('services','darkpurple','cubes'),
    ('stores','darkred','shopping-cart'),
    ('religious','lightgray','cloud'),
    ('community','cadetblue','universal-access'), # special; not in place groups
    ('other','gray','question')]

color_map = {cat:col for cat,col,_ in catcolico}
faicon_map = {cat:ico for cat,_,ico in catcolico}

In [None]:
# Removed icons: [,,,,,,,'bell',,'info-circle']
# Unused colors:  ['darkblue', ,'white','lightgreen','black'] 'cadetblue'
# group_glyicons=['cutlery','glass','education','heart-empty','usd','road','flag','bell','shopping-cart','info-sign']
# glyicon_map = {k:v for k,v in zip(place_groups.keys(),group_glyicons)}
# group_faicons=['cutlery','glass','graduation-cap','heartbeat','dollar','car','gavel','bell','shopping-cart','info-circle']# 

In [6]:
valid_colors = ['red', 'blue', 'green', 'purple', 'orange', 'darkred','lightred', 
                'beige', 'darkblue', 'darkgreen', 'cadetblue','darkpurple','white',
                'pink', 'lightblue', 'lightgreen','gray', 'black', 'lightgray']

In [7]:
all_types=['accounting', 'airport', 'amusement_park', 'aquarium',
       'art_gallery', 'atm', 'bakery', 'bank', 'bar', 'beauty_salon',
       'bicycle_store', 'book_store', 'bowling_alley', 'bus_station',
       'cafe', 'campground', 'car_dealer', 'car_rental', 'car_repair',
       'car_wash', 'casino', 'cemetery', 'church', 'city_hall',
       'clothing_store', 'convenience_store', 'courthouse', 'dentist',
       'department_store', 'doctor', 'drugstore', 'electrician',
       'electronics_store', 'embassy', 'fire_station', 'florist',
       'funeral_home', 'furniture_store', 'gas_station',
       'grocery_or_supermarket', 'gym', 'hair_care', 'hardware_store',
       'hindu_temple', 'home_goods_store', 'hospital', 'insurance_agency',
       'jewelry_store', 'laundry', 'lawyer', 'library',
       'light_rail_station', 'liquor_store', 'local_government_office',
       'locksmith', 'lodging', 'meal_delivery', 'meal_takeaway', 'mosque',
       'movie_rental', 'movie_theater', 'moving_company', 'museum',
       'night_club', 'painter', 'park', 'parking', 'pet_store',
       'pharmacy', 'physiotherapist', 'plumber', 'police', 'post_office',
       'primary_school', 'real_estate_agency', 'restaurant',
       'roofing_contractor', 'rv_park', 'school', 'secondary_school',
       'shoe_store', 'shopping_mall', 'spa', 'stadium', 'storage',
       'store', 'subway_station', 'supermarket', 'synagogue',
       'taxi_stand', 'tourist_attraction', 'train_station',
       'transit_station', 'travel_agency', 'university',
       'veterinary_care', 'zoo']

keeptypes = ['book_store','bus_station','city_hall','convenience_store','gas_station',
             'laundry','library', 'local_government_office', 'secondary_school','university']

Available icons:

https://maxcdn.bootstrapcdn.com/font-awesome/4.6.3/css/font-awesome.min.css

https://fontawesome.com/v4.7.0/icons/

#### IO Functions

In [7]:
def keyword_path(keyword, fpath=None):
    # make keyword filename safe
    kw = re.sub(r"[\.\-';:&]",'',keyword).lower().replace(' ','_') 
    # fall back to the working directory; Path(None) raises a TypeError
    file = (Path(fpath or '.')/kw).with_suffix('.json')
    return file

def validate_open(fname, fpath=None, no_overwrite=True):
    file = keyword_path(fname, fpath)
    if file.exists() and no_overwrite:
        raise FileExistsError(f'{file.as_posix()} exists, overwrite not permitted')
    return open(file, 'w', encoding='utf-8')

def read_json(fname, fpath=None):
    file = keyword_path(fname, fpath)
    # No safe handling for missing files
    with open(file, 'r', encoding='utf-8') as f:
        return json.load(f)
    
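def concat_jsons(dirpath, outname):
    dirpath = Path(dirpath)
    outfile = dirpath/outname
    # skip a previous output file so it is not folded into itself on re-runs
    files = [f for f in dirpath.iterdir() if f != outfile]
    jlist = []
    for f in files:
        with f.open('r') as fp:
            jsf = json.load(fp)
            jsf['file_stem'] = f.stem
            jlist.append(jsf)
    # context manager ensures the output is flushed and closed before returning
    with outfile.open('w') as fp:
        json.dump(jlist, fp)
    return jlist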

In [8]:
def try_jsave(saveobj, fpath):
    try:
        if saveobj and fpath is not None:
            with open(fpath,'w', encoding='utf-8') as fp: 
                json.dump(saveobj, fp, ensure_ascii=False)
    except Exception as e:
        print('ERROR:', e, e.args)
    finally:
        return saveobj

#### Utility Functions

In [9]:
HASHLOG=Path(config.paths.HASHLOG)

In [10]:
def hsfn(fn,*args,**kwargs):
    """Hash a function/parameters combination and call `fn` only if the hash is not yet logged.
    
    Returns None when the call is skipped or when `dryrun=True` (hash is logged, `fn` is not called)."""
    hlog = HASHLOG.read_text()
    dryrun = kwargs.pop('dryrun',False)
    bound = inspect.signature(fn).bind(*args,**kwargs)
    storeargs = {'module':fn.__module__,'name':fn.__name__,**bound.arguments}
    jstr = json.dumps(storeargs,default=str)
    
    s256 = hashlib.sha256(jstr.encode())
    hsh = s256.hexdigest()+'\n'

    if hsh not in hlog:
        # context manager closes the log so the hash is flushed to disk
        with HASHLOG.open(mode='a') as fp:
            fp.write(hsh)
        if not dryrun:
            return fn(*args, **kwargs)
    else:
        print(f'Skipping... hashed fncall found ({hsh[:7]}...)')
    return None

Several things can go wrong with the hashing function:
1. It depends on EXACT parameter matches; any difference produces an entirely different hash value.
    * If the 6th decimal of a longitude/latitude value is off, the result is a new hash.
    * The same is true for any change to the radius, despite the results undoubtedly overlapping heavily.
    * Inconsistent capitalization of keywords also causes duplication.
    * The order in which keywords are passed also matters. This can be fixed with a sort, however.
    * If 98 out of 99 destination distances overlap but 1 differs, the result is a new entry.
2. There is no 'closeness' metric: a call to `gmaps.places_nearby` with a small parameter change is treated as no less different than a call to `gmaps.distance_matrix` or even `np.random.randint`. This could be partially alleviated by using different storage files for different functions.
3. It can't account for functionally equivalent calls such as `geopy.GoogleV3.reverse` and `gmaps.reverse_geocode`.
4. If the log file is corrupted or lost, all calls will need to be 'dry run' again with identical params to repopulate it.

Problem 1 could potentially be mitigated somewhat by comparing the Levenshtein distance of the param strings prior to encoding.

A far more robust, maintainable, and likely performant solution would be to simply use a MongoDB server and pymongo.
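
Short of moving to a database, several of the exact-match problems listed above can be reduced by normalizing parameters before hashing. The following is a minimal sketch under stated assumptions: `normalized_call_hash`, `coord_digits`, and `radius_step` are hypothetical names and defaults chosen for illustration, and the coordinates are made up.

```python
import hashlib
import json

def normalized_call_hash(fn_name, coords, keywords, radius,
                         coord_digits=4, radius_step=1000):
    """Hash a call after coarsening the parameters that cause near-duplicates."""
    payload = {
        "name": fn_name,
        # Round coordinates so tiny differences map to the same key
        "coords": [round(float(c), coord_digits) for c in coords],
        # Lowercase, strip, de-duplicate and sort so case and order do not matter
        "keywords": sorted({k.strip().lower() for k in keywords}),
        # Snap the radius to the nearest multiple of `radius_step` meters
        "radius": int(round(radius / radius_step) * radius_step),
    }
    jstr = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(jstr.encode()).hexdigest()

# Two calls that differ only in noise now produce the same hash
a = normalized_call_hash("places_nearby", (44.012345, -92.480012), ["Library", "cafe"], 25000)
b = normalized_call_hash("places_nearby", (44.012349, -92.480013), ["cafe", "library "], 25100)
```

The trade-off is that coarser keys can also cause genuinely different calls to be skipped, so the rounding levels would need to be chosen with the search radius in mind.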

In [70]:
def hashstore(func):
    """Decorator form of `hsfn`. Works for top-level functions, otherwise not useful"""
    @functools.wraps(func)
    def wrapper_hashstore(*args, **kwargs):
        # `hsfn` returns the call result (not the hash) and already handles the
        # log lookup and write, so delegate to it instead of repeating the check
        return hsfn(func, *args, **kwargs)
    return wrapper_hashstore

In [11]:
def partition_groups(iterable, n_groups=None, max_group_size=None):
    assert (n_groups is None) != (max_group_size is None), 'exactly one of `n_groups` or `max_group_size` must be provided'
    # np.ceil returns a float; cast so the section count is an integer
    n_groups = n_groups if n_groups is not None else int(np.ceil(len(iterable)/max_group_size))
    return np.array_split(iterable, n_groups)

#### Maps Api Functions

Searching for a single place by using the ID allows you to choose which [fields](https://developers.google.com/places/web-service/details#fields) are returned and costs less. This uses `gmaps.place()` if searching by id, or `gmaps.find_place()` if finding by text query.

https://developers.google.com/places/web-service/search#FindPlaceRequests

https://cloud.google.com/maps-platform/pricing/sheet/

https://developers.google.com/maps/billing/gmp-billing#data-skus

https://developers.google.com/maps/documentation/distance-matrix/intro#DistanceMatrixResponses
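
The two-step lookup mentioned above (text query to place ID, then details restricted to chosen fields) can be sketched as follows. `find_then_detail` is a hypothetical helper, and `StubClient` is a stand-in that only mimics the response shape of `googlemaps.Client` so the example runs without an API key; with a real client, the same two methods would be called.

```python
def find_then_detail(client, query, coords, radius=20, fields=("name", "place_id")):
    """Resolve a text query to a place_id, then request only the chosen fields."""
    bias = "circle:{}@{},{}".format(radius, *coords)
    found = client.find_place(
        query, input_type="textquery", fields=["place_id"], location_bias=bias
    )
    candidates = found.get("candidates", [])
    if not candidates:
        return None
    # Restricting `fields` keeps the details response (and its billing) small
    return client.place(candidates[0]["place_id"], fields=list(fields))


class StubClient:
    """Hypothetical stand-in for `googlemaps.Client` so the sketch runs offline."""

    def find_place(self, input, input_type, fields=None, location_bias=None):
        return {"candidates": [{"place_id": "stub-id"}], "status": "OK"}

    def place(self, place_id, fields=None):
        return {"result": {"place_id": place_id, "name": "Stub Library"}, "status": "OK"}


details = find_then_detail(StubClient(), "library", (44.0, -92.5))
```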

Based on the [gmaps billing calculator](https://mapsplatformtransition.withgoogle.com/calculator), place details is the most expensive API call for basic tier responses, costing ~$17/1000 requests. It provides little benefit over a broad area search, and should be avoided when possible.
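
To make the cost concrete, here is a back-of-the-envelope estimate using the ~$17 per 1000 requests figure quoted above. The number of keywords is a hypothetical example, and the rate is treated as flat, which ignores any free tier or volume discounts.

```python
def estimate_cost(n_requests, usd_per_1000=17.0):
    """Rough cost in USD for `n_requests` at a flat per-1000 rate."""
    return n_requests * usd_per_1000 / 1000

# Hypothetical: detailing every result of 40 keyword searches,
# each returning up to 20 places
n_detail_calls = 40 * 20
cost = estimate_cost(n_detail_calls)
```

Under these assumptions a single sweep of detail lookups costs more than ten dollars, which is why the nearby-search results are preferred wherever they already contain the needed fields.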

In [65]:
def ggeocode(ggeo_func, location, fpath=None, **kwargs):
    """Geocode with `ggeo_func`: `location` is (lat,long) for reverse geocoding, otherwise a human readable address"""
    resp = hsfn(ggeo_func, location, exactly_one=True, **kwargs)

    return resp

In [13]:
def gplaceid(query, coords, fpath=None, radius=20, store_hash=False, **kwargs):
    """Find nearest place_id that matches a given query"""
    lb = 'circle:{}@{},{}'.format(radius,*coords)
    if store_hash:
        resp = hsfn(gmaps.find_place, query, input_type='textquery', location_bias=lb, language='en', **kwargs)
    else:
        resp = gmaps.find_place(query, input_type='textquery', location_bias=lb, language='en', **kwargs)
    return try_jsave(resp,fpath)

In [14]:
# TODO: make a list of place_ids that have already been detailed
# skip the place lookup if information already is available
# fill in from the database with a notification that it was pulled locally
# gplace_details_df
def gpdetails_df(df, query_col, coord_col, fpath=None, radius=20, keep_fields=None, id_only=True,**kwargs):
    """Returns details or place ids of the nearest place fitting the given criteria
    
    Parameters
    ----------
    df : Pandas.DataFrame, 
        DataFrame containing information to pass to the gmaps find place API
    query_col : str, 
        Column in `df` containing query strings
    coord_col : str, 
        Column in `df` containing [latitude,longitude] coordinates
    fpath : str, (default: None)
        Path to save the json API responses
    radius : int, (default: 20)
        Search radius (meters) used to find the nearest matching result
    keep_fields : list[str], (default: ['formatted_address','geometry', 'icon', 'name',
        'permanently_closed', 'place_id', 'plus_code', 'type', 'url', 'vicinity'])
        Fields to keep from the place details response
    id_only : boolean, (default: True)
        If true, only place_ids will be returned, which is a non-billed operation
        
    Returns
    -------
    List of json responses or place_ids
    """
    dfm = df[[query_col,coord_col]]
    places = [gplaceid(q, co, fpath=fpath, radius=radius) for q,co in dfm.values]
    if id_only:
        return places
    else: # explicit else to emphasize if and only if 
        if fpath is None:
            raise IOError('Expensive query. Provide `fpath` before proceeding')
        if keep_fields is None:
            #'adr_address','address_component','photo','utc_offset'
            keep_fields = ['formatted_address', 'geometry', 'icon', 'name', 'permanently_closed', 
                           'place_id', 'plus_code', 'type', 'url', 'vicinity']

        results = []
        for pid in places:
            first_pid = pid['candidates'][0]['place_id']
            resp = hsfn(gmaps.place, place_id=first_pid, fields=keep_fields, language='en')
            if resp is not None:
                results.append(resp)

        return try_jsave(results,fpath)

In [15]:
# gdist_save
def gdistances(destinations, origin, fpath, max_ele=100, **kwargs):
    """Calculate distance between up to 99 destinations from a single origin point"""
    dest_parts = partition_groups(destinations, max_group_size=(max_ele-1)) # subtract 1 for Origin
    origin = tuple(origin)
    results = []
    for part in dest_parts:
        resp = hsfn(gmaps.distance_matrix, origins=origin, destinations=part, units='imperial', language='en', **kwargs)
        if resp is not None:
            results.append(resp)

    return try_jsave(results,fpath)

In [16]:
def gdistances_df(df, dests_col, origins_col, fpath, max_ele=100, **kwargs):
    ori_dests = df.groupby(df[origins_col].astype(str))[dests_col].apply(list)
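    # `gdistances` requires `fpath`; pass None so only the combined result is saved below
    results = [gdistances(ds, destring_list(o,' '), fpath=None, max_ele=max_ele, **kwargs) for o,ds in ori_dests.items()]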
    return try_jsave(results,fpath)

In [17]:
# gquery_save
def gsearchNB(query, coords, fpath=None, radius=40000, **kwargs):
    """Search for places near a location based on a text query. 
    Finds a maximum of 20 results per query,coords pair"""
    coords = tuple(coords)
    resp = hsfn(gmaps.places_nearby, location=coords, radius=radius, keyword=query, language='en', **kwargs)
    if resp is not None:
        resp.update({'query':query,'origin':coords,'radius':radius})
    return try_jsave(resp,fpath)

In [18]:
# gquery_df_save
def gsearchNB_df(df, query_col, coord_col, fpath, radius=40000, **kwargs):
    """Search for places near locations based on text queries, using values from a DataFrame
    
    Parameters
    ----------
    df : Pandas.DataFrame, 
        DataFrame containing information to pass to the gmaps places nearby API
    query_col : str, 
        Column in `df` containing query strings
    coord_col : str, 
        Column in `df` containing [latitude,longitude] coordinates
    fpath : str
        Path to save the json API responses
    radius : int, (default: 40000)
        Search radius (meters) around each coordinate pair
    """
    dfm = df[[query_col,coord_col]]
    results = []
    for query,coords in dfm.values:
        coords = tuple(coords) # cast as tuple in case it is np array
        resp = hsfn(gmaps.places_nearby, location=coords, radius=radius, keyword=query, language='en', **kwargs)
        if resp is not None:
            resp.update({'query':query,'origin':coords,'radius':radius})
            results.append(resp)
            
    return try_jsave(results,fpath)

In [289]:
def extract_query_fields(query, fpath=None):
    response = read_json(query, fpath=fpath)
    places = response['results']
    for place in places:
        yield {
            'place_name': place['name'],
            'latlong': tuple(place['geometry']['location'].values()),
            'vicinity': place['vicinity'],
            'keyword': query,
            'types': place['types'],
            # rating fields may be absent for places without reviews
            'rating': place.get('rating'),
            'n_ratings': place.get('user_ratings_total'),
            'place_id': place['place_id']
        }

In [295]:
def extract_dist_fields(fname, fpath=None):
    gdist_dict = read_json(fname,fpath)
    for dest_addr, dist_data in zip(gdist_dict['destination_addresses'], gdist_dict['rows'][0]['elements']):
        yield {
            'dest_addr':dest_addr,
            'meters': dist_data['distance']['value'],
            'seconds': dist_data['duration']['value']
            #'miles': dist_data['distance']['text'],
            #'minutes': dist_data['duration']['text'],
        }        

In [81]:
def extract_detail_fields(pdetails_list):
    for pdt in pdetails_list:
        yield {
            'place_name': pdt['name'],
            # str.strip(', USA') removes any of those characters from both ends,
            # not the suffix, so drop the trailing country with a regex instead
            'address': re.sub(r',\s*USA$', '', pdt['formatted_address']),
            'latlng': np.round([*pdt['geometry']['location'].values()],6),
            'types': pdt['types'],
            'icon': pdt['icon'].split('/')[-1], # PREFIX: https://maps.gstatic.com/mapfiles/place_api/icons/
            'global_pcode': pdt['plus_code']['global_code'],
            'place_id': pdt['place_id'],
            'cid': pdt['url'].split('=')[-1], # PREFIX: https://maps.google.com/?cid=
        }

## Data Preprocessing

Prepare manually collected list of potential flyer locations.

In [19]:
def gkgsearch(query,limit=10,indent=True,apikey=None):
    b_url = 'https://kgsearch.googleapis.com'
    pth = '/v1/'
    rsc = 'entities:search'
    url_path = b_url+pth+rsc
    params = {
        'query': query,
        'limit': limit,
        'indent': indent,
        'key': apikey,
    }
    resp = requests.get(url_path, params=params)
    jresp = resp.json()
    return jresp.get('itemListElement')

In [18]:
locs_df = pd.read_csv('data/csv/flyerlocs.csv').drop_duplicates('Places')
locs_df.head()

|  | Places | HardQuery | Comments |
|---|---|---|---|
| 0 | Grocery Stores |  |  |
| 1 | Libraries |  |  |
| 2 | Gyms |  |  |
| 3 | Recreational Facilities |  |  |
| 4 | Churches |  |  |


In [20]:
def build_kg(queries, lim=1, key=None):
    df_kg = pd.concat([json_normalize(gkgsearch(q, limit=lim, apikey=key),sep='_').assign(keyword=q) 
        for q in queries], sort=False).reset_index(drop=True)

    # Clean names and drop unused columns
    df_kg = (df_kg.drop(columns=[
        '@type','result_image_contentUrl','result_image_url','result_detailedDescription_license'
    ]).rename(lambda x: re.sub(r'result_?|@|etailedD|article','',x).lower(), axis=1))
    # Re-order columns
    return df_kg[['score','keyword','name','description','type','description_body','url','description_url','id']]

In [20]:
kg_df = build_kg(locs_df.Places, key=config.GKG_KEY); kg_df.head()

|  | score | keyword | name | description | type | description_body | url | description_url | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 29.00992 | Grocery Stores | Walmart | Retail company | [Organization, Thing, Corporation] | Walmart Inc. is an American multinational reta... | http://www.walmart.com/ | https://en.wikipedia.org/wiki/Walmart | kg:/m/0841v |
| 1 | 39.676849 | Libraries | Greenwood Publishing Group | Publishing company | [Organization, Thing, Corporation] | ABC-CLIO/Greenwood is an educational and acade... |  | https://en.wikipedia.org/wiki/Greenwood_Publis... | kg:/m/03npz5m |
| 2 | 1151.340088 | Gyms | ClassPass | Company | [Thing, Corporation, Organization] | ClassPass Inc. an American company which provi... |  | https://en.wikipedia.org/wiki/ClassPass | kg:/m/0130gnq4 |
| 3 | 1.534503 | Recreational Facilities | Bombardier Recreational Products | Manufacturing company | [Corporation, Thing, Organization] | BRP Inc. is a Canadian company making various ... | http://www.brp.com | https://en.wikipedia.org/wiki/Bombardier_Recre... | kg:/m/0343nb |
| 4 | 6067.5 | Coffee Shops | Tully's Coffee | Retail chain | [Corporation, Organization, Thing, Restaurant] | Tully's Coffee is an American specialty coffee... | http://tullyscoffeeshops.com/ | https://en.wikipedia.org/wiki/Tully's_Coffee | kg:/m/032716 |


In [21]:
def find_named_ents(df, min_score=5000, min_kw_ratio=85, desc_kw='company', types_filter=None):
    """Apply multiple masks to attempt to isolate named entities"""
    types_filter = set(types_filter) if types_filter is not None else set(['Organization', 'Corporation'])
    mask_score = df['score'] > min_score
    mask_kw = [fuzz.UWRatio(x,y) > min_kw_ratio for x,y in df[['keyword','name']].values]
    mask_desc = df['description'].str.contains(desc_kw,case=False)
    mask_types = df['type'].apply(lambda x: types_filter.issubset(x))
    
    return df[(mask_score & mask_kw & mask_desc & mask_types)]

In [22]:
find_named_ents(kg_df,min_kw_ratio=85)

|  | score | keyword | name | description | type | description_body | url | description_url | id |
|---|---|---|---|---|---|---|---|---|---|
| 18 | 6114.40918 | Qdoba | Qdoba | Restaurant company | [Thing, Organization, Corporation, Restaurant] | Qdoba Mexican Eats\nis a chain of fast casual ... | http://www.qdoba.com | https://en.wikipedia.org/wiki/Qdoba | kg:/m/05rqh3 |
| 19 | 55637.910156 | Panera Bread | Panera Bread | Bakery company | [Restaurant, Thing, Corporation, Organization] | Panera Bread Company is an American chain stor... | http://www.panerabread.com | https://en.wikipedia.org/wiki/Panera_Bread | kg:/m/03pk18 |
| 20 | 10366.417969 | Caribou Coffee | Caribou Coffee | Company | [Organization, Thing, Corporation, Restaurant] | Caribou Coffee Company is an American coffee c... |  | https://en.wikipedia.org/wiki/Caribou_Coffee | kg:/m/03p1q_2 |
| 21 | 94895.757812 | Barnes & Noble | Barnes &amp; Noble | Retail outlet company | [Corporation, Organization, Thing] | Barnes &amp; Noble, Inc., is an American books... | http://www.barnesandnobleinc.com/ | https://en.wikipedia.org/wiki/Barnes_%26_Noble | kg:/m/01b7dt |
| 22 | 28937.314453 | Whole Foods | Whole Foods Market | Supermarket company | [Organization, LocalBusiness, Corporation, Thing] | Whole Foods Market Inc. is an American multina... |  | https://en.wikipedia.org/wiki/Whole_Foods_Market | kg:/m/02xf2l |
| 23 | 29544.181641 | Potbelly Sandwich Shop | Potbelly Sandwich Shop | Restaurant company | [Restaurant, Thing, Corporation, Organization] | Potbelly Corporation is a publicly traded rest... | http://www.potbelly.com/ | https://en.wikipedia.org/wiki/Potbelly_Sandwic... | kg:/m/05g_n0 |
| 24 | 25809.320312 | Starbucks | Starbucks | Coffee company | [Organization, Thing, Corporation, Restaurant] | Starbucks Corporation is an American coffee co... | http://www.starbucks.com | https://en.wikipedia.org/wiki/Starbucks | kg:/m/018c_r |
| 25 | 52278.152344 | Jimmy John's | Jimmy John's | Fast food restaurant company | [Thing, Organization, Restaurant, Corporation] | Jimmy John's Franchise, LLC is an American fra... | http://www.jimmyjohns.com | https://en.wikipedia.org/wiki/Jimmy_John's | kg:/m/05pqt1 |
| 26 | 18910.541016 | Hy-Vee | Hy-Vee | Supermarket company | [Organization, Thing, Corporation] | Hy-Vee is a chain of more than 245 supermarket... | http://www.hy-vee.com | https://en.wikipedia.org/wiki/Hy-Vee | kg:/m/02vg5b |


In [22]:
def proc_placelist(df_locs, outfile=None, ents=None):
    df = df_locs.copy()
    df['hard_query'] = df['HardQuery'].notna()
    df['named_ent'] = df['Places'].isin(ents)
    # Convert non-entities to singular form; plural form alters query results
    ie = inflect.engine()
    df['singular'] = df.apply(lambda x: ie.singular_noun(x['Places']) if not x['named_ent'] else x['Places'],axis=1) 
    df = df[['Places','singular','named_ent','hard_query']]
    df['singular'] = df['singular'].where(df['singular'] != False, df['Places'])
    df.loc[:,['Places','singular']] = df[['Places','singular']].applymap(str.strip)

    if outfile is not None:
        df.to_csv(outfile,index=False)
    return df

In [24]:
locs_df = proc_placelist(locs_df,'data/csv/location_list.csv',find_named_ents(kg_df).keyword)

In [25]:
locs_df = pd.read_csv('data/csv/location_list.csv')
qlist = locs_df['singular'].where(~(locs_df['hard_query'])).dropna().drop_duplicates() # Omit awkwardly phrased queries

### Initial Query
This is the first call to any API. It begins gathering results for every item in the list of potential establishments.

In [23]:
def update_groups(df, groupmap, types_col='types',name_col='place_name'):
    first_types = df[types_col].str[0]
    uncat_types = first_types.isin(['point_of_interest','establishment'])
    # Recreation/Community Centers
    mask_recr = uncat_types & df[name_col].str.contains(r'rec|community',case=False)
    # Chambers of Commerce
    mask_cmbr = uncat_types & df[name_col].str.contains(r'chamber',case=False)
    place_groups = first_types.map(groupmap).mask(mask_recr, 'community').mask(mask_cmbr,'government')
    return place_groups

In [24]:
def destring_list(liststr,resep=r',\s*'):
    # 6 digits of precision is accurate to ~11cm, anything further is likely noise
    # https://gis.stackexchange.com/questions/8650/measuring-accuracy-of-latitude-and-longitude
    # builtin float replaces the np.float alias, which was removed in NumPy 1.24
    return np.round([*map(float,re.split(resep,liststr.strip('[( )]')))],6)
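
The ~11cm figure in the comment above can be checked with a short calculation. This is a sketch assuming a spherical Earth with a mean radius of 6,371 km; `degree_precision_m` is a hypothetical helper and the 45° latitude is just an example value.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, a spherical approximation

def degree_precision_m(decimals, latitude_deg=0.0):
    """Approximate ground distance covered by one unit in the last decimal place."""
    step_deg = 10 ** -decimals
    meters_per_deg_lat = 2 * math.pi * EARTH_RADIUS_M / 360
    north_south = step_deg * meters_per_deg_lat
    # East-west spacing shrinks with the cosine of the latitude
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ns, ew = degree_precision_m(6, latitude_deg=45.0)
```

One degree of latitude spans roughly 111 km, so the sixth decimal place corresponds to about 0.11 m north-south, and less than that east-west away from the equator.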

In [25]:
def drop_group_rename(proc_df, groupmap=None):
    df = proc_df.copy()
    start_drop = ['icon','id','photos','reference','scope','compound_code','url']
    rename_map = {'query':'keyword', 'ratings_total':'n_ratings', 'name':'place_name', 'formatted_address':'dest_addr'}
    
    df.columns = df.columns.str.split('_').str[-2:].str.join('_').str.replace('location_','').str.strip()
    df = df.drop(columns=start_drop,errors='ignore').rename(columns=rename_map)
    
    if groupmap is not None:
        df['place_group'] = update_groups(df, groupmap, types_col='types', name_col='place_name')
    df['latlong'] = df.apply(lambda x: np.round([x['lat'],x['lng']],6), axis=1)
    
    end_drop = ['lat','lng',*df.filter(like='st_l')]
    df = df.drop(columns=end_drop, errors='ignore')
    
    return df

In [131]:
def dedupe_placeid(df, masterdf_path='data/pdflyers_df.pkl'):
    """Filter duplicate place_ids to reduce API cost"""
    df_pdf = pd.read_pickle(masterdf_path)
    s0 = df.shape[0]
    s1 = df.drop_duplicates('place_id').shape[0]
    s2 = df[~df['place_id'].isin(df_pdf['place_id'])].shape[0]
    dfu = df[~df['place_id'].isin(df_pdf['place_id'])].drop_duplicates('place_id')
    s3 = dfu.shape[0]
    print('Starting samples:',s0)
    print(f'Duplicates: (Internal: {s0-s1}, External: {s0-s2})')
    print('Remaining unique entries:',s3)
    print('Yield: {:0.2%}, (Drop rate: {:0.2%})'.format((s3/s0),(s0-s3)/s0))
    return dfu

In [28]:
snb_w25k = [gsearchNB(q,config.geo.CENTERW,radius=25000) for q in qlist.str.lower()]
json.dump(snb_w25k,open('data/json/searchNB_W_allKW_r25k.json','w',encoding='utf-8'))

In [21]:
snb_w25k = json.load(open('data/json/searchNB_W_allKW_r25k.json','r',encoding='utf-8'))

In [26]:
def proc_searchNB(fpath):
    with open(fpath,'r',encoding='utf-8') as fp:
        snb = json.load(fp)
    
    df = pd.concat([json_normalize(res,['results'], sep='_').assign(
            **{x:str(res[x]) for x in ['query','origin','radius']}) for res in snb],sort=False)
    
    df = drop_group_rename(df,pg_rev)
    df['origin'] = df.origin.apply(destring_list)
    
    # drop 'open_now' and rearrange columns
    df = df[['place_name', 'latlong', 'vicinity', 'keyword', 'place_group','types',
       'rating', 'n_ratings', 'place_id','global_code', 'price_level', 'origin']]
    
    return df

In [189]:
snb_df = proc_searchNB('data/json/searchNB_W_allKW_r25k.json')
snb_df = dedupe_placeid(snb_df)
#snb_df['place_group'] = update_groups(snb_df,pg_rev)

In [66]:
dists_w25k = gdistances(snb_df.latlong, config.geo.CENTERW, 'data/json/distances_W_allKW_r25k.json')

In [27]:
def add_distances(df_base,dists_fpath):
    """Routing distance is calculated CENTERW -> location, geodesic distance is query origin -> location"""
    df = df_base.copy()
    with open(dists_fpath,'r',encoding='utf-8') as fp:
        distslist = json.load(fp)
    
    df['geodesic_m'] = df.apply(lambda x: geodist.geodesic(x.latlong,x.origin).meters, axis=1)
    
    dists_df = pd.DataFrame([{'dest_addr':y,'travel_m':d['distance']['value'],'travel_secs':d['duration']['value']} 
             for x in distslist for y,d in zip(x['destination_addresses'],x['rows'][0]['elements'])],df.index)
    
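    # regex=True is required explicitly on newer pandas, where str.replace defaults to literal matching
    dists_df['dest_addr'] = dists_df['dest_addr'].str.replace(', ?USA|United States','',regex=True)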
    df = pd.concat([df,dists_df],sort=False,axis=1)
    return df
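
To illustrate the difference between the two distances that `add_distances` stores, the sketch below compares a straight-line distance with a routed one. It uses a haversine great-circle formula from the standard library as a stand-in for `geopy`'s geodesic (which works on an ellipsoid and gives slightly different values); the coordinates and the routed distance are hypothetical.

```python
import math

def haversine_m(p1, p2, radius_m=6_371_000):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

# Hypothetical coordinates, one degree of latitude apart
origin = (44.0, -92.5)
place = (45.0, -92.5)
straight_line = haversine_m(origin, place)

# Hypothetical routed distance for the same pair of points
travel_m = 130_000
detour = travel_m / straight_line
```

A ratio like `detour` is only meaningful when both distances share the same origin; as the docstring notes, the routed distance here starts from `CENTERW` while the geodesic distance starts from the query origin.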

In [28]:
snb_df = add_distances(snb_df,'data/json/distances_W_allKW_r25k.json')

In [196]:
pd.concat([pd.read_pickle('data/pdflyers_df.pkl'), snb_df], sort=False).to_pickle('data/pdflyers_df.pkl')

In [197]:
df_pdf = pd.read_pickle('data/pdflyers_df.pkl')

In [28]:
def gpcode(coords):
    """Long runtime when API key is not provided"""
    url_path='https://plus.codes/api?'
    params = {'address':'{},{}'.format(*coords)}
    resp = requests.get(url_path, params)
    return resp.json()

In [None]:
globcodes = df_pdf.latlong.apply(gpcode)
df_pdf['global_code'] = df_pdf.global_code.fillna(globcodes.apply(lambda x: x['plus_code']['global_code']))

In [216]:
df_pdf.to_pickle('data/pdflyers_df.pkl')

### Placed Flyers

Start with the raw export from the Map Markers app.

In [29]:
def proc_visited_locs(flyer_fpath, posdf_path='data/open_pos_df.pkl'):
    """Parse a Map Markers CSV export and attach the nearest open position to each marker."""
    df = pd.read_csv(flyer_fpath)

    # ex. Marker 1-gas station-Holiday -> keyword:gas station, place_name:Holiday
    df[['keyword','place_name']] = df.apply(lambda x: x.Title.split('-'), result_type='expand',axis=1).iloc[:,1:]

    # append quantity where missing
    df['Description'] = df['Description'].apply(lambda x: x+'-1' if x=='Yes' else (x+'-0' if x=='No' else x))
    df['n_flyers'] = df['Description'].str.split('-').str[1].astype(int)
    
    df['latlong'] = df.apply(lambda x: np.round([x.Latitude,x.Longitude],6),axis=1)
    
    # Correct datetime for previously placed flyers
    fill_dts = pd.Series(['10/16/2019 11:46','10/16/2019 14:23','10/19/2019 18:12'])
    df['Timestamp'] = df['Timestamp'].mask(lambda x: x.str.split().str[0] == '11/10/2019',fill_dts).pipe(pd.to_datetime)

    # reorder and select desired columns
    df = df[['Timestamp','place_name','keyword','n_flyers', 'latlong']]
    
    df['geodesic_m'] = df['latlong'].apply(lambda x: geodist.geodesic(config.geo.CENTERW,x).meters)
    
    # filter open positions down to 100k meters from shop
    df_nbopos = pd.read_pickle(posdf_path).query('geodesic_m < 100000')
    
    mindex = df['latlong'].apply(lambda x: np.argmin([geodist.geodesic(x, v).meters for v in df_nbopos['latlong']]))
    df[['nearest_position','nearest_position_mi']] = pd.DataFrame(
        [(f"{s.Location_Host}: {s.Position}",geodist.geodesic(l,s['latlong']).miles) 
         for l,(_,s) in zip(df['latlong'],df_nbopos.iloc[mindex].iterrows())])
    
    return df
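
The marker parsing in `proc_visited_locs` relies on two conventions: titles look like `<label>-<keyword>-<place name>`, and descriptions are `Yes`, `No`, or `Yes-<count>`. The following is a minimal standalone sketch of that parsing. `parse_marker` is a hypothetical helper, not part of the notebook, and it treats any answer other than `Yes` without a count as zero flyers, which is an assumption.

```python
def parse_marker(title, description):
    """Split a marker title and description into keyword, place name and flyer count."""
    # "Flyer 0.0-gas station-Kwik Trip" -> ["Flyer 0.0", "gas station", "Kwik Trip"]
    # maxsplit=2 keeps hyphens that belong to the place name itself.
    _, keyword, place_name = title.split("-", 2)

    # "Yes" -> 1 flyer, "No" -> 0 flyers, "Yes-2" -> 2 flyers
    answer, _, count = description.partition("-")
    if count:
        n_flyers = int(count)
    else:
        n_flyers = 1 if answer == "Yes" else 0
    return keyword, place_name, n_flyers


parsed = [
    parse_marker("Flyer 0.0-gas station-Kwik Trip", "Yes"),
    parse_marker("Flyer 12-college-McNeal Hall", "Yes-2"),
    parse_marker("Marker 13-gas station-Minnoco", "No"),
]
```

Limiting the split to two separators is a deliberate difference from the cell above: a hyphenated place name would otherwise produce an extra column and break the two-column assignment.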

In [130]:
pd.read_csv('data/csv/placed_day2.csv').head()

Unnamed: 0,Folder name,Folder color,Latitude,Longitude,Title,Description,Color,Phone number,Timestamp
0,Day_00,ff2196f3,45.130812,-93.355416,Flyer 0.0-gas station-Kwik Trip,Yes,ff2196f3,,11/10/2019 12:46
1,Day_00,ff2196f3,45.317972,-93.935816,Flyer 0.1-park-Lake Maria State Park,Yes,ff2196f3,,11/10/2019 12:48
2,Day_00,ff2196f3,45.136333,-93.16928,Flyer 0.2-laundry-Maytag Laundry,Yes,ff2196f3,,11/10/2019 12:50
3,,ff71b300,44.985114,-93.183377,Flyer 12-college-McNeal Hall,Yes-2,ff71b300,,11/16/2019 16:06
4,,ff71b300,45.059031,-93.198671,Marker 13-gas station-Minnoco,No,fff44336,,11/16/2019 16:58


In [207]:
df_visit_day2= proc_visited_locs('data/csv/placed_day2.csv','data/open_pos_df.pkl')

In [208]:
df_visit_day2.head()

Unnamed: 0,Timestamp,place_name,keyword,n_flyers,latlong,geodesic_m,nearest_position,nearest_position_mi
0,2019-10-16 11:46:00,Kwik Trip,gas station,1,"[45.130812, -93.355416]",29075.875029,Anoka County Crew: Andover Field Crew Member,6.673407
1,2019-10-16 14:23:00,Lake Maria State Park,park,1,"[45.317972, -93.935816]",78569.048931,Three Rivers Crews: Central Minnesota Field Cr...,18.088296
2,2019-10-19 18:12:00,Maytag Laundry,laundry,1,"[45.136333, -93.16928]",20657.421746,Anoka County Crew: Andover Field Crew Member,8.552833
3,2019-11-16 16:06:00,McNeal Hall,college,2,"[44.985114, -93.183377]",9160.545053,Youth Outdoors Crews #1-2: Youth Outdoors Crew...,1.47906
4,2019-11-16 16:58:00,Minnoco,gas station,0,"[45.059031, -93.198671]",14584.239211,Youth Outdoors Crews #1-2: Youth Outdoors Crew...,6.147447


In [205]:
details_day2= gpdetails_df(df_visit_day2,'place_name','latlong',fpath='data/json/details_visited_day2.json',id_only=False)

In [30]:
def add_details(details_fpath, df_visited):
    with open(details_fpath, 'r', encoding='utf-8') as fp:
        df = json_normalize(json.load(fp), sep='_')
    
    df = drop_group_rename(df,pg_rev)
    df['dest_addr'] = df['dest_addr'].str.replace(r', ?(?:USA|United States)','',regex=True)
    df = df.combine_first(df_visited)
    df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    df['n_flyers'] = df['n_flyers'].astype(int)
    # reorder columns
    df = df[['Timestamp','place_name','keyword','place_group','types','n_flyers','dest_addr','vicinity',
             'latlong', 'global_code','geodesic_m', 'nearest_position','nearest_position_mi', 'place_id']]
    return df

In [210]:
df_visit_details = add_details('data/json/details_visited_day2.json',df_visit_day2)
df_visit_details.to_pickle('data/visited_details_all_df.pkl')
df_visit_details.head()

Unnamed: 0,Timestamp,place_name,keyword,place_group,types,n_flyers,dest_addr,vicinity,latlong,global_code,geodesic_m,nearest_position,nearest_position_mi,place_id
0,2019-10-16 11:46:00,Kwik Trip #880,gas station,automotive,"[convenience_store, atm, gas_station, finance,...",1,"5801 96th Ave N, Brooklyn Park, MN 55443","5801 96th Avenue North, Brooklyn Park","[45.130751, -93.355403]",86Q84JJV+8R,29075.875029,Anoka County Crew: Andover Field Crew Member,6.673407,ChIJS4-HKik6s1IRU3_pSNE87kY
1,2019-10-16 14:23:00,Lake Maria State Park,park,park_camp,"[campground, tourist_attraction, lodging, park...",1,"11411 Clementa Ave NW, Monticello, MN 55362","11411 Clementa Avenue Northwest, Monticello","[45.315063, -93.951519]",86Q8828X+29,78569.048931,Three Rivers Crews: Central Minnesota Field Cr...,18.088296,ChIJVZOc2gWatFIRpusOTu6RuGs
2,2019-10-19 18:12:00,Maytag Laundry,laundry,services,"[laundry, point_of_interest, establishment]",1,"9010 Griggs Ave, Circle Pines, MN 55014","9010 Griggs Avenue, Circle Pines","[45.1363, -93.169299]",86Q84RPJ+G7,20657.421746,Anoka County Crew: Andover Field Crew Member,8.552833,ChIJy_rCW1oms1IRFFbSVWpCngA
3,2019-11-16 16:06:00,McNeal Hall,college,academic,"[university, point_of_interest, establishment]",2,"1985 Buford Ave, St Paul, MN 55108","1985 Buford Avenue, Saint Paul","[44.984688, -93.183543]",86P8XRM8+VH,9160.545053,Youth Outdoors Crews #1-2: Youth Outdoors Crew...,1.47906,ChIJqQeq7IIss1IRgIzIpO0CI4o
4,2019-11-16 16:58:00,Minnoco XPRESS,gas station,automotive,"[gas_station, point_of_interest, establishment]",0,"574 Old Hwy 8 NW, New Brighton, MN 55112","574 Old Highway 8 Northwest, New Brighton","[45.058945, -93.198574]",86Q83R52+HH,14584.239211,Youth Outdoors Crews #1-2: Youth Outdoors Crew...,6.147447,ChIJq6YIFNAus1IRNafiQLBN1oY


In [34]:
df_visit_details = pd.read_pickle('data/visited_details_all_df.pkl')

In [255]:
nbvisited = gsearchNB_df(df_visit_details,'keyword','latlong',fpath='data/json/searchNB_visited_day2_r25k.json',radius=25000)

In [29]:
df_nbvisit = proc_searchNB('data/json/searchNB_visited_day2_r25k.json')

In [30]:
df_pdf = pd.read_pickle('data/pdflyers_df.pkl')

In [31]:
df_nbvisit.shape, df_nbvisit.drop_duplicates('place_id').shape,df_nbvisit[~df_nbvisit.place_id.isin(df_pdf.place_id)].drop_duplicates('place_id').shape

((520, 12), (133, 12), (12, 12))

Given the growing list of known places and the limited set of keywords, further queries yield sharply diminishing returns. Of the 520 results returned, only 133 are unique (nearly 400 are duplicates within the query itself), and just 12 of those are not already in the existing list.
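
The three shapes above break down as 520 rows, 387 internal duplicates (520 - 133), 121 places already known (133 - 12), and 12 new places, a yield of about 2.3%. Below is a minimal sketch of that bookkeeping on plain ID lists. `dedupe_yield` is a hypothetical helper and is not claimed to match how `dedupe_placeid` counts its duplicates.

```python
def dedupe_yield(new_ids, known_ids):
    """Count internal duplicates, already-known IDs and genuinely new IDs."""
    unique_new = set(new_ids)
    internal = len(new_ids) - len(unique_new)       # repeats within the query results
    external = len(unique_new & set(known_ids))     # unique IDs already on file
    remaining = len(unique_new) - external          # IDs worth keeping
    yield_pct = round(100 * remaining / len(new_ids), 2) if new_ids else 0.0
    return {
        "total": len(new_ids),
        "internal": internal,
        "external": external,
        "remaining": remaining,
        "yield_pct": yield_pct,
    }


# Toy IDs for illustration only.
report = dedupe_yield(["a", "b", "a", "c", "d", "a"], known_ids=["c", "x"])
```

Tracking the yield per query batch gives a simple stopping rule: once it drops to a few percent, new keywords or new origins are likely a better use of API calls than repeating the same searches.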

In [32]:
df_nbvisit = df_nbvisit[~df_nbvisit.place_id.isin(df_pdf.place_id)].drop_duplicates('place_id').reset_index(drop=True)

In [53]:
dists_day2 = gdistances(df_nbvisit.latlong, config.geo.CENTERW,'data/json/distances_W_NBday2_r25k.json')

In [56]:
df_nbvisit = add_distances(df_nbvisit,'data/json/distances_W_NBday2_r25k.json')

In [62]:
pd.concat([pd.read_pickle('data/pdflyers_df.pkl'), df_nbvisit], sort=False).to_pickle('data/pdflyers_df.pkl')

### Open Positions

In [78]:
def strip_US(address):
    rgx = re.compile(r', ?(USA|United States)')
    return rgx.sub('',address)

def geocode_df(df_geo, coords_col='latlong', addr_col='dest_addr', geocoder='here', reverse=True):
    df = df_geo.copy()
    #usecol,assigncol = (coords_col,addr_col) if reverse else (addr_col,coords_col)
    if geocoder == 'here': 
        geoc = geopy.Here(config.HERE_APPID,config.HERE_APPCODE)
        gc_func = geoc.reverse if reverse else geoc.geocode
        addr_proc_fn = lambda x: strip_US(gc_func(x).raw['Location']['Address']['Label']) # here
    elif geocoder == 'gmaps': 
        geoc = geopy.GoogleV3(config.GMAPS_KEY)
        gc_func = geoc.reverse if reverse else geoc.geocode
        addr_proc_fn = lambda x: strip_US(ggeocode(geoc,x).address) # gmaps
    elif geocoder == 'bing': 
        geoc = geopy.Bing(config.BING_KEY)
        gc_func = geoc.reverse if reverse else geoc.geocode
        addr_proc_fn = lambda x: strip_US(gc_func(x).address) # bing
    else: 
        raise NotImplementedError('Valid options are (here|gmaps|bing)')
    
    if not reverse:
        df[coords_col] = df[addr_col].apply(lambda x: np.round(gc_func(x).point[:2],6))
        return df
    
    df[addr_col] = df[coords_col].apply(addr_proc_fn)
    return df
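
`strip_US` wraps the country alternation in a group, and that grouping matters. Without it, `|` splits the whole pattern, so the optional comma applies only to `USA`. A small sketch of the difference, using a made-up address:

```python
import re

GROUPED = re.compile(r", ?(USA|United States)")   # same pattern as strip_US
UNGROUPED = re.compile(r", ?USA|United States")   # alternation splits the whole pattern

# Made-up address for illustration.
addr = "1 Example St, Anytown, MN 55000, United States"

grouped = GROUPED.sub("", addr)      # comma and country removed together
ungrouped = UNGROUPED.sub("", addr)  # country removed, ", " left dangling
```

Both patterns behave the same on addresses ending in `, USA`, so the difference only appears when a geocoder spells out the country name.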

In [83]:
def proc_openpos(opos_fpath, geocoder='here'):
    # Get open positions from the CCM website
    with open(opos_fpath,'r', encoding='utf-8') as fp:
        co_geo = json.load(fp)
    opgeo_df = pd.DataFrame([x['properties'] for x in co_geo['features']])
    df = geocode_df(opgeo_df,addr_col='Address', geocoder=geocoder, reverse=False)
    df['geodesic_m'] = df['latlong'].apply(lambda x: geodist.geodesic(config.geo.CENTERW,x).meters)
    return df

In [85]:
oppos_df = proc_openpos('data/geospatial/Currently_Open.csv.geojson',geocoder='gmaps')
oppos_df.to_pickle('data/open_pos_df.pkl')

In [87]:
oppos_df = pd.read_pickle('data/open_pos_df.pkl')

### Nearby open

In [88]:
def build_queries(df_openpos, keep_types=None):
    if keep_types is None:
        # 'government_office' is not a real type, 'local_government_office' resulted in worse search results
        keep_types = ['book_store','city_hall','convenience_store','gas_station', 
                      'laundry', 'library','university', 'government_office']
    
    kt_queries = [x.replace('_', ' ') for x in keep_types]
    nearpos_unq = df_openpos.drop_duplicates('Address').query('geodesic_m < 100000 & geodesic_m > 55')['latlong']
    qmap = [(pos, ktq) for pos in nearpos_unq for ktq in kt_queries]
    return qmap
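
`build_queries` pairs every nearby position with every keyword, so the number of Nearby Search requests is the product of the two list lengths. Here is a minimal sketch of the same pairing with `itertools.product`. The coordinates are made up and `pair_queries` is a hypothetical name.

```python
import itertools


def pair_queries(positions, place_types):
    """Pair each position with each place type, with underscores turned into spaces."""
    keywords = [t.replace("_", " ") for t in place_types]
    # product() iterates positions in the outer loop and keywords in the inner loop,
    # matching the order of the nested comprehension in build_queries.
    return list(itertools.product(positions, keywords))


# Made-up coordinates for illustration.
pairs = pair_queries(
    positions=[(45.0, -93.0), (45.1, -93.2)],
    place_types=["gas_station", "library", "city_hall"],
)
```

Two positions and three types give six queries. Checking `len(pairs)` before calling the API is a cheap way to estimate the request count.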

In [93]:
qmap = build_queries(oppos_df)
nbopenpos = try_jsave([gsearchNB(q, co, radius=15000) for co,q in qmap],'data/json/searchNB_opos_typeKW_r15k.json')

In [133]:
df_nbo = proc_searchNB('data/json/searchNB_opos_typeKW_r15k.json')

In [134]:
df_nbo = dedupe_placeid(df_nbo)

Starting samples: 1775
Duplicates: (Internal: 965, External: 1354)
Remaining unique entries: 327
Yield: 18.42%, (Drop rate: 81.58%)


In [139]:
nbo_dists = gdistances(df_nbo.latlong, config.geo.CENTERW, 'data/json/distances_opos_r15k.json')

In [142]:
df_nbo = add_distances(df_nbo,'data/json/distances_opos_r15k.json')

In [143]:
pd.concat([pd.read_pickle('data/pdflyers_df.pkl'), df_nbo], sort=False).to_pickle('data/pdflyers_df.pkl')

In [None]:
nearpos_df = pd.read_pickle('data/nearpos_df.pkl')
pdflyers_df = pd.read_pickle('data/pdflyers_df.pkl')
visited_df = pd.read_pickle('data/flyers_df.pkl')
openpos_df = pd.read_pickle('data/open_pos_df.pkl')
placed_df = pd.read_pickle('data/flyers_all_df.pkl')

### Group Name Analysis

In [132]:
types_list = [y for x in pdflyers_df.types for y in x]

In [153]:
flys_df = pdflyers_df.reset_index(drop=True)

In [184]:
types_df = pd.concat([flys_df[['place_name','keyword','place_group']],
           pd.DataFrame(flys_df.types.tolist(), 
                        columns=[f'type_{i}' for i in range(flys_df.types.apply(len).max())])],axis=1)

In [345]:
flys_df[flys_df.place_name.str.contains(r'[^\w]Rec(?:reation)? center', case=False)].head(5)

Unnamed: 0,place_name,latlong,vicinity,dest_addr,keyword,types,rating,n_ratings,place_id,place_group,geodesic_meters,travel_meters,travel_secs,origin
64,Griggs Recreation Center,"[44.965402, -93.150134]","1188 Hubbard Ave, St Paul","1188 Hubbard Ave, St Paul, MN 55104",Recreational Facility,"[point_of_interest, establishment]",4.5,26,ChIJ469k-08rs1IR99hEZOzo2SY,other,19910.362355,24534,1572,"[45.143366, -93.120994]"
65,St Paul Recreation Center,"[44.97146, -93.048825]","1020 Duluth St, St Paul","1020 Duluth St, St Paul, MN 55106",Recreational Facility,"[point_of_interest, establishment]",4.2,11,ChIJiUw508TUslIRjc6dj4W9r1E,other,19932.219256,30322,1562,"[45.143366, -93.120994]"
66,Wilder Recreation Center,"[44.969889, -93.076921]","958 Jessie St, St Paul","958 Jessie St, St Paul, MN 55130",Recreational Facility,"[point_of_interest, establishment]",4.1,20,ChIJgQGmSTjVslIREh7SevWVKgk,other,19589.108651,28437,1318,"[45.143366, -93.120994]"
67,Saint Paul Recreation Center,"[44.952654, -93.184722]","2000 St Anthony Ave, St Paul","2000 St Anthony Ave, St Paul, MN 55104",Recreational Facility,"[point_of_interest, establishment]",2.0,1,ChIJ5RHMFPwp9ocRtZIppZBB2Dk,other,21780.841782,28586,1662,"[45.143366, -93.120994]"
68,Rice Recreation Center,"[44.972775, -93.110451]","1021 Marion St, St Paul","1021 Marion St, St Paul, MN 55117",Recreational Facility,"[local_government_office, point_of_interest, e...",4.0,6,ChIJp4wvr7wqs1IRHC8xfM_Bxho,government,18976.433244,28691,1351,"[45.143366, -93.120994]"


In [338]:
cv = CountVectorizer(ngram_range=(2, 11),stop_words=['establishment','point_of_interest','food','store'],min_df=2)
typec = cv.fit_transform(flys_df.types.str.join(' '))
pd.Series(typec.sum(axis=0).A1,[', '.join(x.split()) for x in cv.get_feature_names()]).sort_values(ascending=False)[:25]

gym, health                                42
cafe, restaurant                           37
pharmacy, health                           31
bar, restaurant                            28
doctor, health                             27
atm, finance                               26
electronics_store, home_goods_store        25
hair_care, health                          23
beauty_salon, hair_care                    22
bakery, meal_takeaway                      21
post_office, finance                       20
church, place_of_worship                   20
meal_delivery, restaurant                  19
meal_takeaway, cafe, restaurant            17
meal_takeaway, cafe                        17
bakery, meal_takeaway, cafe, restaurant    16
bakery, meal_takeaway, cafe                16
cafe, bakery                               16
grocery_or_supermarket, supermarket        15
cafe, bakery, restaurant                   13
dentist, health                            13
bakery, restaurant                

In [337]:
filtypes = flys_df.types.apply(lambda x: [y for y in x if y not in ['establishment','point_of_interest']])
Counter([*itertools.chain(*map(lambda x: itertools.combinations(sorted(x),3),filtypes))]).most_common(15)

[(('cafe', 'food', 'store'), 96),
 (('food', 'restaurant', 'store'), 83),
 (('food', 'grocery_or_supermarket', 'store'), 57),
 (('cafe', 'food', 'restaurant'), 53),
 (('cafe', 'restaurant', 'store'), 53),
 (('bakery', 'food', 'store'), 47),
 (('bakery', 'food', 'restaurant'), 40),
 (('bakery', 'restaurant', 'store'), 40),
 (('bakery', 'cafe', 'food'), 38),
 (('bakery', 'cafe', 'restaurant'), 38),
 (('bakery', 'cafe', 'store'), 38),
 (('health', 'pharmacy', 'store'), 36),
 (('convenience_store', 'food', 'store'), 35),
 (('bar', 'food', 'restaurant'), 33),
 (('food', 'grocery_or_supermarket', 'supermarket'), 29)]
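
The co-occurrence count above sorts each place's types before taking combinations, so that the same set of types always maps to the same tuple regardless of the order in which the API returned them. A minimal sketch on toy data (the type lists below are made up, and `type_combos` is a hypothetical name):

```python
from collections import Counter
from itertools import combinations

GENERIC = {"establishment", "point_of_interest"}


def type_combos(type_lists, size=3):
    """Count how often each sorted combination of specific types appears together."""
    counts = Counter()
    for types in type_lists:
        # Drop the generic tags, then sort so tuple order is canonical.
        specific = sorted(t for t in types if t not in GENERIC)
        counts.update(combinations(specific, size))
    return counts


# Made-up type lists for illustration.
toy = [
    ["cafe", "food", "store", "point_of_interest", "establishment"],
    ["store", "bakery", "cafe", "food", "establishment"],
    ["gym", "health", "establishment"],
]
counts = type_combos(toy)
```

Places with fewer specific types than `size` contribute nothing, which is why narrow categories such as `gym, health` can appear in the bigram counts but never in the triples.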

In [6]:
lgvec = spacy.load('en_vectors_web_lg')
kwlist = pd.Series(placedist_df.keyword.unique())
kwnlp = kwlist.apply(lgvec)
keywords_nlp = [lgvec(word) for word in placedist_df.keyword.unique()]

In [13]:
for kw in keywords_nlp:
    print(kw,keywords_nlp[0].similarity(kw))

Grocery Store 1.0
Library 0.4034122878941292
Gym 0.3965229374953233
Recreational Facility 0.3357720535102543
Church 0.303269721075901
Laundromat 0.4230004799201568
Coffee Shop 0.7037528864224344
Community Center 0.364748026759436
Union Hall 0.3333642744282254
Beauty Salon 0.39430854068295423
Bookstore 0.5803768972279834
Restaurant 0.5056170944476143
Bar 0.3874670304787477
Convenience Store 0.8632801182253251
Music Store 0.7018800601240646
Apartment complex 0.3623733390384342
Pharmacy 0.4818274937486199
Qdoba 0.03744478445549931
Panera Bread 0.37760299377586787
Caribou Coffee 0.393244702926105
Barnes & Noble 0.26353069124756967
Whole Foods 0.5179779057685318
Pot Belly Sandwich Shop 0.5705874247342473
Starbucks 0.43917914241117906
Jimmy John's 0.25454560863923054
Hy-Vee 0.060538348127758726
Post Office 0.4014902817160768
Town Hall 0.4078356873927435
Barber Shop 0.6494405166328542
Beauty Parlor 0.37723857795717064
Nail Salon 0.39181234556402345
Ice Cream Stand 0.4188294740789496
Supermark

In [361]:
fasttext.wv.similarity(kwlist[0],kwlist[25])

0.091465205

## Visualization

In [47]:
nearpos_df = pd.read_pickle('data/nearpos_df.pkl')
pdflyers_df = pd.read_pickle('data/pdflyers_df.pkl')
visited_df = pd.read_pickle('data/flyers_df.pkl')
openpos_df = pd.read_pickle('data/open_pos_df.pkl')
visited_all = pd.read_pickle('data/flyers_all_df.pkl')

In [299]:
def cleanplaces_df(df, drop_cols=None):
    if drop_cols is None:
        drop_cols = ['vicinity','rating','n_ratings','place_id','geodesic_meters']
    folium_df = df.drop(columns=drop_cols).copy()
    folium_df['miles'] = (folium_df.travel_meters/METERMILE).round(2)
    folium_df['mins'] = (folium_df.travel_secs/60).round(1)
    folium_df = folium_df.drop(columns=['travel_meters','travel_secs'])
    return folium_df

def cleanvisted_df(df, keep_cols=None):
    if keep_cols is None:
        keep_cols = ['place_name','dest_addr','n_flyers','keyword','types','pinloc','Timestamp']
    return df[keep_cols].copy()
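
`cleanplaces_df` depends on a `METERMILE` constant that is not defined in this part of the notebook. The sketch below assumes it is the number of meters in an international mile (1609.344) and reproduces the two conversions on a single row, using the travel values from the Griggs Recreation Center row shown earlier. `travel_summary` is a hypothetical helper.

```python
# Assumption: METERMILE in the notebook is meters per international mile.
METERS_PER_MILE = 1609.344


def travel_summary(travel_meters, travel_secs):
    """Convert routing distance to miles (2 decimals) and duration to minutes (1 decimal)."""
    miles = round(travel_meters / METERS_PER_MILE, 2)
    mins = round(travel_secs / 60, 1)
    return miles, mins


miles, mins = travel_summary(24534, 1572)
```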

In [300]:
cleandf_pdist = cleanplaces_df(pdflyers_df)
cleandf_visit = cleanvisted_df(visited_df)
near_pos_df = openpos_df[openpos_df.geodesic_meters.lt(42000)]