# Sightseeing in New York City
**Extracting patterns from geolocated venues and events**

Machine learning, and in particular clustering algorithms, can be used to determine which geographical areas are commonly visited and “checked into” by a given user and which areas are not. Such geographical analyses enable a wide range of services, from location-based recommenders to advanced security systems, and in general provide a more personalized user experience. 

I will use these techniques to provide two flavours of predictive analytics:

First, I will build a simple recommender system that returns the most popular venues in a given area. In particular, k-means clustering can be applied to the dataset of geolocated events to partition the map into regions. For each region, we can rank the venues by number of visits. With this information, we can recommend venues and landmarks such as Times Square or the Empire State Building depending on the location of the user.
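The partition-then-rank idea can be sketched without any dependency. The snippet below is a minimal illustration, not the notebook's actual model: the check-ins are made up, the helper names (`nearest`, `mean`, `kmeans`) are hypothetical, the starting centroids are picked by hand, and distances are plain Euclidean distances on raw degrees.

```python
from collections import Counter, defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2
                      + (point[1] - centroids[i][1]) ** 2,
    )

def mean(points):
    """Component-wise mean of a non-empty list of (lon, lat) pairs."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(n_iter):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # an empty region keeps its previous centroid
        centroids = [mean(groups[i]) if groups[i] else c
                     for i, c in enumerate(centroids)]
    return centroids

# Made-up check-ins: (lon, lat, venue name)
checkins = [
    (-73.9855, 40.7580, 'Times Square'),
    (-73.9857, 40.7582, 'Times Square'),
    (-73.9851, 40.7577, 'TKTS Booth'),
    (-74.0445, 40.6892, 'Statue of Liberty'),
    (-74.0447, 40.6890, 'Statue of Liberty'),
    (-74.0440, 40.6895, 'Liberty Island Ferry'),
]
points = [(lon, lat) for lon, lat, _ in checkins]
centroids = kmeans(points, centroids=[points[0], points[3]])

# Rank venues by number of check-ins inside each region
top = defaultdict(Counter)
for lon, lat, name in checkins:
    top[nearest((lon, lat), centroids)][name] += 1

top_venues = {cid: counter.most_common(1)[0][0] for cid, counter in top.items()}
```

Scoring a new location then amounts to calling `nearest` on it and looking up the region's entry in `top_venues`.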

Second, I’ll determine geographical areas that are specific and personal to each user. In particular, I will use a density-based clustering technique such as DBSCAN to extract the areas where a user usually goes. This analysis can be used to determine whether a given data point is an _outlier_ with respect to the areas where the user normally checks in, and therefore to assign a "novelty" or "anomaly" score to the location of a given event.
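To make the outlier idea concrete, here is a small dependency-free DBSCAN sketch. It is only an illustration under stated assumptions: the check-ins are invented, `eps` and `min_samples` are arbitrary, and distances are Euclidean on raw degrees, whereas a real analysis would more likely use a geographic distance and a library implementation.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id >= 0, or -1 for noise."""
    labels = [None] * len(points)
    cid = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = -1              # provisionally noise
            continue
        cid += 1                        # i is a core point: start a new cluster
        labels[i] = cid
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid         # former noise becomes a border point
            if labels[j] is not None:
                continue                # already assigned
            labels[j] = cid
            reachable = region_query(points, j, eps)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point too: keep expanding
    return labels

# Made-up check-ins of one user: two habitual areas and one far-away event
user_checkins = [
    (-73.990, 40.730), (-73.991, 40.731), (-73.989, 40.729), (-73.990, 40.731),
    (-73.975, 40.752), (-73.976, 40.753), (-73.974, 40.752),
    (-73.800, 40.650),
]
labels = dbscan(user_checkins, eps=0.01, min_samples=3)
outliers = [p for p, label in zip(user_checkins, labels) if label == -1]
```

Points labelled `-1` belong to no dense area, so they are the candidates for a high novelty score.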

We will analyze these events using a public dataset shared by Gowalla, containing venue check-ins registered between 2008 and 2010. This notebook will cover some typical data science steps:

  - data acquisition
  - data preparation
  - data exploration
  
Thereafter, we will dive into two unsupervised learning techniques: *k-means* clustering for recommending popular venues, and *DBSCAN* clustering for detecting outliers.

## Imports

In [1]:
# utils
import os

# cassandra driver
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# serialize/deserialize models
import pickle

# augment data
import urllib.parse
import urllib.request
import json

In [2]:
# init
datadir = './data'

# connect to cassandra
contact_points = ['cassandra']

cluster = Cluster(contact_points)
session = cluster.connect()

### Extra: augment data information with wikipedia data

Fetch the Wikipedia page URL for a given topic, using the Wikipedia search API.

In [3]:
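def geturl(s):
    """Return the Wikipedia URL of the best match for s, or '' on failure."""
    query = urllib.parse.quote(s)
    url = 'https://en.wikipedia.org/w/api.php?action=opensearch&search={}&limit=1&format=json'.format(query)
    try:
        with urllib.request.urlopen(url) as resp:
            # opensearch returns [query, titles, descriptions, urls]
            return json.loads(resp.read().decode('utf-8'))[3][0]
    except (OSError, ValueError, IndexError):
        # network error, malformed JSON, or no match
        return ''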

In [4]:
geturl('Wall Street Bull')

'https://en.wikipedia.org/wiki/Wall_Street_Bull'
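The `[3][0]` indexing in `geturl` relies on the shape of the opensearch response: a four-element JSON array holding the query, the matching titles, their descriptions, and their URLs. The offline sketch below illustrates that shape; the helper names are hypothetical and both payloads are hand-written samples rather than captured API responses.

```python
import json
from urllib.parse import urlencode

def opensearch_url(topic):
    """Build the opensearch request URL for a topic."""
    params = {'action': 'opensearch', 'search': topic, 'limit': 1, 'format': 'json'}
    return 'https://en.wikipedia.org/w/api.php?' + urlencode(params)

def first_url(payload):
    """Return the first URL of an opensearch payload, or '' when nothing matched."""
    urls = json.loads(payload)[3]
    return urls[0] if urls else ''

# Hand-written payloads in the [query, titles, descriptions, urls] layout
sample = ('["Wall Street Bull", ["Wall Street Bull"], [""], '
          '["https://en.wikipedia.org/wiki/Wall_Street_Bull"]]')
no_match = '["zzzz", [], [], []]'

request_url = opensearch_url('Wall Street Bull')
found = first_url(sample)
missing = first_url(no_match)
```

Building the query string with `urlencode` encodes every parameter, not only the search term, which is one way to avoid formatting the URL by hand.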

### Load the model

In [5]:
cql_stmt = "SELECT model from lbsn.models where mid='kmeans'"
rows = session.execute(cql_stmt)
ml = pickle.loads(rows[0].model)

# prepared statement for getting the name of the top venue in a given cluster
cql_prepared = session.prepare("SELECT * from lbsn.kmeans_topvenues where cid= ? LIMIT ?")

### Score the geo-located event (lon, lat) 

- score the coordinates against the k-means model
- look up the venue's Wikipedia URL in the Cassandra table
- if it is not available, fetch the URL from Wikipedia
- cache it in Cassandra for further use

In [6]:
def score(lon, lat):
    cl = ml.predict([[lon, lat]])[0]
    
    keys = cluster.metadata.keyspaces['lbsn'].tables['kmeans_topvenues'].columns.keys()
    rows = session.execute(cql_prepared.bind((cl,1)))
    
    #package result as a dictionary
    d = dict(zip(keys,list(rows[0])))
    
    if d['url'] is None:
        # get the url from wikipedia
        d['url'] = geturl(d['name'])
        
        # cache it; bound parameters avoid quoting problems in the URL
        cql_stmt = "UPDATE lbsn.kmeans_topvenues SET url = %s WHERE cid = %s"
        session.execute(cql_stmt, (d['url'], d['cid']))

    return d

In [7]:
score(-74.01, 40.79)

{'cid': 156,
 'count': 25,
 'lat': 40.7759886667,
 'lon': -74.01438205,
 'name': 'Louisa Park',
 'url': 'https://en.wikipedia.org/wiki/Louisa_Parr'}
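The lookup-then-cache step in `score` is an instance of the cache-aside pattern. The sketch below isolates it, with a plain dictionary standing in for the Cassandra table and a stub standing in for the Wikipedia call; `get_url`, `fake_fetch`, and the `example.org` URL are hypothetical names used only for illustration.

```python
def get_url(venue, cache, fetch):
    """Return the cached URL for venue, fetching and storing it on a miss."""
    url = cache.get(venue)
    if url is None:
        url = fetch(venue)      # slow path: ask the external service
        cache[venue] = url      # remember the answer for the next caller
    return url

calls = []

def fake_fetch(name):
    """Stub for the remote lookup; records each call so misses can be counted."""
    calls.append(name)
    return 'https://example.org/' + name.replace(' ', '_')

cache = {}
first = get_url('Louisa Park', cache, fake_fetch)   # miss: fetches and caches
second = get_url('Louisa Park', cache, fake_fetch)  # hit: served from the cache
```

In this sketch an empty string returned by the fetcher is cached as well, so a failed lookup would not be retried on the next request.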

### Recommender and Rendering output

In [8]:
# Tip:
# The smallest possible templating engine in Python, 
# including variable substitutions!

#variables
d = {'action':'open', 'name':'sesame'}

# template engine!
"{action}, {name}!".format(**d)

'open, sesame!'

In [9]:
def html_template(d):
    def link(url, text):
        return '<a href="{}">{}</a>'.format(url, text) if url else text
    
    # url to html tags
    d['url_html'] = link(d['url'], d['name'])
    
    # template!
    tmpl = 'What about visiting &nbsp; {url_html}?'
    
    #render
    output = tmpl.format(**d)
    
    return output

def recommender(lon, lat, format='json', notebook=False):
    d = score(lon, lat)
    
    # render as an HTML snippet or as a JSON document
    output = html_template(d) if format == 'html' else json.dumps(d)
    
    if notebook:
        from IPython.display import HTML
        return HTML(output)
    else:
        return output

In [10]:
recommender(-74.01, 40.7, 'html', notebook=True)

In [11]:
recommender(-74.97, 41.51, 'html', notebook=True)

In [12]:
recommender(-74, 40.55, 'html')

'What about visiting &nbsp; <a href="https://en.wikipedia.org/wiki/Coney_Island_boardwalk">Coney Island Boardwalk</a>?'
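Venue names in a check-in dataset are user-generated, so it may be worth escaping them before they are placed in HTML. The following is a possible hardened variant of the inner `link` helper, offered as a sketch rather than as part of the original notebook; it uses only `html.escape` from the standard library.

```python
from html import escape

def link(url, text):
    """Render text as a link to url, escaping both for safe use in HTML."""
    safe_text = escape(text)
    if not url:
        return safe_text
    return '<a href="{}">{}</a>'.format(escape(url, quote=True), safe_text)

with_url = link('https://example.org/?a=1&b=2', 'A & B')
without_url = link('', 'Pizza <3')
```

With this variant a name containing `<` or `&` is shown literally instead of being interpreted as markup.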

### Building the REST service

In [None]:
from flask import Flask
app = Flask("venue_recommender")

@app.route("/hello")
def hello():
    return "hi, there"

@app.route("/api/venues/recommender/<lon>,<lat>")
def recommender_api(lon, lat):
    return recommender(float(lon), float(lat), 'html')

app.run(host='0.0.0.0')

Try http://localhost:5000/api/venues/recommender/-74.01,40.7
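The route above passes `lon` and `lat` straight to `float`, so a malformed or out-of-range value surfaces as an unhandled exception. One possible refinement is a small validation helper such as the sketch below; `parse_coords` is a hypothetical name, and how the error is turned into an HTTP response is left to the service.

```python
def parse_coords(lon, lat):
    """Convert path segments to floats and check they are valid coordinates."""
    lon, lat = float(lon), float(lat)   # raises ValueError on non-numeric input
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError('coordinates out of range: {}, {}'.format(lon, lat))
    return lon, lat

ok = parse_coords('-74.01', '40.7')

try:
    parse_coords('-74.01', '140.7')
    rejected = False
except ValueError:
    rejected = True
```

Inside the route, the `ValueError` could be caught and mapped to a 400 response so that clients can tell a bad request from a server failure.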