### Use Case

Our customer has to relocate from their current home in Manhattan to Seattle for a job. However, they love their current home in Stuytown, East Village, Manhattan.

Following recommendations from friends, the customer has shortlisted 5 potential apartments in Seattle that fit their budget.

The goal of this project is to determine which of those apartments is in an area most similar to their current home in Manhattan.

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# function for flattening JSON into a pandas dataframe
# (pandas.io.json.json_normalize is deprecated since pandas 1.0)
from pandas import json_normalize

import folium # plotting library
from bs4 import BeautifulSoup

import json
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [2]:
home_address = "319, Avenue C, New York"
seattle_areas = ['Downtown', 'Central', 'Lake Union', 'Capitol Hill', 'Rainier Valley', 'Beacon Hill', 'West']

Because we call the [GeoPy](https://geopy.readthedocs.io/en/) library repeatedly, we wrap it in a helper function that returns the latitude and longitude for a given address.

In [3]:
def find_location(address):
    geolocator = Nominatim(user_agent="to_explorer")
    location = geolocator.geocode(address)
    # geocode() returns None when the address cannot be resolved
    if location is None:
        raise ValueError('Could not geocode address: {}'.format(address))
    latitude = location.latitude
    longitude = location.longitude
    print('The geograpical coordinate of {} are {}, {}.'.format(address, latitude, longitude))
    return (latitude, longitude)
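
The notebook geocodes each address more than once: first inside the venue lookup, and again later when the latitude and longitude columns are added. Public geocoding services are usually rate-limited, so caching the result per address may be worthwhile. The sketch below is one way to do that; `make_cached_lookup` and `fake_geocode` are hypothetical names introduced here, and the fake geocoder only stands in for `find_location` so the example runs without network access.

```python
from functools import lru_cache


def make_cached_lookup(geocode):
    """Wrap a geocoding callable so each address is resolved at most once."""
    @lru_cache(maxsize=None)
    def lookup(address):
        return geocode(address)
    return lookup


# Stand-in for find_location (assumption: same signature, returns a tuple).
calls = []

def fake_geocode(address):
    calls.append(address)
    return (47.6238307, -122.3183689)


cached = make_cached_lookup(fake_geocode)
first = cached("Capitol Hill, Seattle")
second = cached("Capitol Hill, Seattle")  # served from the cache
print(first == second, len(calls))
```

In the notebook itself, `make_cached_lookup(find_location)` could replace direct calls to `find_location`.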

We read the Foursquare developer API credentials from a JSON file and set the other parameters required for calling the API.


In [4]:
with open("fs_creds.json") as f: # close the file once the credentials are loaded
    creds = json.load(f)
CLIENT_ID = creds['id'] # your Foursquare ID
CLIENT_SECRET = creds['secret'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 500
radius = 2000 

Given a textual address, this function first fetches the geographic coordinates using GeoPy.
It then finds the local attractions using the Foursquare API, formats the data, and returns it as a pandas DataFrame.

In [5]:
def whats_near_me(address):
    latitude, longitude = find_location(address)
    
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            latitude, 
            longitude, 
            radius, 
            LIMIT)
            
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # one (location, venue name, primary category) tuple per venue
    rows = [(
            address,
            v['venue']['name'],
            v['venue']['categories'][0]['name']) for v in results]

    nearby_venues = pd.DataFrame(rows, columns=[
        'Location',
        'Attraction',
        'Venue Category'])
    
    return nearby_venues
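
The function above relies on a specific nesting in the response: `response` → `groups[0]` → `items`, where each item carries a `venue` with a `name` and a `categories` list. The sketch below isolates that parsing step so it can be tried without API credentials. `extract_venues` is a hypothetical helper, the sample payload is hand-written rather than real API output, and the `"Unknown"` fallback for venues with no category is an assumption added here (the original code would raise `IndexError` in that case).

```python
def extract_venues(address, payload):
    """Return (location, venue name, category) tuples from an explore-style payload."""
    items = payload["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use a placeholder instead of failing when a venue has no category.
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((address, venue["name"], category))
    return rows


# Hand-written sample that mimics the nesting used in whats_near_me.
sample = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "The Roost", "categories": [{"name": "Bar"}]}},
        {"venue": {"name": "Uncategorised Spot", "categories": []}},
    ]}]}
}

rows = extract_venues("319, Avenue C, New York", sample)
for row in rows:
    print(row)
```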

In [6]:
manhattan = whats_near_me(home_address)

The geograpical coordinate of 319, Avenue C, New York are 40.7317433, -73.9745526.


In [82]:
manhattan

Unnamed: 0,Location,Attraction,Venue Category
0,"319, Avenue C, New York",The Roost,Bar
1,"319, Avenue C, New York",Boris & Horton,Pet Café
2,"319, Avenue C, New York",Hawa Smoothies & Bubble Tea,Juice Bar
3,"319, Avenue C, New York",Smør,Scandinavian Restaurant
4,"319, Avenue C, New York",Barnyard,Cheese Shop
5,"319, Avenue C, New York",Taverna Kyclades,Greek Restaurant
6,"319, Avenue C, New York",Westville East,American Restaurant
7,"319, Avenue C, New York",Juice Vitality,Juice Bar
8,"319, Avenue C, New York",Malt & Mold,Gourmet Shop
9,"319, Avenue C, New York",Tompkins Square Bagels,Bagel Shop


### One-hot encode the venue categories to get one row per location

In [7]:
# one hot encoding
manh_onehot = pd.get_dummies(manhattan[['Venue Category']], prefix="", prefix_sep="")

# add the Location column back to the dataframe
manh_onehot['Location'] = manhattan['Location'] 

# move the Location column to the first position
fixed_columns = [manh_onehot.columns[-1]] + list(manh_onehot.columns[:-1])
manh_onehot = manh_onehot[fixed_columns]

In [8]:
manh_grouped = manh_onehot.groupby('Location').sum().reset_index()
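
To make the two steps above concrete, here is a minimal sketch on three rows taken from the Manhattan output (one Bar and two Juice Bars). One-hot encoding produces one indicator column per category, and grouping by `Location` with `sum()` collapses those indicators into a single row of per-category counts.

```python
import pandas as pd

venues = pd.DataFrame({
    "Location": ["319, Avenue C, New York"] * 3,
    "Venue Category": ["Bar", "Juice Bar", "Juice Bar"],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")
onehot["Location"] = venues["Location"]

# Summing the indicators per location yields venue counts per category.
counts = onehot.groupby("Location").sum().reset_index()
print(counts)
```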

We repeat the steps performed for Manhattan for all areas the user is considering in Seattle.

In [9]:
for area in seattle_areas:
    whats_near_it = whats_near_me(area + ", Seattle")
        
    onehot = pd.get_dummies(whats_near_it[['Venue Category']], prefix="", prefix_sep="")
    onehot['Location'] = whats_near_it['Location'] 
    

    fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
    onehot = onehot[fixed_columns]
    
    grouped = onehot.groupby('Location').sum().reset_index()
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    manh_grouped = pd.concat([manh_grouped, grouped], ignore_index=True, sort=False)

The geograpical coordinate of Downtown, Seattle are 47.6048723, -122.3334582.
The geograpical coordinate of Central, Seattle are 47.6038321, -122.3300624.
The geograpical coordinate of Lake Union, Seattle are 47.63991865, -122.33555809202913.
The geograpical coordinate of Capitol Hill, Seattle are 47.6238307, -122.3183689.
The geograpical coordinate of Rainier Valley, Seattle are 47.552544, -122.2908894.
The geograpical coordinate of Beacon Hill, Seattle are 47.579257850000005, -122.31159768732729.
The geograpical coordinate of West, Seattle are 47.6038321, -122.3300624.
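
Note that "Central, Seattle" and "West, Seattle" resolve to exactly the same coordinates, which are also the coordinates the notebook later prints for the bare query "Seattle". This suggests the geocoder may have fallen back to the city centre for both queries, in which case the two rows describe the same set of venues. A small check like the following sketch can flag such collisions; the coordinates are copied from the notebook output, and the grouping logic is an illustration rather than part of the original analysis.

```python
from collections import defaultdict

# Coordinates as printed by find_location in this notebook.
coords = {
    "Seattle": (47.6038321, -122.3300624),
    "Downtown, Seattle": (47.6048723, -122.3334582),
    "Central, Seattle": (47.6038321, -122.3300624),
    "Capitol Hill, Seattle": (47.6238307, -122.3183689),
    "West, Seattle": (47.6038321, -122.3300624),
}

# Group addresses by the point they resolved to.
by_point = defaultdict(list)
for address, point in coords.items():
    by_point[point].append(address)

# Any point shared by several addresses deserves a closer look.
suspicious = {p: names for p, names in by_point.items() if len(names) > 1}
print(suspicious)
```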


In [10]:
all_areas = manh_grouped.groupby('Location').sum().reset_index()

In [86]:
all_areas

Unnamed: 0,Location,American Restaurant,Arcade,Art Gallery,Arts & Crafts Store,Australian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bar,...,Electronics Store,Food Truck,Golf Course,Golf Driving Range,Nightclub,Pharmacy,South American Restaurant,Thrift / Vintage Store,Warehouse Store,Wings Joint
0,"319, Avenue C, New York",2,1.0,2.0,1.0,1.0,1.0,1.0,3,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Beacon Hill, Seattle",1,0.0,0.0,0.0,0.0,2.0,1.0,3,1,...,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,"Capitol Hill, Seattle",1,0.0,0.0,1.0,0.0,0.0,0.0,4,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Central, Seattle",1,0.0,0.0,0.0,0.0,0.0,0.0,4,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Downtown, Seattle",1,0.0,0.0,0.0,0.0,0.0,0.0,4,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Lake Union, Seattle",2,1.0,0.0,0.0,0.0,1.0,0.0,3,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Rainier Valley, Seattle",2,0.0,0.0,0.0,0.0,0.0,0.0,2,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"West, Seattle",1,0.0,0.0,0.0,0.0,0.0,0.0,4,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
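
The mix of integer and float columns in the table above is most likely a side effect of stacking frames whose columns differ. When a category is missing from one area, the stacked frame holds `NaN` there, which forces the column to float; the final `groupby(...).sum()` then turns those gaps into `0.0`. The sketch below reproduces this on two made-up areas (`A` and `B` and their counts are hypothetical).

```python
import pandas as pd

a = pd.DataFrame({"Location": ["A"], "Bar": [2], "Bakery": [3]})
b = pd.DataFrame({"Location": ["B"], "Bakery": [4], "Food Truck": [1]})

# Categories absent from one frame become NaN after stacking.
stacked = pd.concat([a, b], ignore_index=True, sort=False)

# sum() skips NaN, so an all-NaN group sums to 0.0.
totals = stacked.groupby("Location").sum().reset_index()
print(totals)
print(totals.dtypes)
```

Categories present in every area (here `Bakery`) stay integer, while the others become float.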


In [11]:
num_top_venues = 5

for hood in all_areas['Location']:
    print("----"+hood+"----")
    temp = all_areas[all_areas['Location'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----319, Avenue C, New York----
                 venue  freq
0            Wine Shop   4.0
1          Pizza Place   4.0
2                 Park   4.0
3            Juice Bar   4.0
4  Japanese Restaurant   3.0


----Beacon Hill, Seattle----
         venue  freq
0  Coffee Shop   8.0
1         Park   6.0
2      Brewery   4.0
3   Food Truck   4.0
4   Taco Place   3.0


----Capitol Hill, Seattle----
                venue  freq
0         Coffee Shop   9.0
1  Italian Restaurant   4.0
2              Bakery   4.0
3        Cocktail Bar   4.0
4      Sandwich Place   3.0


----Central, Seattle----
                   venue  freq
0            Coffee Shop   7.0
1  Vietnamese Restaurant   6.0
2                  Hotel   6.0
3         Sandwich Place   4.0
4                 Bakery   4.0


----Downtown, Seattle----
                venue  freq
0               Hotel   8.0
1         Coffee Shop   7.0
2      Sandwich Place   4.0
3              Bakery   4.0
4  Seafood Restaurant   3.0


----Lake Union, Seattle---

In [12]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
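
As a quick illustration, the function can be applied to a single hand-built row (the counts below are made up). It skips the first entry, which holds the location name, sorts the remaining category counts in descending order, and returns the leading category names. Ties are broken in no guaranteed order, which may be why the top-5 printout and the top-10 table list Manhattan's tied categories in different sequences.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # drop the leading 'Location' entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Hypothetical counts for one location.
row = pd.Series({"Location": "Home", "Bar": 2, "Juice Bar": 4, "Bakery": 3})
top = return_most_common_venues(row, 2)
print(list(top))
```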

In [72]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Location']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

areas_attractions_sorted = pd.DataFrame(columns=columns)
areas_attractions_sorted['Location'] = all_areas['Location']

for ind in np.arange(all_areas.shape[0]):
    areas_attractions_sorted.iloc[ind, 1:] = return_most_common_venues(all_areas.iloc[ind, :], num_top_venues)

areas_attractions_sorted

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"319, Avenue C, New York",Juice Bar,Wine Shop,Park,Pizza Place,Coffee Shop,Japanese Restaurant,Ice Cream Shop,Gourmet Shop,Cocktail Bar,Bakery
1,"Beacon Hill, Seattle",Coffee Shop,Park,Food Truck,Brewery,Bakery,Taco Place,Pizza Place,Mexican Restaurant,Playground,Fish & Chips Shop
2,"Capitol Hill, Seattle",Coffee Shop,Cocktail Bar,Italian Restaurant,Bakery,Ice Cream Shop,Sandwich Place,Taco Place,Bar,Restaurant,Mexican Restaurant
3,"Central, Seattle",Coffee Shop,Vietnamese Restaurant,Hotel,Bakery,Cocktail Bar,Sandwich Place,French Restaurant,Deli / Bodega,Seafood Restaurant,Japanese Restaurant
4,"Downtown, Seattle",Hotel,Coffee Shop,Sandwich Place,Bakery,Japanese Restaurant,Vietnamese Restaurant,Cocktail Bar,Deli / Bodega,Seafood Restaurant,Dumpling Restaurant
5,"Lake Union, Seattle",Coffee Shop,Park,Café,Italian Restaurant,Restaurant,Bakery,Sandwich Place,Cocktail Bar,Bar,Scenic Lookout
6,"Rainier Valley, Seattle",Vietnamese Restaurant,Pizza Place,Coffee Shop,Bar,Mexican Restaurant,Pub,Brewery,Gym,Bank,Gas Station
7,"West, Seattle",Coffee Shop,Vietnamese Restaurant,Hotel,Bakery,Cocktail Bar,Sandwich Place,French Restaurant,Deli / Bodega,Seafood Restaurant,Japanese Restaurant


In [73]:
# set number of clusters
kclusters = 3

all_area_clustering = all_areas.drop(columns='Location')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=1).fit(all_area_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 0, 1, 1, 0, 2, 1], dtype=int32)

In [74]:
areas_attractions_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [75]:
areas_attractions_sorted

Unnamed: 0,Cluster Labels,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,"319, Avenue C, New York",Juice Bar,Wine Shop,Park,Pizza Place,Coffee Shop,Japanese Restaurant,Ice Cream Shop,Gourmet Shop,Cocktail Bar,Bakery
1,0,"Beacon Hill, Seattle",Coffee Shop,Park,Food Truck,Brewery,Bakery,Taco Place,Pizza Place,Mexican Restaurant,Playground,Fish & Chips Shop
2,0,"Capitol Hill, Seattle",Coffee Shop,Cocktail Bar,Italian Restaurant,Bakery,Ice Cream Shop,Sandwich Place,Taco Place,Bar,Restaurant,Mexican Restaurant
3,1,"Central, Seattle",Coffee Shop,Vietnamese Restaurant,Hotel,Bakery,Cocktail Bar,Sandwich Place,French Restaurant,Deli / Bodega,Seafood Restaurant,Japanese Restaurant
4,1,"Downtown, Seattle",Hotel,Coffee Shop,Sandwich Place,Bakery,Japanese Restaurant,Vietnamese Restaurant,Cocktail Bar,Deli / Bodega,Seafood Restaurant,Dumpling Restaurant
5,0,"Lake Union, Seattle",Coffee Shop,Park,Café,Italian Restaurant,Restaurant,Bakery,Sandwich Place,Cocktail Bar,Bar,Scenic Lookout
6,2,"Rainier Valley, Seattle",Vietnamese Restaurant,Pizza Place,Coffee Shop,Bar,Mexican Restaurant,Pub,Brewery,Gym,Bank,Gas Station
7,1,"West, Seattle",Coffee Shop,Vietnamese Restaurant,Hotel,Bakery,Cocktail Bar,Sandwich Place,French Restaurant,Deli / Bodega,Seafood Restaurant,Japanese Restaurant


In [76]:
areas_attractions_sorted['Latitude'],  areas_attractions_sorted['Longitude'] = zip(*areas_attractions_sorted['Location'].apply(find_location))

The geograpical coordinate of 319, Avenue C, New York are 40.7317433, -73.9745526.
The geograpical coordinate of Beacon Hill, Seattle are 47.579257850000005, -122.31159768732729.
The geograpical coordinate of Capitol Hill, Seattle are 47.6238307, -122.3183689.
The geograpical coordinate of Central, Seattle are 47.6038321, -122.3300624.
The geograpical coordinate of Downtown, Seattle are 47.6048723, -122.3334582.
The geograpical coordinate of Lake Union, Seattle are 47.63991865, -122.33555809202913.
The geograpical coordinate of Rainier Valley, Seattle are 47.552544, -122.2908894.
The geograpical coordinate of West, Seattle are 47.6038321, -122.3300624.


In [77]:
final_df = areas_attractions_sorted

In [78]:
latitude, longitude = find_location('Seattle')
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(final_df['Latitude'], final_df['Longitude'], final_df['Location'], final_df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster) - 1],
        fill=True,
        fill_color=rainbow[int(cluster) - 1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The geograpical coordinate of Seattle are 47.6038321, -122.3300624.


In [79]:
final_df.loc[final_df['Cluster Labels'] == 0]

Unnamed: 0,Cluster Labels,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Latitude,Longitude
0,0,"319, Avenue C, New York",Juice Bar,Wine Shop,Park,Pizza Place,Coffee Shop,Japanese Restaurant,Ice Cream Shop,Gourmet Shop,Cocktail Bar,Bakery,40.731743,-73.974553
1,0,"Beacon Hill, Seattle",Coffee Shop,Park,Food Truck,Brewery,Bakery,Taco Place,Pizza Place,Mexican Restaurant,Playground,Fish & Chips Shop,47.579258,-122.311598
2,0,"Capitol Hill, Seattle",Coffee Shop,Cocktail Bar,Italian Restaurant,Bakery,Ice Cream Shop,Sandwich Place,Taco Place,Bar,Restaurant,Mexican Restaurant,47.623831,-122.318369
5,0,"Lake Union, Seattle",Coffee Shop,Park,Café,Italian Restaurant,Restaurant,Bakery,Sandwich Place,Cocktail Bar,Bar,Scenic Lookout,47.639919,-122.335558
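
Cluster 0 contains the Manhattan home together with three Seattle areas, so the clustering alone does not single out one best match. One possible follow-up, which is not part of the original notebook, is to rank the candidate areas by how closely their category counts resemble the home row. The sketch below uses cosine similarity on made-up count vectors; `cosine_similarity`, the area names, and the numbers are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (1.0 = same mix)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical counts; columns: Coffee Shop, Park, Hotel.
profiles = {
    "Home": [2, 4, 0],
    "Area A": [8, 6, 0],
    "Area B": [7, 0, 8],
}

home = profiles["Home"]
ranking = sorted(
    ((name, cosine_similarity(home, vec))
     for name, vec in profiles.items() if name != "Home"),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(name, round(score, 3))
```

Cosine similarity compares the mix of venue types rather than the raw totals, so an area is not penalised simply for having more venues overall.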


In [80]:
final_df.loc[final_df['Cluster Labels'] == 1]

Unnamed: 0,Cluster Labels,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Latitude,Longitude
3,1,"Central, Seattle",Coffee Shop,Vietnamese Restaurant,Hotel,Bakery,Cocktail Bar,Sandwich Place,French Restaurant,Deli / Bodega,Seafood Restaurant,Japanese Restaurant,47.603832,-122.330062
4,1,"Downtown, Seattle",Hotel,Coffee Shop,Sandwich Place,Bakery,Japanese Restaurant,Vietnamese Restaurant,Cocktail Bar,Deli / Bodega,Seafood Restaurant,Dumpling Restaurant,47.604872,-122.333458
7,1,"West, Seattle",Coffee Shop,Vietnamese Restaurant,Hotel,Bakery,Cocktail Bar,Sandwich Place,French Restaurant,Deli / Bodega,Seafood Restaurant,Japanese Restaurant,47.603832,-122.330062


In [81]:
final_df.loc[final_df['Cluster Labels'] == 2]

Unnamed: 0,Cluster Labels,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Latitude,Longitude
6,2,"Rainier Valley, Seattle",Vietnamese Restaurant,Pizza Place,Coffee Shop,Bar,Mexican Restaurant,Pub,Brewery,Gym,Bank,Gas Station,47.552544,-122.290889
