# Capstone Project - The Battle of Neighborhoods 

# Description of Problem

In this project, I will try to find out the best place to start the restaurant business in New York, New York. New York is the most populous and densely populated city with estimated population of 8,500,000 people. Also, millions of tourists visit the city each year, making it a good place to start a restaurant business. However, the problem is that starting a restaurant business in the city becomes more competitive every year due to rising costs and visitors' expectations. In this notebook, we will be going over some of the features of New York City and find out which type of cusines would best fit each district. Current/future restaurant owners would benefit from the analysis.

In order to define which features should be used to determine whether a place is good, we will examine several criteria. For instance, if someone is looking to open a French restaurant in a certain location, then we will be looking at the features such as:
- Popular cuisine type of neighborhoods
- Number of specific restaurants in the neighborhood
- Percentage of restaurant in the neighborhood

Furthermore, we will be finding the most similar neighborhood in another city (Toronto) using:
- Cosine Similarity
- Euclidean Distance

and explain why such thing might have happened

We will be segmenting the city based on ZIP code and clustering neighborhoods. We will be looking at the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. 


# Description of Data

We will be using following data sources and libraries to solve the problem:
- New York City neighborhood dataset
 - https://geo.nyu.edu/catalog/nyu_2451_34572
 - https://cocl.us/new_york_dataset
- Toronto neighborhood dataset
 - https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
- Wikipedia Zip Code data
- Geocode 
 - Convert Zip Code data into longitude, latitude pair
- Foursquare API 
 - Get relevant information based on the location
- List of Cuisine Types
 - https://en.wikipedia.org/wiki/List_of_cuisines
- folium
 - Visualize data

We will be using various data sources, and we would spend most of the time cleaning and preparing the data. We will get geographic information of New York City using Wikipedia, Geocode, and other available dataset and use Foursquare API to obtain the restaurant information of the locations. Also, if the time allows, we will be visualizing data with folium.

In [1]:
# Import Libraries
import pandas as pd
import requests as rq
import numpy as np
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
!pip install folium
import folium
from matplotlib import pyplot as plt
from matplotlib import cm, colors
import json
from collections import defaultdict

from sklearn.metrics.pairwise import paired_euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity
# Download Data
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

Collecting folium
[?25l  Downloading https://files.pythonhosted.org/packages/72/ff/004bfe344150a064e558cb2aedeaa02ecbf75e60e148a55a9198f0c41765/folium-0.10.0-py2.py3-none-any.whl (91kB)
[K     |████████████████████████████████| 92kB 1.7MB/s eta 0:00:01
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.3.1 folium-0.10.0


In [2]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [3]:
neighborhoods_data = newyork_data['features']
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
df = pd.DataFrame(columns=column_names)

In [4]:
df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


In [5]:
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
    
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    df = df.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [6]:
df.head(10)


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
5,Bronx,Kingsbridge,40.881687,-73.902818
6,Manhattan,Marble Hill,40.876551,-73.91066
7,Bronx,Woodlawn,40.898273,-73.867315
8,Bronx,Norwood,40.877224,-73.879391
9,Bronx,Williamsbridge,40.881039,-73.857446


In [7]:
print(df.shape)

(306, 4)


In [8]:
CLIENT_ID =
CLIENT_SECRET = 
VERSION = '20180605' # Foursquare API version
LIMIT = '100'

We are only interested in restaurants/foods, so we will be just looking at the venues that have parent category "Food" 

Overall hiearchy can be found in :

https://developer.foursquare.com/docs/resources/categories

In [9]:
url = "https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}".format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION)
response = rq.get(url)
lst = response.json()['response']


In [10]:
categories = defaultdict(list)
def findParents(catlst=lst, parents=[]):
    for c in catlst['categories']:
        categories[c['name']] = parents + [c['name']]
        findParents(c, parents + [c['name']])
findParents()


In [11]:
print(categories)

defaultdict(<class 'list'>, {'Arts & Entertainment': ['Arts & Entertainment'], 'Amphitheater': ['Arts & Entertainment', 'Amphitheater'], 'Aquarium': ['Arts & Entertainment', 'Aquarium'], 'Arcade': ['Arts & Entertainment', 'Arcade'], 'Art Gallery': ['Arts & Entertainment', 'Art Gallery'], 'Bowling Alley': ['Arts & Entertainment', 'Bowling Alley'], 'Casino': ['Arts & Entertainment', 'Casino'], 'Circus': ['Arts & Entertainment', 'Circus'], 'Comedy Club': ['Arts & Entertainment', 'Comedy Club'], 'Concert Hall': ['Arts & Entertainment', 'Concert Hall'], 'Country Dance Club': ['Arts & Entertainment', 'Country Dance Club'], 'Disc Golf': ['Arts & Entertainment', 'Disc Golf'], 'Exhibit': ['Arts & Entertainment', 'Exhibit'], 'General Entertainment': ['Arts & Entertainment', 'General Entertainment'], 'Go Kart Track': ['Arts & Entertainment', 'Go Kart Track'], 'Historic Site': ['Arts & Entertainment', 'Historic Site'], 'Karaoke Box': ['Arts & Entertainment', 'Karaoke Box'], 'Laser Tag': ['Arts & E

In [12]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
#         print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

        # make the GET request
        try:
            results = rq.get(url).json()["response"]['groups'][0]['items']
        except:
            continue
        
        for v in results:
            if categories[v['venue']['categories'][0]['name']] != None and 'Food' in categories[v['venue']['categories'][0]['name']]:
        # return only relevant information for each nearby venue
                venues_list.append([(
                    name, 
                    lat, 
                    lng, 
                    v['venue']['name'], 
                    v['venue']['location']['lat'], 
                    v['venue']['location']['lng'],  
                    v['venue']['categories'][0]['name'])])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [13]:
nyc_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude'])

In [14]:
print(nyc_venues.shape)
nyc_venues.head(20)

(5418, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
2,Wakefield,40.894705,-73.847201,Cooler Runnings Jamaican Restaurant Inc,40.898276,-73.850381,Caribbean Restaurant
3,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop
4,Wakefield,40.894705,-73.847201,SUBWAY,40.890656,-73.849192,Sandwich Place
5,Wakefield,40.894705,-73.847201,Baychester Avenue Food Truck,40.892293,-73.84323,Food Truck
6,Wakefield,40.894705,-73.847201,Louis Pizza,40.898457,-73.84877,Pizza Place
7,Co-op City,40.874294,-73.829939,Capri II Pizza,40.876374,-73.82994,Pizza Place
8,Co-op City,40.874294,-73.829939,Baskin Robbins,40.870045,-73.829578,Ice Cream Shop
9,Co-op City,40.874294,-73.829939,Arby's,40.870518,-73.828657,Fast Food Restaurant


In [15]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [16]:
# one hot encoding
nyc_onehot = pd.get_dummies(nyc_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
nyc_onehot['Neighborhood'] = nyc_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [nyc_onehot.columns[-1]] + list(nyc_onehot.columns[:-1])
nyc_onehot = nyc_onehot[fixed_columns]

nyc_onehot.head()


Unnamed: 0,Neighborhood,Afghan Restaurant,African Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Asian Restaurant,Australian Restaurant,Austrian Restaurant,BBQ Joint,...,Tex-Mex Restaurant,Thai Restaurant,Tibetan Restaurant,Turkish Restaurant,Udon Restaurant,Varenyky restaurant,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Vietnamese Restaurant,Wings Joint
0,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
nyc_grouped = nyc_onehot.groupby('Neighborhood').mean().reset_index()
nyc_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,African Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Asian Restaurant,Australian Restaurant,Austrian Restaurant,BBQ Joint,...,Tex-Mex Restaurant,Thai Restaurant,Tibetan Restaurant,Turkish Restaurant,Udon Restaurant,Varenyky restaurant,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Vietnamese Restaurant,Wings Joint
0,Allerton,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Annadale,0.0,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Arden Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Arlington,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Arrochar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Restaurants'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Restaurants'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = nyc_grouped['Neighborhood']

for ind in np.arange(nyc_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(nyc_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(10)

Unnamed: 0,Neighborhood,1st Most Common Restaurants,2nd Most Common Restaurants,3rd Most Common Restaurants,4th Most Common Restaurants,5th Most Common Restaurants,6th Most Common Restaurants,7th Most Common Restaurants,8th Most Common Restaurants,9th Most Common Restaurants,10th Most Common Restaurants
0,Allerton,Deli / Bodega,Pizza Place,Dessert Shop,Food,Fast Food Restaurant,Fried Chicken Joint,Breakfast Spot,Chinese Restaurant,Spanish Restaurant,Donut Shop
1,Annadale,American Restaurant,Pizza Place,Restaurant,Diner,Bakery,English Restaurant,Fish & Chips Shop,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant
2,Arden Heights,Deli / Bodega,Pizza Place,Coffee Shop,Donut Shop,Diner,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Food Court,Empanada Restaurant
3,Arlington,American Restaurant,Wings Joint,English Restaurant,Food,Fish & Chips Shop,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Empanada Restaurant
4,Arrochar,Deli / Bodega,Italian Restaurant,Sandwich Place,Mediterranean Restaurant,Food Truck,Bagel Shop,Middle Eastern Restaurant,Pizza Place,Eastern European Restaurant,Egyptian Restaurant
5,Arverne,Sandwich Place,Coffee Shop,Pizza Place,Donut Shop,Thai Restaurant,Empanada Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant
6,Astoria,Middle Eastern Restaurant,Greek Restaurant,Seafood Restaurant,Bakery,Bubble Tea Shop,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Latin American Restaurant,Mediterranean Restaurant
7,Astoria Heights,Italian Restaurant,Pizza Place,Burger Joint,Bakery,Wings Joint,Ethiopian Restaurant,Food,Fish & Chips Shop,Filipino Restaurant,Fast Food Restaurant
8,Auburndale,Korean Restaurant,American Restaurant,Ice Cream Shop,Italian Restaurant,Noodle House,Fast Food Restaurant,Wings Joint,Fish & Chips Shop,Filipino Restaurant,Falafel Restaurant
9,Bath Beach,Chinese Restaurant,Donut Shop,Bubble Tea Shop,Pizza Place,Deli / Bodega,Italian Restaurant,Fast Food Restaurant,Sushi Restaurant,Peruvian Restaurant,Diner


In [19]:
# set number of clusters
kclusters = 5

nyc_grouped_clustering = nyc_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(nyc_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 1, 3, 1, 0, 3, 3, 3, 3], dtype=int32)

In [20]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

nyc_merged = df

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
nyc_merged = nyc_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# Drop NaN
nyc_merged.dropna(axis=0, how='any',inplace=True)
nyc_merged.reset_index(inplace=True, drop=True)

### Let's map the neighborhoods

In [21]:
# Used Google for finding NYC latitude and longitude
# 40.7128° N, 74.0060° W

nyc_lat = 40.7128
nyc_lon = -74.0060

# create map
map_clusters = folium.Map(location=[nyc_lat, nyc_lon], zoom_start=11)



# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(nyc_merged['Latitude'], nyc_merged['Longitude'], nyc_merged['Neighborhood'], nyc_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # has to cast int to 'cluster' since it has a type float
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Let's find out which neighborhoods belong to each cluster

In [42]:
for i in range(kclusters):
    print('Cluster ' + str(i) + ': \n {}'.format(', '.join(nyc_merged[nyc_merged["Cluster Labels"] == i]["Neighborhood"])))

Cluster 0: 
 Wakefield, Co-op City, Kingsbridge, Woodlawn, Norwood, Baychester, Bedford Park, University Heights, Morris Heights, Fordham, East Tremont, High  Bridge, Melrose, Mott Haven, Longwood, Morrisania, Parkchester, Westchester Square, Morris Park, North Riverdale, Schuylerville, Castle Hill, Pelham Gardens, Unionport, Manhattan Terrace, Crown Heights, Cypress Hills, Starrett City, Manhattan Beach, Borough Park, City Line, Bergen Beach, Midwood, Prospect Park South, Richmond Hill, East Elmhurst, Maspeth, Glendale, Ozone Park, Glen Oaks, Bellerose, Kew Gardens Hills, Fresh Meadows, Rochdale, Springfield Gardens, Far Rockaway, Beechhurst, Edgemere, Arverne, Floral Park, Holliswood, Lindenwood, Rockaway Park, St. George, Castleton Corners, New Springville, Great Kills, Eltingville, Annadale, Dongan Hills, Grant City, Pleasant Plains, Rossville, Greenridge, Heartland Village, Bulls Head, New Lots, Utopia, Pomonok, Claremont Village, Mount Eden, Mount Hope, Manor Heights, Sandy Groun

# Finding the most similar neighborhood in Toronto

In [23]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikiResponse = rq.get(url)
soup = BeautifulSoup(wikiResponse.text,'lxml') 

items = []
rows = soup.table.find_all('tr')
for row in rows[1:]:
    cols = row.find_all('td')
    postal_code = cols[0].text
    borough = cols[1].text
    neighborhood = cols[2].text.strip()
    items.append([postal_code, borough, neighborhood])

In [24]:
df = pd.DataFrame(items, columns=["PostalCode", "Borough", "Neighborhood"])
# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
df.drop(df[df['Borough']=='Not assigned'].index, inplace=True) # Drop inplace
df.reset_index(inplace=True, drop=True)
df['Neighborhood'].where(df['Neighborhood']!='Not assigned', df['Borough'], inplace=True)
df = df.groupby(by=["PostalCode", "Borough"], as_index=False).aggregate(lambda neighborhoods: ", ".join(set(neighborhoods)))
df.reset_index(inplace=True, drop=True)
df_coord = pd.read_csv('https://cocl.us/Geospatial_data')
# Match the column names (Postal Code -> Post)
df_coord.rename(index=str, columns={"Postal Code":"PostalCode"},inplace=True)

df_coord.head(10)
df_geo = df.merge(df_coord, how='inner', on='PostalCode')


In [25]:
toronto_venues = getNearbyVenues(names=df_geo['Neighborhood'],
                                   latitudes=df_geo['Latitude'],
                                   longitudes=df_geo['Longitude'])

In [26]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,American Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Belgian Restaurant,Bistro,Brazilian Restaurant,...,Sushi Restaurant,Taco Place,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint
0,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Morningside, Guildwood, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Morningside, Guildwood, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Morningside, Guildwood, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Woburn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,American Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Belgian Restaurant,Bistro,Brazilian Restaurant,...,Sushi Restaurant,Taco Place,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint
0,"Adelaide, Richmond, King",0.0,0.048387,0.048387,0.0,0.0,0.032258,0.0,0.0,0.016129,...,0.032258,0.0,0.0,0.0,0.0,0.064516,0.0,0.016129,0.0,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Berczy Park,0.0,0.0,0.0,0.0,0.034483,0.068966,0.034483,0.034483,0.0,...,0.0,0.0,0.0,0.0,0.034483,0.034483,0.0,0.034483,0.0,0.0


### Finding 10 most common restaurants of each neighborhoods

In [41]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Restaurants'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Restaurants'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Restaurants,2nd Most Common Restaurants,3rd Most Common Restaurants,4th Most Common Restaurants,5th Most Common Restaurants,6th Most Common Restaurants,7th Most Common Restaurants,8th Most Common Restaurants,9th Most Common Restaurants,10th Most Common Restaurants
0,"Adelaide, Richmond, King",Coffee Shop,Café,Steakhouse,Thai Restaurant,American Restaurant,Asian Restaurant,Restaurant,Burger Joint,Salad Place,Pizza Place
1,Agincourt,Breakfast Spot,Wings Joint,Fast Food Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant
2,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Sandwich Place,Wings Joint,Cupcake Shop,Dessert Shop,Dim Sum Restaurant,Diner,Doner Restaurant,Donut Shop
3,Bayview Village,Chinese Restaurant,Japanese Restaurant,Café,Wings Joint,Falafel Restaurant,Dim Sum Restaurant,Diner,Doner Restaurant,Donut Shop,Dumpling Restaurant
4,Berczy Park,Coffee Shop,Bakery,Café,Steakhouse,Seafood Restaurant,Breakfast Spot,Greek Restaurant,Eastern European Restaurant,Italian Restaurant,Diner


## Let's say we want to find out the neighborhood in Toronto that is most similar to Battery Park City, NY

### Feature vector of Battery Park City

In [34]:
nyc_grouped[nyc_grouped['Neighborhood']=='Battery Park City']

nyc_features = nyc_grouped.loc[10,cols].drop('Neighborhood')
print(nyc_features)

Afghan Restaurant                          0
American Restaurant                0.0263158
Asian Restaurant                           0
BBQ Joint                          0.0526316
Bagel Shop                                 0
Bakery                             0.0263158
Bistro                             0.0263158
Brazilian Restaurant                       0
Breakfast Spot                             0
Bubble Tea Shop                            0
Burger Joint                       0.0526316
Burrito Place                      0.0263158
Cafeteria                                  0
Café                                       0
Cajun / Creole Restaurant                  0
Caribbean Restaurant                       0
Chinese Restaurant                 0.0263158
Coffee Shop                         0.184211
Colombian Restaurant                       0
Comfort Food Restaurant                    0
Creperie                                   0
Cuban Restaurant                           0
Cupcake Sh

## Cosine Similarity

### Calculating Cosine Similarity

In [37]:
cols = nyc_grouped.columns.intersection(toronto_grouped.columns)

from scipy.spatial.distance import cosine


cossim= []
for i in range(len(toronto_grouped)):
    cossim.append(1-cosine(nyc_features.values, toronto_grouped.loc[i,cols].drop('Neighborhood').values))


df3 = pd.merge(df_geo, toronto_grouped, on='Neighborhood', how='inner')

In [38]:
df3.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Afghan Restaurant,American Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,...,Sushi Restaurant,Taco Place,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1E,Scarborough,"Morningside, Guildwood, West Hill",43.763573,-79.188711,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1G,Scarborough,Woburn,43.770992,-79.216917,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0
4,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Top 5 Closest Neighborhood using Cosine Similarity

In [39]:
df3['cossim'] = cossim

top5 = df3[['Neighborhood','cossim']].sort_values(by='cossim', ascending=False).head(6)
top5




Unnamed: 0,Neighborhood,cossim
70,"Swansea, Runnymede",0.792118
10,"Maryvale, Wexford",0.784633
49,"Victoria Hotel, Commerce Court",0.772061
69,"Roncesvalles, Parkdale",0.72337
52,"Yorkville, North Midtown, The Annex",0.702311
39,"St. James Town, Cabbagetown",0.693559
