# Comparison of Businesses in Toronto and New York City

## Introduction

The cities of Toronto, Canada and New York City, United States are both the largest financial regions of their respective countries. 6.9% of Canada's population lives inside Toronto, and 2.5% of the United States population live in New York City. Also taking into account the amount of citizens commuting to Toronto or New York City, and it is easy to see why many businesses see the two areas as goldmines. However, if one were to open up a new business, which city would be best to open that business up in based on other openings of similar businesses in the area, and which neighborhood/borough of their respective cities would be best to open the business in. The outcome of this data is important as it drives prospective small business owners/corporations to have a nice idea of where their business can thrive.

## Data

The data we will be using will consist of a json file courtesy of cocl.us containing boroughs/neighborhoods of New York City with longitude and latitude coordinates. We will also have borough/neighborhood data courtesy for Toronto courtesy of Wikipedia. We will also receive a csv containing postal code latitude and longitude points for Toronto courtesy of cocl.us. Finally, for the bulk of the information, we will be using data from Foursquare in order to see businesses within the two cities. In summary, the data we will be using will contain boroughs/neighborhoods of each city, latitude/longitude coordinates for each borough, and samples of businesses within 500 meters from each borough.

## Methodology

First, we load the data and prepare data frames. After loading the data, the first few rows of each are as follows:

In [2]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library



Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    geopy-1.20.0               |             py_0          57 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.50-py_0         conda-forge
    geopy:           1.20.0-py_0       conda-forge

The following packages will be UPDATED:

    ca-

In [7]:
Canada = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M','r')

TorontoData = pd.DataFrame(columns = ['Postcode','Borough','Neighborhood'])
TorontoData = TorontoData.append(Canada)

!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
neighborhoods_data = newyork_data['features']
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 
neighborhoods = pd.DataFrame(columns=column_names)

for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [6]:
TorontoData.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Notice that while the New York dataset already comes relatively clean, we need to improve the Toronto dataset. We clean this dataset by removing any rows with missing boroughs, replacing any missing neighborhoods with their borough name, and combining matching postcodes into one row of data. Finally, we load the latitude/longitude coordinates for the Toronto data and append them to the data. The first few rows of the Toronto data are now:

In [9]:
TorontoData['Borough'].replace('Not assigned', np.NaN, inplace = True)
TorontoData.dropna(axis = 0, inplace = True)
TorontoData.reset_index(inplace = True)
TorontoData.drop(['index'], axis = 1)

TorontoData.replace('Not assigned','Queen\'s Park', inplace = True)
TorontoData.drop(['index'], axis = 1, inplace = True)

ndf = pd.DataFrame(columns = ['Postcode','Borough','Neighborhoods'])
postcode = 'M3A'
for i in list(range(len(TorontoData[['Postcode']])-7)):
    if postcode not in ndf['Postcode']:
        postcode = TorontoData['Postcode'].iloc[i]
        nlist = [postcode,TorontoData.iloc[i,2]]
        town = []
        for j in range(TorontoData['Postcode'].value_counts().loc[postcode]):
            if TorontoData['Postcode'].iloc[i+j] == postcode:
                town.append(TorontoData['Neighborhood'].iloc[i+j])
        nlist.append(', '.join(town))
        ndata = pd.DataFrame([[nlist[0],nlist[1],nlist[2]]],columns = ['Postcode','Borough','Neighborhoods'])
        ndf = ndf.append(ndata)
        i = i + j

ndf.reset_index(inplace = True)
droplist = []
for i in range((len(ndf[['Postcode']])-7)):
    if i > 0 and ndf['Postcode'].iloc[i] == ndf['Postcode'].iloc[i-1]:
        droplist.append(i)
ndf.drop(droplist, axis = 0, inplace = True)
ndf.set_index('Postcode', inplace = True)

ndf.drop(['index'], axis = 1, inplace = True)

ll = pd.read_csv('https://cocl.us/Geospatial_data','r')

coor = []
for i in range(ll.size):
    coor.append(ll.iloc[i,0].split(','))
coordinates = pd.DataFrame(coor,columns = ['Postcode','Latitude','Longitude'])
coordinates.set_index('Postcode', inplace = True)
coordinates.head()

Toronto = ndf.join(coordinates)
Toronto.head()

Unnamed: 0_level_0,Borough,Neighborhoods,Latitude,Longitude
Postcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M1B,Rouge,"Rouge, Malvern",43.8066863,-79.1943534
M1C,Highland Creek,"Highland Creek, Rouge Hill, Port Union",43.7845351,-79.1604971
M1E,Guildwood,"Guildwood, Morningside, West Hill",43.7635726,-79.1887115
M1G,Woburn,Woburn,43.7709921,-79.2169174
M1H,Cedarbrae,Cedarbrae,43.773136,-79.2394761


With the Toronto and New York datasets cleaned up, we now use the package folium to get a view of each borough for both Toronto and New York city.

In [10]:
Toronto[['Latitude','Longitude']] = Toronto[['Latitude','Longitude']].apply(pd.to_numeric, errors = 'coerce')

map_toronto = folium.Map(location=[43.70011, -79.4163], zoom_start=10)

for lat, lng, borough, neighborhood in zip(Toronto['Latitude'], Toronto['Longitude'], Toronto['Borough'], Toronto['Neighborhoods']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [11]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

Now that the boroughs are established, we are ready to cluster the data.

In [22]:
Toronto.rename(columns = {'Neighborhoods':'Neighborhood'},inplace = True)

In [27]:
AlltheData = Toronto.append(neighborhoods,sort = True)
AlltheData.reset_index(inplace = True)

In [33]:
LIMIT = 100
radius = 500

In [31]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    
    
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [40]:
venues = getNearbyVenues(names=AlltheData['Neighborhood'],
                                   latitudes=AlltheData['Latitude'],
                                   longitudes=AlltheData['Longitude']
                                  )


Rouge, Malvern
Highland Creek, Rouge Hill, Port Union
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
Clairlea, Golden Mile, Oakridge
Cliffcrest, Cliffside, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Clarks Corners, Sullivan, Tam O'Shanter
Agincourt North, L'Amoreaux East, Milliken, Steeles East
L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens, Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West, 

In [45]:
print(venues.shape)
venues.head()

(12449, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant


In [47]:
venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",100,100,100,100,100,100
Agincourt,5,5,5,5,5,5
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",2,2,2,2,2,2
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",9,9,9,9,9,9
"Alderwood, Long Branch",9,9,9,9,9,9
Allerton,34,34,34,34,34,34
Annadale,8,8,8,8,8,8
Arden Heights,4,4,4,4,4,4
Arlington,4,4,4,4,4,4
Arrochar,19,19,19,19,19,19


In [48]:
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
onehot['Neighborhood'] = venues['Neighborhood']
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

In [50]:
onehot.shape

(12449, 459)

In [51]:
venues_grouped = onehot.groupby('Neighborhood').mean().reset_index()

In [52]:
venues_grouped.shape

(403, 459)

In [53]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [55]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = venues_grouped['Neighborhood']

for ind in np.arange(venues_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(venues_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Steakhouse,Bar,Burger Joint,Salad Place,Asian Restaurant,Sushi Restaurant,Bakery,Restaurant
1,Agincourt,Breakfast Spot,Lounge,Skating Rink,Clothing Store,Latin American Restaurant,Eastern European Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Ethiopian Restaurant
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Playground,Park,Falafel Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Ethiopian Restaurant,Event Service
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Beer Store,Liquor Store,Pizza Place,Sandwich Place,Fried Chicken Joint,Fast Food Restaurant,Pharmacy,Event Service,Event Space
4,"Alderwood, Long Branch",Pizza Place,Pub,Athletics & Sports,Coffee Shop,Sandwich Place,Pharmacy,Skating Rink,Gym,Egyptian Restaurant,Electronics Store


In [56]:
kclusters = 5

venues_grouped_clustering = venues_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(venues_grouped_clustering)

In [71]:
venues_merged = AlltheData

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
venues_merged = venues_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [72]:
venues_merged.loc[venues_merged['Cluste Labels'] == 0, venues_merged.columns[[1] + list(range(5, venues_merged.shape[1]))]]

Unnamed: 0,Latitude,ClusterLabels,Cluste Labels,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,43.784535,0.0,0.0,0.0,Bar,Women's Store,Farm,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Ethiopian Restaurant,Event Service,Event Space
3,43.770992,0.0,0.0,0.0,Coffee Shop,Indian Restaurant,Korean Restaurant,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Ethiopian Restaurant
4,43.773136,0.0,0.0,0.0,Bakery,Hakka Restaurant,Bank,Thai Restaurant,Fried Chicken Joint,Gas Station,Caribbean Restaurant,Athletics & Sports,Event Service,Exhibit
5,43.744734,0.0,0.0,0.0,Spa,Playground,Women's Store,Falafel Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Ethiopian Restaurant
6,43.727929,0.0,0.0,0.0,Discount Store,Convenience Store,Department Store,Coffee Shop,Hobby Shop,Chinese Restaurant,Curling Ice,Cycle Studio,Electronics Store,Empanada Restaurant
7,43.711112,0.0,0.0,0.0,Bus Line,Bakery,Park,Intersection,Fast Food Restaurant,Soccer Field,Bus Station,Eye Doctor,Exhibit,Event Space
8,43.716316,0.0,0.0,0.0,Motel,American Restaurant,Women's Store,Farm,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Ethiopian Restaurant,Event Service
9,43.692657,0.0,0.0,0.0,Skating Rink,College Stadium,Café,General Entertainment,Women's Store,Factory,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant
10,43.75741,0.0,0.0,0.0,Indian Restaurant,Vietnamese Restaurant,Pet Store,Chinese Restaurant,Cuban Restaurant,Falafel Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant
11,43.750072,0.0,0.0,0.0,Middle Eastern Restaurant,Breakfast Spot,Bakery,Shopping Mall,Auto Garage,Sandwich Place,Falafel Restaurant,Electronics Store,Empanada Restaurant,English Restaurant


## Results