# Use machine learning to predict business start locations intelligently

## 1. Introduction 

Small businesses form a critical part of both developed and developing economies. Traditionally, small-business owners and entrepreneurs choose where to start a business based on experience or very limited market research, because of budget constraints or gaps in personal knowledge. Large companies usually handle this through a dedicated department staffed by professional analysts or data scientists, or outsource it to specialist vendors. To help small-business owners make an informed decision, especially about business location, this report introduces an approach powered by machine learning.

In [48]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier

from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans, DBSCAN
import requests
from bs4 import BeautifulSoup

## 2. Method

## 2.1 Data Collection

Several well-known providers offer location-based business data, including Google, Facebook and Foursquare. This report uses Foursquare location data, accessed through its web-based RESTful API, because it is free and its coverage grows every day.

All of the analysis is based on Toronto, Canada, so public borough and neighbourhood data is also extracted from a public Wikipedia page.

The data in this report therefore combines Foursquare location-based business data with Toronto borough and neighbourhood information.

#### Get and understand the data: public boroughs and neighbourhoods

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
soup_html = BeautifulSoup(source, 'lxml')

In [3]:
table_string = soup_html.find_all('table', 'wikitable sortable')[0]

In [4]:
pd_table = pd.read_html(str(table_string))[0]
pd_table.columns = pd_table.iloc[0]
pd_table = pd_table[1:]

#### Cleansing:

1. Delete rows whose Borough is "Not assigned".
2. Where the Neighbourhood is "Not assigned", use the Borough name as the Neighbourhood.
3. Combine the Neighbourhoods that share a postal code into one row, because the latitude and longitude are looked up per postal code.

In [9]:
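borough_assigned = pd_table[pd_table['Borough'] != 'Not assigned'].reset_index(drop=True)
# Fill only the Neighbourhood column, so the other columns (e.g. Postcode) stay untouched
unassigned = borough_assigned['Neighbourhood'] == 'Not assigned'
borough_assigned.loc[unassigned, 'Neighbourhood'] = borough_assigned.loc[unassigned, 'Borough']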

In [10]:
def join_array(arr):
    return ",".join(arr)

groupby_Postcode = borough_assigned.groupby('Postcode').agg({'Neighbourhood': join_array, 
                                                             'Borough': lambda x: list(set(x))[0]})

neighbourhoods = groupby_Postcode.reset_index()
print(neighbourhoods.shape)
neighbourhoods.head()

(103, 3)


| | Postcode | Neighbourhood | Borough |
|---|---|---|---|
| 0 | M1B | Rouge,Malvern | Scarborough |
| 1 | M1C | Highland Creek,Rouge Hill,Port Union | Scarborough |
| 2 | M1E | Guildwood,Morningside,West Hill | Scarborough |
| 3 | M1G | Woburn | Scarborough |
| 4 | M1H | Cedarbrae | Scarborough |


#### More data: longitude and latitude

In [86]:
geo_coordinates_url = 'http://cocl.us/Geospatial_data'
geo_coordinates = pd.read_csv(geo_coordinates_url)
print(geo_coordinates.shape)
geo_coordinates.head()

(103, 3)


| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


#### Combine the borough and neighbourhood data with latitude and longitude

In [89]:
neighbourhoods_with_lng_lat = pd.merge(neighbourhoods, geo_coordinates, left_on="Postcode", right_on='Postal Code')
neighbourhoods_with_lng_lat.drop(['Postal Code'], axis=1, inplace=True)
print(neighbourhoods_with_lng_lat.shape)
neighbourhoods_with_lng_lat.head()

(102, 5)


| | Postcode | Neighbourhood | Borough | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Rouge,Malvern | Scarborough | 43.806686 | -79.194353 |
| 1 | M1C | Highland Creek,Rouge Hill,Port Union | Scarborough | 43.784535 | -79.160497 |
| 2 | M1E | Guildwood,Morningside,West Hill | Scarborough | 43.763573 | -79.188711 |
| 3 | M1G | Woburn | Scarborough | 43.770992 | -79.216917 |
| 4 | M1H | Cedarbrae | Scarborough | 43.773136 | -79.239476 |
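Both input frames report 103 rows, yet the merged frame reports 102, which suggests that at least one postcode did not find a match (the default inner join silently drops unmatched keys). A minimal sketch of how such rows could be located, using small toy frames in place of the real data; the rows and coordinates below are illustrative only:

```python
import pandas as pd

# Toy stand-ins for `neighbourhoods` and `geo_coordinates` (illustrative values).
left = pd.DataFrame({"Postcode": ["M1B", "M1C", "M9Z"],
                     "Neighbourhood": ["Rouge,Malvern", "Highland Creek", "Example"]})
right = pd.DataFrame({"Postal Code": ["M1B", "M1C", "M1E"],
                      "Latitude": [43.80, 43.78, 43.76],
                      "Longitude": [-79.19, -79.16, -79.18]})

# An outer join with indicator=True adds a `_merge` column that records
# whether each row came from the left frame, the right frame, or both.
check = pd.merge(left, right, left_on="Postcode", right_on="Postal Code",
                 how="outer", indicator=True)

# Rows that are not marked "both" are the ones an inner join would drop.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Postcode", "Postal Code", "_merge"]])
```

Running the same check on the real frames before the inner merge would show which postcode is lost.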


### Specify and constrain the area to 'Toronto'

In [14]:
toronto_data = neighbourhoods_with_lng_lat[neighbourhoods_with_lng_lat['Borough'].str
                                                                                 .contains('Toronto')] \
                                                                                 .reset_index(drop=True)

In [97]:
# Override the filter above: keep every borough, not only those whose name contains 'Toronto'
toronto_data = neighbourhoods_with_lng_lat

In [98]:
print(toronto_data.shape)
toronto_data.head()

(102, 5)


| | Postcode | Neighbourhood | Borough | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Rouge,Malvern | Scarborough | 43.806686 | -79.194353 |
| 1 | M1C | Highland Creek,Rouge Hill,Port Union | Scarborough | 43.784535 | -79.160497 |
| 2 | M1E | Guildwood,Morningside,West Hill | Scarborough | 43.763573 | -79.188711 |
| 3 | M1G | Woburn | Scarborough | 43.770992 | -79.216917 |
| 4 | M1H | Cedarbrae | Scarborough | 43.773136 | -79.239476 |


Let's see this on the map.

In [99]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Toronto'

geolocator = Nominatim(user_agent='test')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


Let's view the neighborhoods on the map.

In [100]:
import folium # map rendering library

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Get and understand data from the Foursquare API

#### Define Foursquare Credentials and Version

In [94]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET:YOUR_FOURSQUARE_CLIENT_SECRET
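Credentials written directly into a notebook are easy to leak when the notebook is shared. One common alternative is to read them from environment variables. The sketch below assumes hypothetical variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; neither the names nor the helper `load_foursquare_credentials` come from the Foursquare API itself.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables.

    The variable names are hypothetical; use whatever names suit your setup.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)

# Report only that the values were found, never the values themselves.
print("CLIENT_ID loaded:", bool(client_id))
print("CLIENT_SECRET loaded:", bool(client_secret))
```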


In [95]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### Explore Neighborhoods in Toronto

#### Function to repeat the same process for all the neighborhoods in Toronto

In [96]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
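`getNearbyVenues` indexes `categories[0]` directly, which raises an `IndexError` for any venue that comes back without a category. Below is a sketch of a more defensive row extraction, assuming the response shape parsed above (`response`, then `groups[0]`, then `items`). The helper name `extract_venue_rows` and the sample payload are hypothetical and written by hand for illustration.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore-style JSON payload into a list of row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue has no category instead of raising.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written payload mimicking the structure parsed in getNearbyVenues.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "A", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Bar"}]}},
    {"venue": {"name": "B", "location": {"lat": 1.1, "lng": 2.1},
               "categories": []}},
]}]}}

rows = extract_venue_rows("Demo", 1.0, 2.0, sample)
print(rows)
```

Rows with a `None` category can then be dropped or kept explicitly before the one-hot encoding step.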

#### Now write the code to run the above function on each neighbourhood and create a new dataframe called *toronto_venues*.

In [102]:
LIMIT = 100

neighborhood_latitudes = toronto_data.loc[:, 'Latitude'] # neighborhood latitude value
neighborhood_longitudes = toronto_data.loc[:, 'Longitude'] # neighborhood longitude value
neighborhood_names = toronto_data.loc[:, 'Neighbourhood'] # neighborhood name

print(len(pd.unique(neighborhood_names)))

toronto_venues = getNearbyVenues(neighborhood_names, neighborhood_latitudes, neighborhood_longitudes)


102
Rouge,Malvern
Highland Creek,Rouge Hill,Port Union
Guildwood,Morningside,West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park,Ionview,Kennedy Park
Clairlea,Golden Mile,Oakridge
Cliffcrest,Cliffside,Scarborough Village West
Birch Cliff,Cliffside West
Dorset Park,Scarborough Town Centre,Wexford Heights
Maryvale,Wexford
Agincourt
Clarks Corners,Sullivan,Tam O'Shanter
Agincourt North,L'Amoreaux East,Milliken,Steeles East
L'Amoreaux West,Steeles West
Upper Rouge
Hillcrest Village
Fairview,Henry Farm,Oriole
Bayview Village
Silver Hills,York Mills
Newtonbrook,Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park,Don Mills South
Bathurst Manor,Downsview North,Wilson Heights
Northwood Park,York University
CFB Toronto,Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens,Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West,Riverdale
The 

#### Explore the new venues data

In [103]:
print(toronto_venues.shape)
toronto_venues.head()

(2204, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Rouge,Malvern | 43.806686 | -79.194353 | Wendy's | 43.807448 | -79.199056 | Fast Food Restaurant |
| 1 | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 | RIGHT WAY TO GOLF | 43.785177 | -79.161108 | Golf Course |
| 2 | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 | Royal Canadian Legion | 43.782533 | -79.163085 | Bar |
| 3 | Guildwood,Morningside,West Hill | 43.763573 | -79.188711 | Swiss Chalet Rotisserie & Grill | 43.767697 | -79.189914 | Pizza Place |
| 4 | Guildwood,Morningside,West Hill | 43.763573 | -79.188711 | G & G Electronics | 43.765309 | -79.191537 | Electronics Store |


#### Venue Category will be the main feature

In [104]:
toronto_venues.groupby('Neighborhood').count()

| Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Adelaide,King,Richmond | 100 | 100 | 100 | 100 | 100 | 100 |
| Agincourt | 4 | 4 | 4 | 4 | 4 | 4 |
| Agincourt North,L'Amoreaux East,Milliken,Steeles East | 2 | 2 | 2 | 2 | 2 | 2 |
| Albion Gardens,Beaumond Heights,Humbergate,Jamestown,Mount Olive,Silverstone,South Steeles,Thistletown | 10 | 10 | 10 | 10 | 10 | 10 |
| Alderwood,Long Branch | 10 | 10 | 10 | 10 | 10 | 10 |
| Bathurst Manor,Downsview North,Wilson Heights | 17 | 17 | 17 | 17 | 17 | 17 |
| Bayview Village | 4 | 4 | 4 | 4 | 4 | 4 |
| Bedford Park,Lawrence Manor East | 27 | 27 | 27 | 27 | 27 | 27 |
| Berczy Park | 54 | 54 | 54 | 54 | 54 | 54 |
| Birch Cliff,Cliffside West | 4 | 4 | 4 | 4 | 4 | 4 |


In [219]:
categories_num = len(toronto_venues['Venue Category'].unique())
print('There are {} unique categories.'.format(categories_num))

There are 273 unique categories.


## 3. Analyze Each Neighborhood

### 3.1 Feature engineering: dummy features based on venue category and venue counts

In [70]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

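# rename a venue category named 'Neighborhood', if present, so it does not clash with the column added below
toronto_onehot.rename(columns={'Neighborhood': 'Venue_Neighborhood'}, inplace=True)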

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

(2204, 274)


Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Rouge,Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Highland Creek,Rouge Hill,Port Union",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Highland Creek,Rouge Hill,Port Union",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Guildwood,Morningside,West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Guildwood,Morningside,West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Get the final features for each neighborhood

For each neighbourhood, the means of the category features sum to 1.
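The sum-to-1 property follows from the encoding: every venue row has exactly one 1 among the category columns, so the per-neighbourhood mean of each column is the share of venues in that category, and the shares add up to 1. A toy check, with made-up neighbourhood names and categories:

```python
import pandas as pd

# Four venues in two toy neighbourhoods (illustrative values only).
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Café", "Café", "Park", "Bar"],
})

# One-hot encode the category, then attach the neighbourhood label again.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="").astype(float)
onehot["Neighborhood"] = venues["Neighborhood"]

# The mean of each indicator column is the share of venues in that category.
grouped = onehot.groupby("Neighborhood").mean()
row_sums = grouped.sum(axis=1)

print(grouped)
print(row_sums)
```

Small floating-point deviations from exactly 1 are expected when the check is run on real data.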

In [262]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
print(toronto_grouped.shape)
toronto_grouped.head()

(98, 274)


Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Alderwood,Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [72]:
sum(toronto_grouped.iloc[0,1:].values)

1.0000000000000004

In [73]:
categories = toronto_grouped.columns[1:]
categories = categories.str.lower()
list(categories)

['accessories store',
 'adult boutique',
 'afghan restaurant',
 'airport',
 'airport food court',
 'airport gate',
 'airport lounge',
 'airport service',
 'airport terminal',
 'american restaurant',
 'antique shop',
 'aquarium',
 'arepa restaurant',
 'art gallery',
 'art museum',
 'arts & crafts store',
 'asian restaurant',
 'astrologer',
 'athletics & sports',
 'auto garage',
 'auto workshop',
 'bbq joint',
 'baby store',
 'bagel shop',
 'bakery',
 'bank',
 'bar',
 'baseball field',
 'baseball stadium',
 'basketball court',
 'basketball stadium',
 'beach',
 'beer bar',
 'beer store',
 'belgian restaurant',
 'bike shop',
 'bistro',
 'board shop',
 'boat or ferry',
 'bookstore',
 'boutique',
 'brazilian restaurant',
 'breakfast spot',
 'brewery',
 'bridal shop',
 'bubble tea shop',
 'burger joint',
 'burrito place',
 'bus line',
 'bus station',
 'bus stop',
 'butcher',
 'cafeteria',
 'café',
 'cajun / creole restaurant',
 'candy store',
 'caribbean restaurant',
 'cheese shop',
 'chinese

#### Explore the most common venues

In [74]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [75]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Café,Steakhouse,Thai Restaurant,American Restaurant,Hotel,Clothing Store,Gym,Bakery,Restaurant
1,Agincourt,Skating Rink,Clothing Store,Lounge,Breakfast Spot,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",Playground,Park,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",Pizza Place,Grocery Store,Beer Store,Fried Chicken Joint,Sandwich Place,Fast Food Restaurant,Coffee Shop,Pharmacy,Comic Shop,Event Space
4,"Alderwood,Long Branch",Pizza Place,Coffee Shop,Athletics & Sports,Pool,Pub,Sandwich Place,Skating Rink,Pharmacy,Gym,Gluten-free Restaurant


## 4. Model Creation 

If a small-business owner wants to open a new shop or restaurant, she or he can supply a few keywords.
From these keywords, a vector over the venue categories is generated. For example:

if the keyword is 'donut', the 'donut shop' entry is set to 1 in a vector of shape (1, 273).

This vector is treated as a fake neighborhood, and the clustering is run again to see which cluster it falls into.
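The idea described above, appending a query vector as an extra row and reading off its cluster label, can be sketched on toy data. The numbers below are invented for illustration, and `AgglomerativeClustering` is used with its default settings apart from `n_clusters`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Rows are neighbourhoods, columns are category shares (toy values).
profiles = np.array([
    [0.9, 0.1, 0.0],   # mostly category 0
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],   # mostly category 2
    [0.1, 0.0, 0.9],
])

# The "fake neighbourhood": all weight on category 0.
query = np.array([[1.0, 0.0, 0.0]])

# Cluster the real rows together with the query row.
stacked = np.vstack([profiles, query])
labels = AgglomerativeClustering(n_clusters=2).fit(stacked).labels_

# Real rows that share the query's label are the suggested neighbourhoods.
query_label = labels[-1]
matches = np.where(labels[:-1] == query_label)[0]
print(matches)
```

Because the query row takes part in the fit, it can influence how the real rows are grouped, which is worth keeping in mind when comparing against the clustering without the extra row.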

### Clustering without key words vector

In [245]:
from sklearn.cluster import AgglomerativeClustering

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)
kclusters = 10
cls = AgglomerativeClustering(n_clusters=kclusters).fit(toronto_grouped_clustering)

In [246]:
toronto_grouped_only_neighbour = toronto_grouped.loc[:,['Neighborhood']]
toronto_grouped_only_neighbour['Cluster Labels'] = cls.labels_

In [247]:
toronto_merged = pd.merge(toronto_data, toronto_grouped_only_neighbour, left_on='Neighbourhood', right_on='Neighborhood')
toronto_merged.drop('Neighborhood', axis=1,inplace=True)

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = pd.merge(toronto_merged, neighborhoods_venues_sorted, left_on='Neighbourhood', right_on='Neighborhood')
toronto_merged.drop('Neighborhood', axis=1,inplace=True)
toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Neighbourhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,"Rouge,Malvern",Scarborough,43.806686,-79.194353,4,Fast Food Restaurant,Yoga Studio,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
1,M1C,"Highland Creek,Rouge Hill,Port Union",Scarborough,43.784535,-79.160497,0,Golf Course,Bar,Yoga Studio,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
2,M1E,"Guildwood,Morningside,West Hill",Scarborough,43.763573,-79.188711,0,Medical Center,Mexican Restaurant,Rental Car Location,Breakfast Spot,Electronics Store,Pizza Place,Empanada Restaurant,Ethiopian Restaurant,Event Space,Diner
3,M1G,Woburn,Scarborough,43.770992,-79.216917,0,Coffee Shop,Korean Restaurant,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
4,M1H,Cedarbrae,Scarborough,43.773136,-79.239476,0,Athletics & Sports,Hakka Restaurant,Fried Chicken Joint,Bakery,Thai Restaurant,Lounge,Caribbean Restaurant,Bank,Drugstore,Doner Restaurant


In [248]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Clustering by adding example vector

In [388]:
# Example: 'pizza'. Assume we need to find a place to start a small pizza business.

keywords = ['pizza']
vector = np.zeros((1, categories_num))

for i, category in enumerate(categories):
    category_value = 0
    for keyword in keywords:
        if keyword in category:
            category_value = category_value + 1
    vector[0][i] = category_value

vec_sum = vector.sum()
vector = vector / vec_sum
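The cell above divides by `vec_sum`, which is zero when no category contains any of the keywords, so the vector would become all NaN. A sketch of the same substring-matching logic wrapped in a function with an explicit guard; the name `keywords_to_vector` and the demo category list are hypothetical:

```python
import numpy as np

def keywords_to_vector(keywords, categories):
    """Return a (1, n_categories) vector of normalised keyword matches.

    Each entry counts how many keywords occur as substrings of the
    lower-cased category name; the vector is then scaled to sum to 1.
    """
    counts = np.array([[sum(kw.lower() in cat.lower() for kw in keywords)
                        for cat in categories]], dtype=float)
    total = counts.sum()
    if total == 0:
        # Nothing matched: fail loudly instead of dividing by zero.
        raise ValueError("No venue category matches the given keywords")
    return counts / total

demo_categories = ["pizza place", "donut shop", "coffee shop"]
vec = keywords_to_vector(["pizza"], demo_categories)
print(vec)
```

Note that substring matching is broad: the keyword 'shop' would match both 'donut shop' and 'coffee shop' here and split the weight between them.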

#### Add the vector as the last row, as a fake neighborhood

In [389]:
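toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
vector_row = pd.DataFrame(vector, columns=toronto_grouped_clustering.columns)
toronto_grouped_clustering = pd.concat([toronto_grouped_clustering, vector_row], ignore_index=True)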

#### Cluster again to see which cluster the new vector falls into

In [390]:
kclusters = 10
cls = AgglomerativeClustering(n_clusters=kclusters).fit(toronto_grouped_clustering)

In [392]:
toronto_grouped_only_neighbour = toronto_grouped.loc[:,['Neighborhood']]
toronto_grouped_only_neighbour['Cluster Labels'] = cls.labels_[:-1]

In [393]:

same_cluster = toronto_grouped_only_neighbour[toronto_grouped_only_neighbour['Cluster Labels'] == cls.labels_[-1]]
print('This is the suggested neighborhood to start your business:')
print(same_cluster['Neighborhood'].values)

This is the suggested neighborhood to start your business:
['The Junction North,Runnymede']
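The clustering step returns an unordered set of neighbourhoods that share a label with the query vector. As an alternative (not what this notebook does), neighbourhoods could be ranked by how close their category profile is to the keyword vector. The sketch below uses plain Euclidean distance on toy data; the function name `rank_by_distance` and the sample frame are hypothetical.

```python
import numpy as np
import pandas as pd

def rank_by_distance(grouped, vector, top_n=3):
    """Rank neighbourhoods by Euclidean distance between their category
    profile and the keyword vector (smaller means more similar)."""
    features = grouped.drop(columns="Neighborhood").to_numpy(dtype=float)
    distances = np.linalg.norm(features - np.asarray(vector, dtype=float), axis=1)
    ranked = grouped[["Neighborhood"]].assign(Distance=distances)
    return ranked.sort_values("Distance").head(top_n).reset_index(drop=True)

# Toy frame with the same layout as toronto_grouped: a name column plus category shares.
toy_grouped = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "pizza place": [0.6, 0.1, 0.0],
    "coffee shop": [0.4, 0.9, 1.0],
})

ranking = rank_by_distance(toy_grouped, [[1.0, 0.0]])
print(ranking)
```

A ranking of this kind gives the owner several candidates in order, rather than a single cluster whose size depends on the chosen number of clusters.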


## 5. Verification

Check 'The Junction North,Runnymede':

In [396]:
toronto_merged = pd.merge(toronto_data, toronto_grouped_only_neighbour, left_on='Neighbourhood', right_on='Neighborhood')
toronto_merged.drop('Neighborhood', axis=1,inplace=True)

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = pd.merge(toronto_merged, neighborhoods_venues_sorted, left_on='Neighbourhood', right_on='Neighborhood')
toronto_merged.drop('Neighborhood', axis=1,inplace=True)
toronto_merged[toronto_merged['Neighbourhood'] == 'The Junction North,Runnymede'] 

Unnamed: 0,Postcode,Neighbourhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
79,M6N,"The Junction North,Runnymede",York,43.673185,-79.487262,3,Pizza Place,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store,College Stadium


In this neighborhood the most common venue is Pizza Place, which matches the keyword, so it is a good candidate location for starting a pizza business.