# My Capstone Project

The research question was where to open a new restaurant. Investors decided to enter the Toronto market with their own dining place. The main criteria were the absence of other restaurants, some sights nearby, and the availability of city transport.

### Problem
Find a place to open a new restaurant.

### Criteria
1. Absence of other restaurants.
2. Availability of transport.
3. Sights nearby.

### Resources
1. Database of Toronto neighborhoods.
2. Foursquare API.
3. IBM Watson Studio.

### Results
Neighborhoods that meet the criteria.

#### 1. Preparations

Let's import the needed tools.

In [51]:
import numpy as np
import pandas as pd
import requests
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

!pip install folium
import folium



#### 2. Loading Toronto neighborhoods database.

Using pandas, let's load the postal code table from Wikipedia.
I also cleaned it and attached geospatial data (latitude and longitude for each postal code).

In [52]:
table = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

toronto_data = pd.DataFrame(table[0])

toronto_data.drop(toronto_data[toronto_data['Borough']=='Not assigned'].index, inplace=True)
mask = toronto_data['Neighborhood'] == 'Not assigned'
toronto_data.loc[mask,'Neighborhood'] = toronto_data['Borough']
geo_data = pd.read_csv("https://cocl.us/Geospatial_data")
#toronto_data.head()
toronto_data = pd.merge(toronto_data, geo_data[['Postal Code','Latitude', 'Longitude']], left_on='Postalcode', right_on='Postal Code', how='left')
del toronto_data['Postal Code']
toronto_data.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
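
The cleaning rule used above (drop rows without a borough, and fall back to the borough name where the neighborhood is missing) is easier to check on a tiny table. The sketch below uses made-up rows, not the real Wikipedia data, and the name `toy` is only for illustration:

```python
import pandas as pd

# Made-up rows that mimic the layout of the Wikipedia table.
toy = pd.DataFrame({
    'Postalcode': ['M1A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Regent Park', 'Not assigned'],
})

# Drop postal codes that have no borough.
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Where the neighborhood is missing, fall back to the borough name.
missing = toy['Neighborhood'] == 'Not assigned'
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']
print(toy)
```

Only the rows for M5A and M7A remain, and the M7A row takes its neighborhood name from the borough.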


#### 3. Assigning parameters for the Foursquare API.

In [53]:
RADIUS=200
CLIENT_ID = 'JZ3KDNPBMKNWBVOJ2QZU1LQGH4WGRYYYGYEV3SDOKQ0TZA4T' # your Foursquare ID
CLIENT_SECRET = 'K0Y05PEDAJ2ZESIW4UBHQXTGX3CXTLDRQN132XPUU3Q1M0KS' # your Foursquare Secret
VERSION = '20200101'
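
Credentials written directly in a cell are easy to leak when the notebook is shared. One possible alternative, shown here only as a sketch, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are hypothetical and not part of the original notebook:

```python
import os


def load_credentials(env=os.environ):
    """Read Foursquare credentials from environment variables.

    The variable names are hypothetical; use whatever your setup defines.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set')
    return client_id, client_secret
```

With the variables set, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would replace the two hard-coded assignments.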

#### 4. Extracting data from Foursquare for all neighborhoods.

In [54]:
toronto_venues = pd.DataFrame()

for index, row in toronto_data.iterrows():

    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET,
        row['Latitude'], 
        row['Longitude'], 
        VERSION,
        RADIUS)
    
    results = requests.get(url).json()
    venues = results['response']['venues']
    dataframe = json_normalize(venues)
    toronto_venues = pd.concat([toronto_venues,dataframe],sort=False)
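
The loop above builds the query string by hand and grows the DataFrame on every iteration. A sketch of the same idea with the parameters kept in a dict and the per-neighborhood frames collected in a list is shown below. The helpers `build_params` and `venues_to_frame` and the sample response are made up for illustration, and no network call is made:

```python
import pandas as pd

SEARCH_URL = 'https://api.foursquare.com/v2/venues/search'


def build_params(lat, lng, client_id, client_secret, version, radius):
    """Query parameters for one venue search (same fields as the URL above)."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'radius': radius,
    }


def venues_to_frame(results):
    """Flatten the 'venues' list of one JSON response into a DataFrame."""
    return pd.json_normalize(results['response']['venues'])


# Made-up response with the same nesting as the one parsed above.
sample = {'response': {'venues': [
    {'id': 'a1', 'name': 'Some Park', 'categories': [{'name': 'Park'}],
     'location': {'lat': 43.75, 'lng': -79.33}},
]}}

frames = [venues_to_frame(sample)]  # in a real run: one frame per neighborhood
all_venues = pd.concat(frames, ignore_index=True, sort=False)
print(all_venues.columns.tolist())
```

In a real run, each response would come from `requests.get(SEARCH_URL, params=build_params(...)).json()`, and a single `pd.concat` at the end avoids copying the growing table on every iteration.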

Next, some cleaning of the raw venue data.

In [55]:
filtered_columns = ['name', 'categories'] + [col for col in toronto_venues.columns if col.startswith('location.')] + ['id']
toronto_venues_f = toronto_venues.loc[:, filtered_columns].copy()  # copy, so later column assignments do not warn

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
toronto_venues_f['categories'] = toronto_venues_f.apply(get_category_type, axis=1)

# clean column names by keeping only last term
toronto_venues_f.columns = [column.split('.')[-1] for column in toronto_venues_f.columns]

toronto_venues_f.head()

Unnamed: 0,name,categories,address,cc,city,country,crossStreet,distance,formattedAddress,labeledLatLngs,lat,lng,neighborhood,postalCode,state,id
0,TTC stop #8380,Bus Stop,Underhill Dr,CA,Toronto,Canada,At Cassandra N,273,"[Underhill Dr (At Cassandra N), Toronto ON, Ca...","[{'label': 'display', 'lat': 43.752672, 'lng':...",43.752672,-79.326351,,,ON,4e42684718a8627fce453c01
1,Brookbanks Park,Park,Toronto,CA,Toronto,Canada,,245,"[Toronto, Toronto ON, Canada]","[{'label': 'display', 'lat': 43.75197604605557...",43.751976,-79.33214,,,ON,4e8d9dcdd5fbbbb6b3003c7b
2,GTA Restoration | Emergency Water Damage Plumb...,Construction & Landscaping,250 Yonge St,CA,Toronto,Canada,401 & DVP,1741,"[250 Yonge St (401 & DVP), Toronto ON M5B 2L7,...","[{'label': 'display', 'lat': 43.7535666482373,...",43.753567,-79.351308,,M5B 2L7,ON,535fddb1498e03814e03968f
3,Toronto International College,College Communications Building,3550,CA,Toronto,Canada,McNoli Avenue,690,"[3550 (McNoli Avenue), Toronto ON, Canada]","[{'label': 'display', 'lat': 43.75053088657950...",43.750531,-79.337367,,,ON,51d85ca6498ea979a4d0f0c7
4,Yorkmills Wellness & Spa,Spa,25 Lesmill Road Suite 200,CA,North York,Canada,,524,"[25 Lesmill Road Suite 200, North York ON, Can...","[{'label': 'display', 'lat': 43.75680029671985...",43.7568,-79.325346,,,ON,54ee51de498e7a6fbe4f00a7


#### 5. Removing duplicates.

Because the search areas of nearby neighborhoods overlap, the same venue can be returned more than once. We can remove these duplicates with pandas.

In [56]:
print('Total number of venues',toronto_venues_f.shape[0])
print('Unique venues',toronto_venues_f['name'].nunique())

Total number of venues 3090
Unique venues 2792


In [57]:
toronto_venues_f.drop_duplicates(subset=['id'],inplace=True)
toronto_venues_f.dropna(subset=['postalCode'],inplace=True)
print('Number of unique venues with postal codes',toronto_venues_f.shape)

Number of unique venues with postal codes (1347, 16)
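
The two steps above behave as in the following sketch, which uses made-up rows. Deduplicating on `id` rather than `name` matters, because different branches of a chain share a name but have different ids:

```python
import pandas as pd

# Made-up rows: venue 'a1' was returned for two neighboring search points.
toy = pd.DataFrame({
    'id': ['a1', 'a1', 'b2', 'c3'],
    'name': ['Tim Hortons', 'Tim Hortons', 'Tim Hortons', 'Some Park'],
    'postalCode': ['M4A 1J8', 'M4A 1J8', 'M3M 1A1', None],
})

deduped = toy.drop_duplicates(subset=['id'])        # keeps the first row per id
with_codes = deduped.dropna(subset=['postalCode'])  # drops venues without a postal code

print(len(toy), len(deduped), len(with_codes))  # 4 3 2
```

Both Tim Hortons branches survive because their ids differ, while the venue without a postal code is dropped.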


#### 6. Grouping categories.

There are lots of different categories in the Foursquare data. They should be consolidated into a few groups for further analysis. Let's export the categories to a CSV file, group them manually, and load the new file back into the project.

In [58]:
categories_group = pd.DataFrame(toronto_venues_f['categories'].unique())
categories_group.rename(columns={0:'categories'},inplace=True)
categories_group.to_csv('categories.csv')

In [59]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,categories,categories_group
0,Construction & Landscaping,Other
1,Bus Stop,Transport
2,Residential Building (Apartment / Condo),Apartments
3,Sandwich Place,Restaurants
4,Caribbean Restaurant,Restaurants
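
The loading cell is hidden, so the following is only a guess at its shape. It is a sketch that assumes the manually grouped file is a CSV with the columns `categories` and `categories_group`. The in-memory text stands in for that file, and the name `categories_group_demo` is made up:

```python
import io
import pandas as pd

# Stand-in for the manually edited file; rows copied from the output above.
grouped_csv = io.StringIO(
    'categories,categories_group\n'
    'Construction & Landscaping,Other\n'
    'Bus Stop,Transport\n'
    'Sandwich Place,Restaurants\n'
)

categories_group_demo = pd.read_csv(grouped_csv)

# Each category should map to exactly one group before it is merged.
assert categories_group_demo['categories'].is_unique
print(categories_group_demo)
```

In a real run, a file path or storage object would be passed to `pd.read_csv` instead of the in-memory text.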


Now let's attach the category groups to the main venue table.

In [60]:
toronto_venues_f=pd.merge(toronto_venues_f[['name','categories','postalCode','lat','lng']],
                           categories_group,
                           left_on='categories',
                           right_on='categories',
                           how='left')
toronto_venues_f.head()

Unnamed: 0,name,categories,postalCode,lat,lng,categories_group
0,GTA Restoration | Emergency Water Damage Plumb...,Construction & Landscaping,M5B 2L7,43.753567,-79.351308,Other
1,Bruno's valu-mart,Grocery Store,M3A 2P5,43.746143,-79.32463,Shops
2,CAPREIT Apartments,Residential Building (Apartment / Condo),M3A 1S6,43.75392,-79.3224,Apartments
3,CAPREIT Apartments,Residential Building (Apartment / Condo),M3A 1Y6,43.75318,-79.33897,Apartments
4,D2 Designs,Coworking Space,M1P 4V4,43.754568,-79.332035,Other


In [61]:
toronto_venues_f['categories_group'] = toronto_venues_f['categories_group'].fillna('Other')
print('Number of ungrouped categories',toronto_venues_f['categories_group'].isna().sum())

Number of ungrouped categories 0


#### 7. Binarizing categories.

To use a clustering method, the categories must be binarized. That's why I created a separate binarized table and then attached it to the main venue table.

I used only a few category groups for the analysis: Apartments, Restaurants, Shops, Sights and Transport. Three of them (Restaurants, Sights and Transport) come from the criteria. The other two (Apartments and Shops) are just a bonus for the recommendation.

In [62]:
categories=['Shops','Restaurants','Sights','Apartments','Transport']
pd_category = pd.get_dummies(categories)
pd_category['categories_group'] = categories
pd_category.head()

Unnamed: 0,Apartments,Restaurants,Shops,Sights,Transport,categories_group
0,0,0,1,0,0,Shops
1,0,1,0,0,0,Restaurants
2,0,0,0,1,0,Sights
3,1,0,0,0,0,Apartments
4,0,0,0,0,1,Transport
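
As an alternative to building a lookup table and merging it, the indicator columns can be created directly from the venue table. The sketch below uses made-up venues, and the names `venues_demo` and `categories_demo` are only for illustration:

```python
import pandas as pd

categories_demo = ['Shops', 'Restaurants', 'Sights', 'Apartments', 'Transport']

# Made-up venues that already have a group assigned.
venues_demo = pd.DataFrame({
    'name': ['Subway', 'Eagle Bridge', 'D2 Designs'],
    'categories_group': ['Restaurants', 'Sights', 'Other'],
})

# One indicator column per group, then keep only the groups of interest.
dummies = pd.get_dummies(venues_demo['categories_group'], dtype=int)
dummies = dummies.reindex(columns=categories_demo, fill_value=0)

venues_demo = pd.concat([venues_demo, dummies], axis=1)
print(venues_demo)
```

With this variant, a venue from the `Other` group gets zeros in every indicator column instead of missing values.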


Now let's merge the venue table with the new category columns.

In [63]:
toronto_venues_f = pd.merge(toronto_venues_f, 
                              pd_category, 
                              left_on='categories_group', 
                              right_on='categories_group', 
                              how='left')
toronto_venues_f.head()

Unnamed: 0,name,categories,postalCode,lat,lng,categories_group,Apartments,Restaurants,Shops,Sights,Transport
0,GTA Restoration | Emergency Water Damage Plumb...,Construction & Landscaping,M5B 2L7,43.753567,-79.351308,Other,,,,,
1,Bruno's valu-mart,Grocery Store,M3A 2P5,43.746143,-79.32463,Shops,0.0,0.0,1.0,0.0,0.0
2,CAPREIT Apartments,Residential Building (Apartment / Condo),M3A 1S6,43.75392,-79.3224,Apartments,1.0,0.0,0.0,0.0,0.0
3,CAPREIT Apartments,Residential Building (Apartment / Condo),M3A 1Y6,43.75318,-79.33897,Apartments,1.0,0.0,0.0,0.0,0.0
4,D2 Designs,Coworking Space,M1P 4V4,43.754568,-79.332035,Other,,,,,


In [64]:
toronto_venues_f = toronto_venues_f.loc[toronto_venues_f['categories_group'].isin(categories)].copy()
toronto_venues_f

Unnamed: 0,name,categories,postalCode,lat,lng,categories_group,Apartments,Restaurants,Shops,Sights,Transport
1,Bruno's valu-mart,Grocery Store,M3A 2P5,43.746143,-79.324630,Shops,0.0,0.0,1.0,0.0,0.0
2,CAPREIT Apartments,Residential Building (Apartment / Condo),M3A 1S6,43.753920,-79.322400,Apartments,1.0,0.0,0.0,0.0,0.0
3,CAPREIT Apartments,Residential Building (Apartment / Condo),M3A 1Y6,43.753180,-79.338970,Apartments,1.0,0.0,0.0,0.0,0.0
5,Subway,Sandwich Place,M3A 1Z5,43.760334,-79.326906,Restaurants,0.0,1.0,0.0,0.0,0.0
8,Eagle Bridge,Bridge,M3A,43.750453,-79.332259,Sights,0.0,0.0,0.0,1.0,0.0
9,Fenside Avenue,Bus Stop,M3A 2V3,43.760582,-79.327640,Transport,0.0,0.0,0.0,0.0,1.0
12,Allwyn's Bakery,Caribbean Restaurant,M3A 1Z5,43.759840,-79.324719,Restaurants,0.0,1.0,0.0,0.0,0.0
16,Latvian Cultural Centre,Cultural Center,M4A 2N8,43.725677,-79.318248,Sights,0.0,0.0,0.0,1.0,0.0
17,Tim Hortons,Coffee Shop,M4A 1J8,43.725517,-79.313103,Shops,0.0,0.0,1.0,0.0,0.0
19,Pizza Nova,Pizza Place,M4A 1J8,43.725824,-79.312860,Restaurants,0.0,1.0,0.0,0.0,0.0


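Postal codes in the venue table are not in a uniform format: some are full codes (for example, M3A 2P5), while the neighborhood table uses only the first three characters. Let's keep only that first part.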

In [65]:
toronto_venues_f['postalCode'] = toronto_venues_f['postalCode'].str.split(' ').str[0]
toronto_venues_f[['postalCode']].head()

Unnamed: 0,postalCode
1,M3A
2,M3A
3,M3A
5,M3A
8,M3A


#### 8. Creating the final database.

Now we can count the venues grouped by postal code. This lets us analyze the neighborhoods knowing how many venues of each type are near each one.

In [66]:
venues_grouped = toronto_venues_f[['postalCode']+categories].groupby(['postalCode']).sum()
venues_grouped.head()

postalCode,Shops,Restaurants,Sights,Apartments,Transport
14225,1.0,0.0,0.0,0.0,0.0
CA,0.0,0.0,1.0,0.0,0.0
L3T,1.0,1.0,0.0,0.0,0.0
L4K,0.0,1.0,0.0,0.0,0.0
L4W,2.0,2.0,0.0,3.0,1.0


In [67]:
toronto_f = pd.merge(toronto_data, venues_grouped, left_on='Postalcode', right_on='postalCode',how='left').fillna(0)
toronto_f.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude,Shops,Restaurants,Sights,Apartments,Transport
0,M3A,North York,Parkwoods,43.753259,-79.329656,1.0,2.0,1.0,2.0,1.0
1,M4A,North York,Victoria Village,43.725882,-79.315572,1.0,1.0,1.0,0.0,0.0
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636,7.0,4.0,1.0,0.0,0.0
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763,10.0,1.0,0.0,0.0,1.0
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494,0.0,0.0,0.0,0.0,0.0
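
The count-then-merge step can also be sketched with `pd.crosstab`, which counts venues per postal code and group without needing indicator columns. The data and the `_demo` names below are made up for illustration:

```python
import pandas as pd

venues_demo = pd.DataFrame({
    'postalCode': ['M3A', 'M3A', 'M3A', 'M4A'],
    'categories_group': ['Restaurants', 'Restaurants', 'Sights', 'Shops'],
})
neighborhoods_demo = pd.DataFrame({
    'Postalcode': ['M3A', 'M4A', 'M7A'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', "Queen's Park"],
})

# Rows: postal codes, columns: groups, values: number of venues.
counts = pd.crosstab(venues_demo['postalCode'], venues_demo['categories_group'])

# A left merge keeps neighborhoods without any venue; their counts become 0.
merged = neighborhoods_demo.merge(
    counts, left_on='Postalcode', right_index=True, how='left').fillna(0)
print(merged)
```

As in the real table, a neighborhood with no venues (M7A here) ends up with zeros in every count column.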


#### 9. Creating a map of Toronto before clustering.

Let's visualize what we are dealing with.

In [73]:
latitude=43.651070
longitude=-79.347015
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_f['Latitude'], 
                                           toronto_f['Longitude'], 
                                           toronto_f['Borough'], 
                                           toronto_f['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### 10. Clustering Neighborhoods.

With data preparation finished, we can analyze the neighborhoods. Let's cluster them; I used 5 clusters.

In [69]:
kclusters = 5

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_f[['Latitude','Longitude']+categories])

toronto_f['Cluster']=kmeans.labels_
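
K-means relies on Euclidean distance, so columns with a larger numeric spread weigh more. Here the venue counts differ by whole units, while the coordinates differ by fractions of a degree. One common option, which the analysis above does not use, is to standardize the features first. The sketch below runs on made-up data, and `demo` is a hypothetical name:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up neighborhoods: coordinates vary by fractions of a degree,
# venue counts vary by whole units.
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    'Latitude': 43.7 + rng.uniform(-0.1, 0.1, 30),
    'Longitude': -79.4 + rng.uniform(-0.1, 0.1, 30),
    'Restaurants': rng.randint(0, 10, 30),
    'Sights': rng.randint(0, 3, 30),
})

# Rescale every column to zero mean and unit variance before clustering.
scaled = StandardScaler().fit_transform(demo)
labels = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(scaled)

demo['Cluster'] = labels
print(demo.groupby('Cluster').size())
```

Whether scaling helps depends on the goal: without it, the clusters are driven mostly by the venue counts, which may be exactly what is wanted for this question.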

As the summary below shows, clusters 0, 1, 2 and 4 contain many restaurants, so they do not suit us.
Cluster 3 has only 10 restaurants across 32 neighborhoods, along with 12 sights and 7 transport venues. I think this is the one we need.

In [70]:
print(toronto_f[['Cluster']+categories].groupby('Cluster').sum())
toronto_f.groupby('Cluster').describe()

         Shops  Restaurants  Sights  Apartments  Transport
Cluster                                                   
0         79.0         72.0    15.0         9.0        8.0
1         76.0         30.0     2.0         1.0        3.0
2         21.0         68.0     2.0         7.0        0.0
3         18.0         10.0    12.0         5.0        7.0
4        104.0         61.0     9.0        15.0        9.0


Cluster,Latitude count,Latitude mean,Latitude std,Latitude min,Latitude 25%,Latitude 50%,Latitude 75%,Latitude max,Longitude count,Longitude mean,...,Apartments 75%,Apartments max,Transport count,Transport mean,Transport std,Transport min,Transport 25%,Transport 50%,Transport 75%,Transport max
0,36.0,43.721337,0.054285,43.602414,43.683106,43.719261,43.764675,43.815252,36.0,-79.416735,...,0.0,2.0,36.0,0.222222,0.484686,0.0,0.0,0.0,0.0,2.0
1,8.0,43.7131,0.056032,43.648429,43.66505,43.711421,43.744591,43.799525,8.0,-79.379519,...,0.0,1.0,8.0,0.375,0.517549,0.0,0.0,0.0,1.0,1.0
2,8.0,43.663953,0.024247,43.644771,43.647739,43.651033,43.674139,43.70906,8.0,-79.366324,...,1.5,3.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,32.0,43.706044,0.04965,43.628947,43.670218,43.696146,43.732849,43.836125,32.0,-79.388752,...,0.0,1.0,32.0,0.21875,0.420013,0.0,0.0,0.0,0.0,1.0
4,19.0,43.684034,0.048937,43.605647,43.652915,43.669005,43.719325,43.7942,19.0,-79.394605,...,0.0,6.0,19.0,0.473684,0.964274,0.0,0.0,0.0,1.0,4.0


Let's look at the neighborhoods in cluster 3 that have no restaurants but do have at least one sight and one transport venue.

In [71]:
potential_place = toronto_f.loc[(toronto_f['Cluster']==3)&
                                  (toronto_f['Restaurants']==0)&
                                  (toronto_f['Sights']>0)&
                                  (toronto_f['Transport']>0)]
potential_place

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude,Shops,Restaurants,Sights,Apartments,Transport,Cluster
14,M4C,East York,Woodbine Heights,43.695344,-79.318389,2.0,0.0,2.0,0.0,1.0,3
53,M3M,North York,Downsview,43.728496,-79.495697,2.0,0.0,1.0,0.0,1.0,3
74,M5R,Central Toronto,The Annex / North Midtown / Yorkville,43.67271,-79.405678,0.0,0.0,1.0,1.0,1.0,3
101,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...,43.636258,-79.498509,0.0,0.0,1.0,1.0,1.0,3


Let's check exactly which venues are located in these neighborhoods.

In [100]:
#toronto_venues_f
pot_pl_venues = toronto_venues_f[toronto_venues_f['postalCode'].isin(['M4C', 'M3M', 'M5R', 'M8Y'])]
pot_pl_venues[['name','categories_group','postalCode']].sort_values(by='categories_group')

Unnamed: 0,name,categories_group,postalCode
958,The Annex,Apartments,M5R
1333,2 kinsdale blvd,Apartments,M8Y
197,The best backyard ever!,Shops,M4C
198,The Beer Store,Shops,M4C
717,Tim Hortons,Shops,M3M
720,Tim Hortons,Shops,M3M
59,Queen's Park,Sights,M5R
200,Stan Wadlow Park,Sights,M4C
208,Les Anthony Parkette,Sights,M4C
725,113 Tavistock Rd.,Sights,M3M


#### 11. Illustrating results.

Most of the sights are parks, but that can be promising: after a good walk in the fresh air, people will want a meal, and no restaurants were found near any of them. Each neighborhood also has a transport stop. So these neighborhoods should work. Let's check the results on a map (the recommended neighborhoods are highlighted).

In [72]:
map_clusters = folium.Map(location=[43.651070, -79.347015], 
                          zoom_start=11, 
                          #tiles='Stamen Toner',
                         )

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_f['Latitude'], 
                                  toronto_f['Longitude'], 
                                  toronto_f['Neighborhood'], 
                                  toronto_f['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
for lat, lon, poi, cluster in zip(potential_place['Latitude'], 
                                  potential_place['Longitude'], 
                                  potential_place['Neighborhood'], 
                                  potential_place['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=7,
        popup=label,
        color='black',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.9).add_to(map_clusters)

map_clusters