# The Battle of Neighborhoods
__Applied Data Science Capstone by IBM/Coursera__  

## Table of contents
* [Introduction: Business Problem](#introduction)
* [About Data](#data)
* [EDA](#analysis)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)


## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for an Italian restaurant. Specifically, this report is targeted at stakeholders interested in opening an Italian restaurant in Toronto, Canada.

Since there are lots of restaurants in Toronto, we will try to detect locations that are:
1. not already crowded with restaurants,
2. without Italian restaurants in the vicinity,
3. as close to the city center (Downtown Toronto) as possible, assuming that the first two conditions are met.

We will use our data science knowledge to generate a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly expressed so that the best possible final location can be chosen by stakeholders.

## About data <a name="data"></a>

For this problem we need to explore, segment, and cluster the neighborhoods in the city of Toronto. However, the neighborhood data is not readily available on the internet in a ready-to-use format.
For the Toronto neighborhood data, a Wikipedia page [https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M] has all the information I need to explore and cluster the neighborhoods in Toronto. I scraped the Wikipedia page, wrangled and cleaned the data, and then read it into a pandas dataframe so that it is in a structured format for the analysis.  
Toronto, the capital of the province of Ontario, is a major Canadian city along Lake Ontario’s northwestern shore. It's a dynamic metropolis with a core of soaring skyscrapers, all dwarfed by the iconic, free-standing CN Tower. Toronto also has many green spaces, from the orderly oval of Queen’s Park to 400-acre High Park and its trails, sports facilities and zoo.

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing restaurants in the neighborhood (any type of restaurant)
* number of and distance to Italian restaurants in the neighborhood, if any
* distance of neighborhood from city center

We use the postal-code neighborhoods of Toronto, located by their coordinates, as our candidate areas.

The following data sources will be needed to extract/generate the required information:
* for coordinates, we'll use a CSV file that maps each postal code to its latitude and longitude
* the number of restaurants and their type and location in every neighborhood will be obtained using the **Foursquare API**
* the coordinates of the city center will be obtained by geocoding with **geopy (Nominatim)**.

### Neighborhood Candidates

Let's first scrape the data for the city of Toronto using the Beautiful Soup and requests libraries for Python.
Then we'll clean the data.

In [1]:
# Importing Libraries
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import numpy as np
import geocoder # import geocoder
from geopy.geocoders import Nominatim
import folium # map rendering library
from pandas import json_normalize # transform JSON file into a pandas dataframe
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
from re import search
from folium import plugins
from folium.plugins import HeatMap
from math import cos, sin, atan2, sqrt,asin,pi
import pyproj

## Creating empty lists to store the three column values
__Each `td` element holds one cell of the table__  
__Strip the '\n' from every cell__  
__Stop at the first empty cell, once all required values are collected__ 

In [2]:
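# Three lists, one per table column: postal code, borough, neighborhood
p=[]
b=[]
n=[]
dict1={1:p,2:b,3:n}
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = bs(response.text, 'html.parser')
ind=1  # column (1-3) that the next cell belongs to
for i in soup.find_all('td'):
    cell = i.text.replace('\n','')
    if cell=='':
        break  # first empty cell marks the end of the postal-code table
    dict1[ind].append(cell)
    # cycle through the three columns
    if(ind<3):
        ind+=1
    else:
        ind=1

df=pd.DataFrame(dict1)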

In [3]:
## The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
## Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
df=df.rename(columns={1:'PostalCode',2:'Borough',3:'Neighborhood'})
df=df.replace('Not assigned',np.nan)
df.dropna(subset=['Borough'],inplace=True)
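
The cell-by-cell loop above relies on every table row having exactly three `td` cells. As a hedged alternative, the sketch below groups cells by row and then drops the 'Not assigned' boroughs. It uses only the standard library and a tiny hand-written HTML snippet; `SAMPLE_HTML`, `RowParser` and `records` are illustrative names that are not part of the notebook, and the snippet does not reproduce the real markup of the Wikipedia page.

```python
from html.parser import HTMLParser

# Hypothetical miniature of the postal-code table (not the real page markup).
SAMPLE_HTML = """
<table>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
<tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Collect the text of <td> cells, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


parser = RowParser()
parser.feed(SAMPLE_HTML)

# Keep only complete rows whose borough (second cell) is assigned.
records = [
    {"PostalCode": r[0], "Borough": r[1], "Neighborhood": r[2]}
    for r in parser.rows
    if len(r) == 3 and r[1] != "Not assigned"
]
print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame` if the pandas workflow of the notebook is preferred.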

In [4]:
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [5]:
print('Dataframe contains {} rows and {} columns'.format(df.shape[0],df.shape[1]))

Dataframe contains 103 rows and 3 columns


In [6]:
# Getting lat and long using another csv file

In [7]:
loc_df=pd.read_csv('Geospatial_Coordinates.csv')

In [8]:
df=pd.merge(df, loc_df, left_on='PostalCode',right_on='Postal Code')
df.drop('Postal Code',axis=1,inplace=True)

In [9]:
df.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


## EDA <a name="analysis"></a>

In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(len(df['Borough'].unique()),df.shape[0]))

The dataframe has 10 boroughs and 103 neighborhoods.


___Use geopy library to get the latitude and longitude values of Downtown Toronto.___  

In [11]:
## Downtown Toronto is considered center of toronto
# let's find center coordinates of toronto

address = 'Downtown Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
downtown_latitude = location.latitude
downtown_longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(downtown_latitude, downtown_longitude))


The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


In [12]:
# create map of Toronto Downtown using latitude and longitude values
map_canada = folium.Map(location=[downtown_latitude, downtown_longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_canada)  
    
map_canada

___Define Foursquare Credentials and Version___  

In [13]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET:YOUR_CLIENT_SECRET


In [14]:
df.loc[1, 'Neighborhood']

'Victoria Village'

In [15]:
neighborhood_latitude = df.loc[1, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df.loc[1, 'Longitude'] # neighborhood longitude value

neighborhood_name = df.loc[1, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Victoria Village are 43.725882299999995, -79.31557159999998.


In [16]:
# build the Foursquare explore request for this neighborhood
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET&v=20180605&ll=43.725882299999995,-79.31557159999998&radius=500&limit=100'
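
As a hedged alternative to assembling the query string with `str.format`, the sketch below builds the same kind of request URL with `urllib.parse.urlencode`, which percent-encodes the parameter values. The credentials are placeholders, the coordinates are the Victoria Village values rounded for readability, and `params` and `explore_url` are illustrative names.

```python
from urllib.parse import urlencode

# Placeholder credentials: substitute your own Foursquare keys.
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20180605",                 # API version date used in the notebook
    "ll": "43.7258823,-79.3155716",  # latitude,longitude of the neighborhood
    "radius": 500,                   # search radius in meters
    "limit": 100,                    # maximum number of venues returned
}

# safe="," keeps the comma between latitude and longitude unescaped.
explore_url = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")
print(explore_url)
```

With `requests`, the same dictionary can also be passed as `requests.get(base_url, params=params)`, which avoids building the string by hand.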

In [17]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f545a01bf216839f1b4d9b0'},
 'response': {'headerLocation': 'Bermondsey',
  'headerFullLocation': 'Bermondsey, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 6,
  'suggestedBounds': {'ne': {'lat': 43.7303823045, 'lng': -79.30935618239715},
   'sw': {'lat': 43.72138229549999, 'lng': -79.32178701760282}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4c633acb86b6be9a61268e34',
       'name': 'Victoria Village Arena',
       'location': {'lat': 43.72348055545508,
        'lng': -79.31563520925143,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.72348055545508,
          'lng': -79.31563520925143}],
        'distance': 267,
        'cc': 'CA',
        'country': 'Canada',
        'formatte

In [18]:
def get_category_type(row):
    # Return the name of the venue's first category, or None if it has none
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [19]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()



| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Victoria Village Arena | Hockey Arena | 43.723481 | -79.315635 |
| 1 | Tim Hortons | Coffee Shop | 43.725517 | -79.313103 |
| 2 | Portugril | Portuguese Restaurant | 43.725819 | -79.312785 |
| 3 | Eglinton Ave E & Sloane Ave/Bermondsey Rd | Intersection | 43.726086 | -79.31362 |
| 4 | Pizza Nova | Pizza Place | 43.725824 | -79.31286 |


In [20]:
## Let's get the venues around every neighborhood center
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    
    return(nearby_venues)
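
`getNearbyVenues` indexes `categories[0]` directly, which fails for a venue that comes back without any category. The sketch below shows one defensive way to flatten a single response. It is an illustration only: `extract_venues` and `mock` are hypothetical names, and `mock` is a hand-written dictionary shaped like the JSON shown earlier, not real API output.

```python
def extract_venues(payload, neighborhood):
    """Flatten one explore-style response into
    (neighborhood, name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((
            neighborhood,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written mock response (assumption, for illustration only).
mock = {"response": {"groups": [{"items": [
    {"venue": {"name": "Victoria Village Arena",
               "location": {"lat": 43.7235, "lng": -79.3156},
               "categories": [{"name": "Hockey Arena"}]}},
    {"venue": {"name": "Uncategorized Spot",
               "location": {"lat": 43.7250, "lng": -79.3130},
               "categories": []}},
]}]}}

rows = extract_venues(mock, "Victoria Village")
for row in rows:
    print(row)
```

An empty or malformed payload yields an empty list here, so one failed request would not stop the loop over all neighborhoods.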

In [22]:
Venues = getNearbyVenues(names=df['Neighborhood'],latitudes=df['Latitude'],longitudes=df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [23]:
print(Venues.shape)
Venues.head()

(2157, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Parkwoods | 43.753259 | -79.329656 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 1 | Parkwoods | 43.753259 | -79.329656 | Variety Store | 43.751974 | -79.333114 | Food & Drink Shop |
| 2 | Victoria Village | 43.725882 | -79.315572 | Victoria Village Arena | 43.723481 | -79.315635 | Hockey Arena |
| 3 | Victoria Village | 43.725882 | -79.315572 | Tim Hortons | 43.725517 | -79.313103 | Coffee Shop |
| 4 | Victoria Village | 43.725882 | -79.315572 | Portugril | 43.725819 | -79.312785 | Portuguese Restaurant |


In [24]:
def converter(x):
    if x=='Italian Restaurant':
        return x
    else:
        if search('Restaurant', x) or x in ['Diner','Steakhouse']:
            return 'Restaurant'
        else:
            return x
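
To make the grouping rule concrete, here is a small self-contained sketch that applies the same logic to a handful of category names seen in the outputs above. `to_restaurant_group`, `sample` and `grouped` are illustrative names; the function mirrors `converter` but is redefined so that the block runs on its own.

```python
import re
from collections import Counter


def to_restaurant_group(category):
    """Map a category name to 'Italian Restaurant', 'Restaurant', or itself."""
    if category == "Italian Restaurant":
        return category
    if re.search("Restaurant", category) or category in ("Diner", "Steakhouse"):
        return "Restaurant"
    return category


sample = ["Portuguese Restaurant", "Italian Restaurant", "Coffee Shop",
          "Diner", "Pizza Place", "Hockey Arena"]

# Count how many sample venues fall into each group.
grouped = Counter(to_restaurant_group(c) for c in sample)
print(grouped)
```

Note that categories such as 'Pizza Place' or 'Coffee Shop' are left untouched by this rule, so they are not counted as restaurants in the rest of the analysis.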

In [25]:
Venues['Venue Category'] = Venues['Venue Category'].apply(converter)
venues_rest = Venues[(Venues['Venue Category']=='Restaurant') | (Venues['Venue Category']=='Italian Restaurant')]

In [26]:
italian_loc = Venues.loc[Venues['Venue Category']=='Italian Restaurant',['Venue Latitude', 'Venue Longitude']]
italian_loc.reset_index(inplace=True,drop=True)
rest_loc = Venues.loc[Venues['Venue Category']=='Restaurant',['Venue Latitude', 'Venue Longitude']]
rest_loc.reset_index(inplace=True,drop=True)
restaurants = len(rest_loc)
italian_restaurants = len(italian_loc)
print('Total number of non-Italian restaurants:', restaurants)
print('Total number of Italian restaurants:', italian_restaurants)
print('Italian restaurants as a percentage of other restaurants: {:.2f}%'.format(italian_restaurants/restaurants * 100))

Total number of non-Italian restaurants: 483
Total number of Italian restaurants: 40
Italian restaurants as a percentage of other restaurants: 8.28%


___This is a good starting point, as the number of Italian restaurants is very low.___  

Looking good. We now have all the restaurants within 500 m of each neighborhood center in Toronto, and we know which ones are Italian restaurants! We also know exactly which restaurants are in the vicinity of every neighborhood candidate center.

This concludes the data gathering phase - we're now ready to use this data for analysis to produce the report on optimal locations for a new Italian restaurant!

## Methodology <a name="methodology"></a>

In this project we will direct our efforts on detecting areas of Toronto that have low restaurant density, particularly those with a low number of Italian restaurants. We will limit our analysis to the area ~6 km around the city center.

In the first step we collected the required **data: location and type (category) of every restaurant close to Downtown Toronto**. We have also **identified Italian restaurants** (according to Foursquare categorization).

The second step in our analysis will be the calculation and exploration of '**restaurant density**' across different areas near downtown. We will use **heatmaps** to identify a few promising areas close to downtown with a low number of restaurants in general (*and* no Italian restaurants in the vicinity) and focus our attention on those areas.

In the third and final step we will focus on the most promising areas and within those create **clusters of locations that meet some basic requirements** established in discussion with stakeholders: we will take into consideration locations with **no more than two restaurants within the 500-meter search radius**, and we want locations **without Italian restaurants within 2 kilometers**. We will present a map of all such locations but also create clusters (using **k-means clustering**) of those locations to identify general zones / neighborhoods / addresses which should be a starting point for final 'street level' exploration and search for an optimal venue location by stakeholders.

## Analysis <a name="analysis"></a>

Let's perform some advanced exploratory data analysis and derive some additional info from our raw data. First let's count the **number of restaurants in every area candidate**:

In [27]:
map_canada = folium.Map(location=[downtown_latitude, downtown_longitude], zoom_start=13)
folium.Marker([downtown_latitude, downtown_longitude], popup='Downtown Toronto').add_to(map_canada)
for lat, lng, category, neighborhood, venue in zip(venues_rest['Venue Latitude'], venues_rest['Venue Longitude'], venues_rest['Venue Category'], venues_rest['Neighborhood'], venues_rest['Venue']):
    label = '{} ,{}, {}'.format(venue ,neighborhood, 'Downtown Toronto')
    label = folium.Popup(label, parse_html=True)
    color = 'red' if category=='Italian Restaurant' else 'blue'
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=color,
        fill=True,
        fill_color = color,
        fill_opacity=0.7,).add_to(map_canada)  
map_canada

In [28]:
rest_count = venues_rest.groupby('Neighborhood')['Venue Category'].value_counts().unstack()
rest_count.index.name='Neighborhood'
rest_count.reset_index(inplace=True)
rest_table = pd.merge(rest_count,Venues[['Neighborhood','Neighborhood Latitude','Neighborhood Longitude']].drop_duplicates(),on='Neighborhood',how='left')
rest_table.rename(columns={'Neighborhood Latitude':'Latitude','Neighborhood Longitude':'Longitude'},inplace=True)
rest_table

| | Neighborhood | Italian Restaurant | Restaurant | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Agincourt | NaN | 1.0 | 43.794200 | -79.262029 |
| 1 | Bathurst Manor, Wilson Heights, Downsview North | NaN | 5.0 | 43.754328 | -79.442259 |
| 2 | Bayview Village | NaN | 2.0 | 43.786947 | -79.385975 |
| 3 | Bedford Park, Lawrence Manor East | 2.0 | 7.0 | 43.733283 | -79.419750 |
| 4 | Berczy Park | 1.0 | 15.0 | 43.644771 | -79.373306 |
| ... | ... | ... | ... | ... | ... |
| 63 | Victoria Village | NaN | 1.0 | 43.725882 | -79.315572 |
| 64 | Westmount | NaN | 2.0 | 43.696319 | -79.532242 |
| 65 | Wexford, Maryvale | NaN | 1.0 | 43.750072 | -79.295849 |
| 66 | Willowdale, Willowdale East | NaN | 12.0 | 43.770120 | -79.408493 |


In [29]:
print('Number of localities with no Italian restaurant is', rest_table['Italian Restaurant'].isna().sum())

Number of localities with no Italian restaurant is 42


In [30]:
def distance(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in kilometres; inputs are in degrees
    p = 0.017453292519943295  # pi / 180, converts degrees to radians
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    return 12742 * asin(sqrt(a))  # 12742 km = 2 * Earth radius (6371 km)

def closest(data, v_lat, v_lang):
    # Return the record in data (a list of dicts) nearest to (v_lat, v_lang)
    return(min(data, key=lambda p: distance(v_lat,v_lang,p['Venue Latitude'],p['Venue Longitude'])))
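
As a sanity check on the distance helper, the sketch below writes the same haversine formula in its textbook sin² form and applies it to two coordinates that appear in the notebook outputs. `haversine_km` and `EARTH_RADIUS_KM` are illustrative names, and the mean Earth radius of 6371 km is the value implied by the constant 12742 used above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; 2 * 6371 = 12742


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Downtown Toronto to the Berczy Park center, coordinates from the notebook output.
d = haversine_km(43.6563221, -79.3809161, 43.644771, -79.373306)
print(round(d, 2), "km")
```

The result should agree, up to rounding of the input coordinates, with the 'distance from center' value that the notebook reports for Berczy Park (about 1.42 km).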

In [31]:
rest_table['distance from center'] = rest_table.apply(lambda x:distance(downtown_latitude, downtown_longitude, x.Latitude, x.Longitude),axis=1)

In [32]:
dict1 = italian_loc.to_dict('records')
rest_table['distance from italian restaurant'] = rest_table.apply(lambda x:closest(dict1, x.Latitude, x.Longitude),axis=1)
rest_table['italian Latitude']=rest_table['distance from italian restaurant'].apply(lambda x:x['Venue Latitude'])
rest_table['italian Longitude']=rest_table['distance from italian restaurant'].apply(lambda x:x['Venue Longitude'])
rest_table['distance from italian restaurant'] = rest_table.apply(lambda x:distance(x['italian Latitude'], x['italian Longitude'], x.Latitude, x.Longitude),axis=1)

In [33]:
rest_table

| | Neighborhood | Italian Restaurant | Restaurant | Latitude | Longitude | distance from center | distance from italian restaurant | italian Latitude | italian Longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Agincourt | NaN | 1.0 | 43.794200 | -79.262029 | 18.064218 | 4.094519 | 43.778649 | -79.308264 |
| 1 | Bathurst Manor, Wilson Heights, Downsview North | NaN | 5.0 | 43.754328 | -79.442259 | 11.961447 | 2.856828 | 43.734557 | -79.419549 |
| 2 | Bayview Village | NaN | 2.0 | 43.786947 | -79.385975 | 14.530548 | 6.306445 | 43.778649 | -79.308264 |
| 3 | Bedford Park, Lawrence Manor East | 2.0 | 7.0 | 43.733283 | -79.419750 | 9.109348 | 0.088429 | 43.734073 | -79.419870 |
| 4 | Berczy Park | 1.0 | 15.0 | 43.644771 | -79.373306 | 1.422903 | 0.259381 | 43.646964 | -79.374403 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 63 | Victoria Village | NaN | 1.0 | 43.725882 | -79.315572 | 9.350351 | 2.124195 | 43.726575 | -79.341989 |
| 64 | Westmount | NaN | 2.0 | 43.696319 | -79.532242 | 12.957155 | 5.840721 | 43.655191 | -79.487067 |
| 65 | Wexford, Maryvale | NaN | 1.0 | 43.750072 | -79.295849 | 12.467164 | 3.330397 | 43.778649 | -79.308264 |
| 66 | Willowdale, Willowdale East | NaN | 12.0 | 43.770120 | -79.408493 | 12.846383 | 4.052910 | 43.734557 | -79.419549 |


In [35]:
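# index=False keeps only [lat, lon]; otherwise the row index would be read as the latitude
restaurant_latlons = [list(row) for row in rest_loc.itertuples(index=False)]
italian_latlons = [list(row) for row in italian_loc.itertuples(index=False)]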

In [37]:
# For restaurant
map_canada = folium.Map(location=[downtown_latitude, downtown_longitude], zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_canada) #cartodbpositron cartodbdark_matter
HeatMap(restaurant_latlons).add_to(map_canada)
folium.Marker([downtown_latitude, downtown_longitude]).add_to(map_canada)
folium.Circle([downtown_latitude, downtown_longitude], radius=1000, fill=False, color='white').add_to(map_canada)
folium.Circle([downtown_latitude, downtown_longitude], radius=2000, fill=False, color='white').add_to(map_canada)
folium.Circle([downtown_latitude, downtown_longitude], radius=3000, fill=False, color='white').add_to(map_canada)
for lat, lng, borough in zip(df['Latitude'], df['Longitude'], df['Borough']):
    label = '{}'.format( borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_canada)  
map_canada

In [38]:
# For italian restaurant
map_canada = folium.Map(location=[downtown_latitude, downtown_longitude], zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_canada) #cartodbpositron cartodbdark_matter
HeatMap(italian_latlons).add_to(map_canada)
folium.Marker([downtown_latitude, downtown_longitude]).add_to(map_canada)
folium.Circle([downtown_latitude, downtown_longitude], radius=1000, fill=False, color='white').add_to(map_canada)
folium.Circle([downtown_latitude, downtown_longitude], radius=2000, fill=False, color='white').add_to(map_canada)
folium.Circle([downtown_latitude, downtown_longitude], radius=3000, fill=False, color='white').add_to(map_canada)
for lat, lng, borough in zip(df['Latitude'], df['Longitude'], df['Borough']):
    label = '{}'.format( borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_canada)  
map_canada

In [39]:
good_res_count = np.array((rest_table['Restaurant']<=2))
print('Locations with no more than two restaurants nearby:', good_res_count.sum())

# distance() returns kilometres, so the threshold below is 2 km
good_ita_distance = np.array(rest_table['distance from italian restaurant']>=2)
print('Locations with no Italian restaurants within 2 km:', good_ita_distance.sum())

good_locations = np.logical_and(good_res_count, good_ita_distance)
print('Locations with both conditions met:', good_locations.sum())

df_good_locations = rest_table[good_locations].reset_index(drop=True)

Locations with no more than two restaurants nearby: 26
Locations with no Italian restaurants within 2 km: 34
Locations with both conditions met: 21
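
The two conditions can be checked by hand on a few rows. The sketch below copies four rows from the `rest_table` output above and applies the same thresholds in plain Python; `rows`, `MAX_RESTAURANTS`, `MIN_ITALIAN_KM` and `good` are illustrative names.

```python
# (neighborhood, restaurant count, distance to nearest Italian restaurant in km),
# copied from the rest_table output of the notebook.
rows = [
    ("Agincourt", 1, 4.094519),
    ("Bayview Village", 2, 6.306445),
    ("Bedford Park, Lawrence Manor East", 7, 0.088429),
    ("Berczy Park", 15, 0.259381),
]

MAX_RESTAURANTS = 2   # "not crowded" threshold used in the notebook
MIN_ITALIAN_KM = 2.0  # minimum distance to the nearest Italian restaurant

# A location is kept only when both conditions hold.
good = [name for name, n_rest, d_ita in rows
        if n_rest <= MAX_RESTAURANTS and d_ita >= MIN_ITALIAN_KM]
print(good)
```

Agincourt and Bayview Village pass both tests, which matches their presence in `df_good_locations`, while the two busier neighborhoods are filtered out.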


In [40]:
df_good_locations

| | Neighborhood | Italian Restaurant | Restaurant | Latitude | Longitude | distance from center | distance from italian restaurant | italian Latitude | italian Longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Agincourt | NaN | 1.0 | 43.7942 | -79.262029 | 18.064218 | 4.094519 | 43.778649 | -79.308264 |
| 1 | Bayview Village | NaN | 2.0 | 43.786947 | -79.385975 | 14.530548 | 6.306445 | 43.778649 | -79.308264 |
| 2 | Cliffside, Cliffcrest, Scarborough Village West | NaN | 1.0 | 43.716316 | -79.239476 | 13.185121 | 8.220474 | 43.666645 | -79.315204 |
| 3 | Downsview | NaN | 1.0 | 43.737473 | -79.464763 | 11.263401 | 3.626593 | 43.734073 | -79.41987 |
| 4 | Downsview | NaN | 1.0 | 43.739015 | -79.506944 | 13.682129 | 7.01715 | 43.734073 | -79.41987 |
| 5 | Downsview | NaN | 1.0 | 43.728496 | -79.495697 | 12.229958 | 6.124078 | 43.734073 | -79.41987 |
| 6 | Downsview | NaN | 1.0 | 43.761631 | -79.520999 | 16.245002 | 8.682089 | 43.734073 | -79.41987 |
| 7 | Forest Hill North & West, Forest Hill Road Park | NaN | 1.0 | 43.696948 | -79.411307 | 5.136154 | 2.009218 | 43.704558 | -79.388639 |
| 8 | Guildwood, Morningside, West Hill | NaN | 2.0 | 43.763573 | -79.188711 | 19.516357 | 9.744735 | 43.778649 | -79.308264 |
| 9 | Hillcrest Village | NaN | 1.0 | 43.803762 | -79.363452 | 16.454536 | 5.236503 | 43.778649 | -79.308264 |


In [41]:
def center_geolocation(geolocations):
    # Spherical centroid: average the points as 3-D unit vectors,
    # then convert the mean vector back to latitude/longitude in degrees
    x = 0
    y = 0
    z = 0
    p = 0.01745329251  # approximately pi / 180 (degrees to radians)

    for lat, lon in geolocations:
        lat = float(lat)*p
        lon = float(lon)*p
        x += cos(lat) * cos(lon)
        y += cos(lat) * sin(lon)
        z += sin(lat)

    x = x / len(geolocations)
    y = y / len(geolocations)
    z = z / len(geolocations)

    return (atan2(z, sqrt(x * x + y * y))*(1/p), atan2(y, x)*(1/p))
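
The function above averages points on the sphere rather than averaging raw latitude and longitude values. The sketch below restates that idea with `math.radians` and `math.degrees` and applies it to two neighborhood centers from the table above. `spherical_centroid` and `pts` are illustrative names. For points this close together, the result should be almost identical to the plain arithmetic mean.

```python
from math import radians, degrees, sin, cos, atan2, sqrt


def spherical_centroid(points):
    """Average (lat, lon) points in degrees via their 3-D unit vectors."""
    x = y = z = 0.0
    for lat, lon in points:
        phi, lam = radians(lat), radians(lon)
        x += cos(phi) * cos(lam)
        y += cos(phi) * sin(lam)
        z += sin(phi)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    # Convert the mean vector back to latitude and longitude.
    return degrees(atan2(z, sqrt(x * x + y * y))), degrees(atan2(y, x))


# Agincourt and Bayview Village centers, taken from the notebook output.
pts = [(43.794200, -79.262029), (43.786947, -79.385975)]
lat_c, lon_c = spherical_centroid(pts)
print(lat_c, lon_c)
```

The vector form matters mainly for points that are far apart or that straddle the ±180° meridian; at city scale it is a safe default rather than a necessity.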

In [43]:
good_latitudes = df_good_locations['Latitude'].values
good_longitudes = df_good_locations['Longitude'].values
good_loc_coords = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]
good_loc_center = center_geolocation(good_loc_coords)
good_loc_center

(43.73994192164354, -79.38149898247876)

In [44]:
good_locations = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]

map_canada = folium.Map(location=good_loc_center, zoom_start=11)
folium.TileLayer('cartodbpositron').add_to(map_canada)
HeatMap(restaurant_latlons).add_to(map_canada)
folium.Circle(good_loc_center, radius=3000, color='white', fill=True, fill_opacity=0.8).add_to(map_canada)
folium.Marker([downtown_latitude, downtown_longitude]).add_to(map_canada)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_canada) 
for lat, lng, borough in zip(df['Latitude'], df['Longitude'], df['Borough']):
    label = '{}'.format( borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_canada)  
map_canada

In [45]:
map_canada = folium.Map(location=good_loc_center, zoom_start=11)
HeatMap(good_locations, radius=25).add_to(map_canada)
folium.Marker([downtown_latitude, downtown_longitude]).add_to(map_canada)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_canada)
for lat, lng, borough in zip(df['Latitude'], df['Longitude'], df['Borough']):
    label = '{}'.format( borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_canada)  
map_canada

Looking good. What we have now is a clear indication of zones with a low number of restaurants in the vicinity and no Italian restaurants nearby.

Let us now cluster those locations to create centers of zones containing good locations. Those zones, their centers and addresses will be the final result of our analysis.
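
To show what the clustering step does, here is a minimal pure-Python sketch of Lloyd's algorithm, the iteration behind k-means. It is a simplified illustration and not scikit-learn's implementation: the initial centers are simply the first `k` points, and distances are Euclidean distances on raw degrees, which is only a rough approximation at city scale. `kmeans`, `pts` and `centers` are illustrative names; the six points are rows of `df_good_locations`.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on 2-D points; centers start at the first k points."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers


# (lat, lon) pairs: four Downsview rows, Agincourt and Hillcrest Village.
pts = [(43.737473, -79.464763), (43.794200, -79.262029), (43.739015, -79.506944),
       (43.728496, -79.495697), (43.803762, -79.363452), (43.761631, -79.520999)]
centers = kmeans(pts, k=2)
for lat, lon in centers:
    print(round(lat, 4), round(lon, 4))
```

Each returned center is a `(latitude, longitude)` pair in the same order as the input columns, which is the order a map marker expects.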

In [65]:
from sklearn.cluster import KMeans

number_of_clusters = 6

good_xys = df_good_locations[['Latitude', 'Longitude']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)
cluster_centers = [[cc[0], cc[1]] for cc in kmeans.cluster_centers_]
map_canada = folium.Map(location=good_loc_center, zoom_start=11)
folium.TileLayer('cartodbpositron').add_to(map_canada)
HeatMap(restaurant_latlons).add_to(map_canada)
folium.Circle(good_loc_center, radius=2500, color='white', fill=True, fill_opacity=0.75).add_to(map_canada)
folium.Marker([downtown_latitude, downtown_longitude]).add_to(map_canada)
for lat, lon in cluster_centers:  # centers are [Latitude, Longitude]
    folium.Circle([lat, lon], radius=50, color='yellow', fill=True, fill_opacity=0.75).add_to(map_canada) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=10, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_canada)
for lat, lng, borough in zip(df['Latitude'], df['Longitude'], df['Borough']):
    label = '{}'.format( borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_canada)  
map_canada

In [66]:
map_canada = folium.Map(location=good_loc_center, zoom_start=11)
folium.Marker([downtown_latitude, downtown_longitude]).add_to(map_canada)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#00000000', fill=True, fill_color='#0066ff', fill_opacity=0.07).add_to(map_canada)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_canada)
for lat, lon in cluster_centers:  # centers are [Latitude, Longitude]
    folium.Circle([lat, lon], radius=500, color='red', fill=True).add_to(map_canada) 
for lat, lng, borough in zip(df['Latitude'], df['Longitude'], df['Borough']):
    label = '{}'.format( borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_canada)  
map_canada

## Locations for Italian restaurants

In [60]:
map_canada = folium.Map(location=good_loc_center, zoom_start=11)
folium.Marker([downtown_latitude, downtown_longitude]).add_to(map_canada)
for lat, lng, neighborhood in zip(df_good_locations['Latitude'], df_good_locations['Longitude'], df_good_locations['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=1,
        parse_html=False).add_to(map_canada) 
map_canada

## Results and Discussion <a name="results"></a>

Our analysis shows that although there is a great number of restaurants in Toronto (523 among the venues returned by Foursquare for our neighborhoods, 40 of them Italian), there are pockets of low restaurant density fairly close to the city center. Availability is also high near the airport, where there are fewer restaurants.

Those location candidates were then clustered to create zones of interest which contain the greatest number of location candidates. Addresses of the centers of those zones were also generated using reverse geocoding, to be used as markers/starting points for more detailed local analysis based on other factors.

The result of all this is 6 zones containing the largest number of potential new restaurant locations, based on the number of and distance to existing venues, both restaurants in general and Italian restaurants in particular. This, of course, does not imply that those zones are actually optimal locations for a new restaurant! The purpose of this analysis was only to provide information on areas close to Downtown Toronto that are not crowded with existing restaurants (particularly Italian ones). It is entirely possible that there is a very good reason for the small number of restaurants in any of those areas, one that would make them unsuitable for a new restaurant regardless of the lack of competition. The recommended zones should therefore be considered only as a starting point for a more detailed analysis, which could eventually result in a location that not only has no nearby competition but also meets all other relevant conditions.

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify areas of Toronto close to downtown with a low number of restaurants (particularly Italian restaurants) in order to aid stakeholders in narrowing down the search for an optimal location for a new Italian restaurant. By calculating the restaurant density distribution from Foursquare data, we first identified general boroughs that justify further analysis, and then generated an extensive collection of locations which satisfy some basic requirements regarding existing nearby restaurants. Clustering of those locations was then performed in order to create major zones of interest (containing the greatest number of potential locations), and addresses of those zone centers were created to be used as starting points for final exploration by stakeholders.

The final decision on the optimal restaurant location will be made by stakeholders based on specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors like the attractiveness of each location (proximity to a park or water), levels of noise / proximity to major roads, real estate availability, prices, and the social and economic dynamics of every neighborhood.