### <p style="text-align: center;">The purpose of this notebook is to build the final assignment of the IBM Data Science courses:</p>
## <p style="text-align: center;">The capstone project</p>

In [1]:
import pandas as pd
import numpy as np

In [2]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


# Introduction/Business Problem

I imagine working for a company that offers business analysis to other businesses. Our current client wants to open his own sushi restaurant in Toronto. He has no particular preference about the location, but wants it placed strategically, leaving cost aside at first. He has therefore asked us to find the Toronto neighborhood that is most attractive with respect to both:
* Presence of potential clients
* Competition with similar restaurants

# Data

To address this problem, I will use Foursquare location data. The idea is to request the venues in each of Toronto's neighborhoods and then analyze them. To locate the neighborhoods and make the Foursquare requests, I will first use the Wikipedia page listing these neighborhoods by postal code, and Geocoder to get the coordinates I need.
As an example of the whole data collection process, consider the first assigned neighborhood on the Wikipedia page: Parkwoods. I collect its name and postal code, and use the postal code to find the neighborhood's coordinates with Geocoder. With these coordinates I can then find all nearby venues through Foursquare's API. The result goes into a dataframe, and the process is repeated for every neighborhood.

# Methodology

## Wikipedia scraping

In [1]:
import urllib.request

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [3]:
page = urllib.request.urlopen(url)

In [4]:
from bs4 import BeautifulSoup

In [5]:
soup = BeautifulSoup(page, "lxml")

In [6]:
right_table=soup.find('table', class_='wikitable sortable')

In [7]:
A=[]
B=[]
C=[]
for row in right_table.findAll('tr'):
    cells=row.findAll('td')
    if len(cells)==3:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
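
The cells collected this way keep the trailing "\n" that is stripped later in the notebook. As a minimal sketch of an alternative, the snippet below parses a small hand-written fragment (an assumption, shaped like the three-column table on the Wikipedia page) and uses `get_text(strip=True)` so the values come out already trimmed. It uses the built-in `html.parser` instead of `lxml` only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the layout of the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row only has <th> cells, so it is skipped
        # get_text(strip=True) removes the surrounding whitespace and newlines
        rows.append(tuple(cell.get_text(strip=True) for cell in cells))

print(rows)
```

With values trimmed at this stage, the later filter would compare against `'Not assigned'` rather than `'Not assigned\n'`.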

## Pandas processing

In [8]:
import pandas as pd

In [9]:
df=pd.DataFrame(A,columns=['Postal Code'])
df['Borough']=B
df['Neighbourhood']=C
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Removing the rows whose Borough is not assigned:

In [22]:
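# .copy() makes toronto_df an independent frame, so columns added later do not trigger SettingWithCopyWarning
toronto_df = df[df['Borough']!='Not assigned\n'].copy()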
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Checking whether any Neighbourhood is still not assigned:

In [23]:
toronto_df[toronto_df['Neighbourhood']=='Not assigned\n']

Unnamed: 0,Postal Code,Borough,Neighbourhood


There isn't! Our dataframe is almost ready.

All that remains is to remove the unnecessary "\n" at the end of each value:

In [24]:
toronto_df = toronto_df.replace('\n','', regex=True)

## Mapping of the boroughs

In [14]:
!pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [15]:
import geocoder

First we need the coordinates of each postal code:

In [25]:
latitude=[]
longitude=[]

for n in range(len(toronto_df)):
    
    postal_code = toronto_df.iloc[n,0]

    g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng

    latitude.append(lat_lng_coords[0])
    longitude.append(lat_lng_coords[1])
    
toronto_df['Latitude'] = latitude
toronto_df['Longitude'] = longitude

Our dataframe is ready :

In [17]:
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.75188,-79.33036
3,M4A,North York,Victoria Village,43.73042,-79.31282
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65514,-79.36265
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72321,-79.45141
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66449,-79.39302


In [20]:
!pip install folium

Collecting folium
  Downloading https://files.pythonhosted.org/packages/a4/f0/44e69d50519880287cc41e7c8a6acc58daa9a9acf5f6afc52bcc70f69a6d/folium-0.11.0-py2.py3-none-any.whl (93kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0


In [21]:
import folium
import requests
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

In [36]:
latitude = toronto_df['Latitude'].mean(axis=0)
longitude = toronto_df['Longitude'].mean(axis=0)

In [37]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, borough, neighbourhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Extracting and exploring our dataset

Let's explore a first neighborhood to see how our data is structured :

In [38]:
# The code was removed by Watson Studio for sharing.
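
The hidden cell presumably defines the `CLIENT_ID`, `CLIENT_SECRET` and `VERSION` values used in the requests below. One hedged way to provide them without writing secrets into the notebook is to read them from environment variables. The variable names and the helper `load_foursquare_credentials` are assumptions for illustration, not what the original cell contained.

```python
import os


def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret, version) from environment variables."""
    env = os.environ if env is None else env
    # Hypothetical variable names; use whatever your deployment defines.
    names = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET", "FOURSQUARE_VERSION")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)


# In the notebook this could be used as:
# CLIENT_ID, CLIENT_SECRET, VERSION = load_foursquare_credentials()
```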

In [39]:
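# toronto_df keeps the original row labels (they start at 2), so the first row is selected by position
neighborhood_latitude = toronto_df['Latitude'].iloc[0]
neighborhood_longitude = toronto_df['Longitude'].iloc[0]

neighborhood_name = toronto_df['Neighbourhood'].iloc[0]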

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Parkwoods are 43.75188000000003, -79.33035999999998.


In [40]:
LIMIT = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
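
The request URL above is assembled by hand with `str.format`. As a sketch of an alternative, the query string can be built from a dictionary with `urllib.parse.urlencode`, which keeps each parameter named and handles escaping. The helper name `build_explore_url` and the placeholder credential values are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials; the coordinates are those of Parkwoods from the dataframe.
example_url = build_explore_url("ID", "SECRET", "YYYYMMDD", 43.75188, -79.33036)

# Decoding the query string recovers the parameters that were passed in.
decoded = parse_qs(urlparse(example_url).query)
print(decoded["ll"], decoded["radius"], decoded["limit"])
```

Passing the same dictionary as `params=` to `requests.get` would achieve the same result without building the string at all.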

In [41]:
results = requests.get(url).json()

In [42]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Here are the first venues found in Parkwoods :

In [43]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues)

filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,PetSmart,Pet Store,43.748639,-79.333488
2,Variety Store,Food & Drink Shop,43.751974,-79.333114


In [44]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.


In [46]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
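
`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` for a venue returned without any category. The sketch below isolates the flattening step and falls back to `None` in that case, mirroring what `get_category_type` does. The helper name `flatten_venues` and the uncategorized sample venue are hypothetical; the other sample values come from the Parkwoods results above.

```python
def flatten_venues(name, lat, lng, items):
    """Turn Foursquare 'items' into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use the first category name when there is one, otherwise None.
        category = categories[0]["name"] if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


sample_items = [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    # Hypothetical venue with an empty category list.
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": []}},
]

for row in flatten_venues("Parkwoods", 43.75188, -79.33036, sample_items):
    print(row)
```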

In [47]:
toronto_venues = getNearbyVenues(names=toronto_df['Neighbourhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )

In [77]:
print(toronto_venues.shape)
toronto_venues.head()

(2413, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.75188,-79.33036,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.75188,-79.33036,PetSmart,43.748639,-79.333488,Pet Store
2,Parkwoods,43.75188,-79.33036,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.73042,-79.31282,Memories of Africa,43.726602,-79.312427,Grocery Store
4,Victoria Village,43.73042,-79.31282,The Retreat Nail & Beauty Bar,43.726134,-79.312205,Nail Salon


In [87]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 265 unique categories.


We extract only the restaurants from the dataframe :

In [97]:
# Filter and reset the index in one step, so restaurant_df is a new frame rather than a modified slice
restaurant_df = toronto_venues[toronto_venues['Venue Category'].str.contains('Restaurant')].reset_index()
print(restaurant_df.shape)
restaurant_df.head()

(553, 8)


Unnamed: 0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,16,"Regent Park, Harbourfront",43.65514,-79.36265,Sukhothai,43.658444,-79.365681,Thai Restaurant
1,20,"Regent Park, Harbourfront",43.65514,-79.36265,Mangia and Bevi Resto-Bar,43.65225,-79.366355,Italian Restaurant
2,21,"Regent Park, Harbourfront",43.65514,-79.36265,Impact Kitchen,43.656369,-79.35698,Restaurant
3,26,"Regent Park, Harbourfront",43.65514,-79.36265,Flame Shack,43.656844,-79.358917,Restaurant
4,36,"Lawrence Manor, Lawrence Heights",43.72321,-79.45141,JOEY,43.724131,-79.454042,American Restaurant


## Counting the restaurants and sushi restaurants in each neighborhood

In [112]:
restaurants_count = restaurant_df[['Neighborhood', 'Venue']].groupby('Neighborhood').count().reset_index()
restaurants_count.rename(columns={"Venue": "Restaurants Count"}, inplace=True)
restaurants_count.set_index('Neighborhood', inplace=True)
restaurants_count.head()

Neighborhood,Restaurants Count
Agincourt,1
"Bedford Park, Lawrence Manor East",9
Berczy Park,17
"Brockton, Parkdale Village, Exhibition Place",18
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",28


In [113]:
sushi_df = restaurant_df[restaurant_df['Venue Category'].str.contains('Sushi')]
sushi_count = sushi_df[['Neighborhood', 'Venue']].groupby('Neighborhood').count().reset_index()
sushi_count.rename(columns={"Venue": "Sushi Count"}, inplace=True)
sushi_count.set_index('Neighborhood', inplace=True)
sushi_count.head()

Neighborhood,Sushi Count
Agincourt,1
"Bedford Park, Lawrence Manor East",1
Berczy Park,1
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",2
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",1


In [145]:
final_df = sushi_count.join(restaurants_count)

In [146]:
final_df = final_df.join(toronto_df.set_index('Neighbourhood'))

In [147]:
final_df.reset_index(inplace=True)
final_df.rename(columns={"level_0":"Neighborhood"}, inplace=True)
final_df.head()

Unnamed: 0,Neighborhood,Sushi Count,Restaurants Count,index,Postal Code,Borough,Latitude,Longitude
0,Agincourt,1,1,117,M1S,Scarborough,43.79394,-79.26711
1,"Bedford Park, Lawrence Manor East",1,9,85,M5M,North York,43.73546,-79.41915
2,Berczy Park,1,17,31,M5E,Downtown Toronto,43.64531,-79.37368
3,"Business reply mail Processing Centre, South C...",2,28,168,M7Y,East Toronto,43.64869,-79.38544
4,"CN Tower, King and Spadina, Railway Lands, Har...",1,18,139,M5V,Downtown Toronto,43.64082,-79.39956
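
Note that `sushi_count.join(restaurants_count)` is a left join on the index, so only neighborhoods that appear in `sushi_count` survive. The small sketch below illustrates this with three counts taken from the tables above; the variable names are chosen for the example only.

```python
import pandas as pd

# Sushi counts exist for two neighborhoods only.
sushi = pd.DataFrame(
    {"Sushi Count": [1, 1]},
    index=pd.Index(["Agincourt", "Berczy Park"], name="Neighborhood"),
)

# Restaurant counts exist for a third neighborhood as well.
restaurants = pd.DataFrame(
    {"Restaurants Count": [1, 17, 18]},
    index=pd.Index(
        ["Agincourt", "Berczy Park", "Brockton, Parkdale Village, Exhibition Place"],
        name="Neighborhood",
    ),
)

# DataFrame.join defaults to how="left": rows follow the calling frame's index.
joined = sushi.join(restaurants)
print(joined)
```

This is why neighborhoods without any sushi restaurant never reach the scoring step.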


## Final scoring and mapping

Here we assign a score to each neighborhood with at least one sushi restaurant, based on its number of restaurants and of sushi restaurants: \
The more restaurants, the higher the score, because there is demand for these venues. \
The more sushi restaurants, the lower the score, because there is more competition.

In [152]:
final_df['Score'] = final_df['Restaurants Count']/final_df['Sushi Count']
final_df['Score'] = round(final_df['Score']/final_df['Score'].max()*5, 1)
final_df.sort_values(by='Score', ascending=False, inplace=True)
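
To make the formula concrete, here is a plain-Python sketch of the same computation on four neighborhoods, using the counts reported in this notebook: the ratio of restaurants to sushi restaurants is divided by the best ratio and rescaled to a 0 to 5 range. The dictionary names are illustrative only.

```python
# (restaurants count, sushi count) taken from the notebook's results
counts = {
    "First Canadian Place, Underground city": (27, 1),
    "Toronto Dominion Centre, Design Exchange": (26, 1),
    "Garden District, Ryerson": (23, 1),
    "Richmond, Adelaide, King": (25, 2),
}

# Raw ratio: restaurants per sushi restaurant.
ratios = {name: restaurants / sushi for name, (restaurants, sushi) in counts.items()}

# Normalize by the best ratio and rescale to a score out of 5.
best = max(ratios.values())
scores = {name: round(ratio / best * 5, 1) for name, ratio in ratios.items()}

for name, score in sorted(scores.items(), key=lambda pair: pair[1], reverse=True):
    print(name, score)
```

Because the score is relative to the best ratio, it ranks neighborhoods against each other and has no absolute meaning.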

In [165]:
mapping_df = final_df.head(15)
mapping_df

Unnamed: 0,Neighborhood,Sushi Count,Restaurants Count,index,Postal Code,Borough,Latitude,Longitude,Score
9,"First Canadian Place, Underground city",1,27,157,M5X,Downtown Toronto,43.64828,-79.38146,5.0
21,"Toronto Dominion Centre, Design Exchange",1,26,67,M5K,Downtown Toronto,43.6471,-79.38153,4.8
10,"Garden District, Ryerson",1,23,13,M5B,Downtown Toronto,43.65736,-79.37818,4.3
4,"CN Tower, King and Spadina, Railway Lands, Har...",1,18,139,M5V,Downtown Toronto,43.64082,-79.39956,3.3
2,Berczy Park,1,17,31,M5E,Downtown Toronto,43.64531,-79.37368,3.1
22,"University of Toronto, Harbord",1,15,121,M5S,Downtown Toronto,43.66311,-79.4018,2.8
3,"Business reply mail Processing Centre, South C...",2,28,168,M7Y,East Toronto,43.64869,-79.38544,2.6
5,Canada Post Gateway Processing Centre,2,28,114,M7R,Mississauga,43.64869,-79.38544,2.6
19,Stn A PO Boxes,2,28,148,M5W,Downtown Toronto,43.64869,-79.38544,2.6
18,"Richmond, Adelaide, King",2,25,49,M5H,Downtown Toronto,43.6497,-79.38258,2.3


In [169]:
latitude = mapping_df['Latitude'].mean(axis=0)
longitude = mapping_df['Longitude'].mean(axis=0)

map_sushi = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, score, neighborhood in zip(mapping_df['Latitude'], mapping_df['Longitude'], mapping_df['Score'], mapping_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, score)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sushi)  
    
map_sushi

# Results

Thanks to our analysis, we narrowed the possible locations for a new sushi restaurant down to a few. \
The score assigned to each neighborhood shows that 3 of them stand out from the others, with a score above 4 out of 5. These places have many restaurants, and therefore many potential clients, but few sushi restaurants so far.
If our client is not interested in these 3 locations for any reason (financial, for instance), we can give him an ordered list of the most attractive neighborhoods in which to open his restaurant.

# Discussion

As noted above, 3 neighborhoods seem very interesting for our client, namely:
* *First Canadian Place, Underground city*
* *Toronto Dominion Centre, Design Exchange*
* *Garden District, Ryerson*

However, these locations are in the center of Toronto, and opening a new restaurant there may be expensive. We can therefore also suggest some locations that score somewhat lower for a sushi restaurant but are further from the center. These are:
* *Willowdale, Willowdale East*
* *University of Toronto, Harbord*
* *CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport*

# Conclusion

The detailed results make sense: when costs are left aside, the best locations for a new restaurant are near the center of town. Nevertheless, we were able to find some locations that seem to offer a good compromise between attractiveness and cost. \
We truly hope that our analysis will help our client find the perfect location for his restaurant.