# Applied Data Science Capstone Project  
## Analysis of Vancouver BC, Canada
### Dino Rossi - November 2020

![Image of Vancouver](https://maps-vancouver.com/img/ban.jpg)

Image source: https://maps-vancouver.com/img/ban.jpg

## 1. Introduction / Business Problem 
Vancouver is a densely populated city in the province of British Columbia, on the Pacific coast of Canada. With a population of 675,000 in the city and 2,500,000 in the metropolitan area, it is the largest city in British Columbia and the third largest metropolitan area in Canada.  

While dense metropolitan areas bring opportunities, they also create constraints. These constraints can lead to stiff competition and high rents for prime business locations. Because of this, choosing the right location for a new business can make the difference between success and failure. The "right" location will mean different things to different people. Some might want to seek out a "low" competition neighborhood where there are few restaurants, while others would prefer a "high" competition neighborhood in order to be situated within a bustling scene.

There is no substitute for local knowledge of a city and its neighborhoods, but larger trends and patterns are often difficult to see. This project uses data science tools and techniques to gain new insights into the city of Vancouver and to surface patterns and trends that are not necessarily visible or obvious. These insights will help determine the best location to open a new restaurant by adding layers of information that complement local knowledge. The results will be usable by anyone looking to open a restaurant in Vancouver, and can be adapted to various use cases. 

<img src="https://www.cas-satj.gc.ca/images/canada-map.png" width="500px">

Image source: https://www.cas-satj.gc.ca/images/canada-map.png

## 2. Data Acquisition and Handling
A *dataset* will be built by *web scraping* and by pulling venue data through the *Foursquare API*. The data will be *wrangled* into shape using the *Pandas* library. Analysis of the data will be carried out with the *Scikit-learn* library; in particular, *K-means clustering* will be used. Finally, the results will be displayed as *maps* of the city, which will be produced using the *Folium* library. These maps can be used to narrow down potential locations for a new restaurant.

### 2.1 Building the Dataset  
In order to build the dataset it is necessary to acquire the postal codes for Vancouver. Canada uses an alphanumeric postal code system. The country is broadly divided into 18 postal regions (see image). These regions are then further subdivided into smaller zones. British Columbia ("V" on the map) has 192 postal codes, but this project will only be looking at the postal codes in and immediately around the city of Vancouver.  
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/Canadian_postal_district_map.svg/1024px-Canadian_postal_district_map.svg.png" width="400">
Image source: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/Canadian_postal_district_map.svg/1024px-Canadian_postal_district_map.svg.png  
  
Zooming in on Vancouver enables us to select the appropriate postal codes for the areas we wish to analyze. By studying the map below, we can see that the postal codes of interest include the ones starting with V5, V6, and V7. According to this site (https://worldpostalcode.com/canada/british-columbia/vancouver), the city of Vancouver includes these postal codes: V6L V5R V6H V6G V6E V6C V6B V6A V5Z V5Y V5X V5W V6K V5T V5S V5P V6M V5N V5M V5L V5V V7Y V7X V6Z V6T V6S V6R V6P V6N V5K V6J  
<img src="https://maps-vancouver.com/img/0/vancouver-postal-code-map.jpg" width="400">
Image source: https://maps-vancouver.com/img/0/vancouver-postal-code-map.jpg  
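
The three-character codes listed above are forward sortation areas, the first half of a full Canadian postal code, and they follow a letter-digit-letter pattern. As a minimal sketch, assuming the codes are available as plain strings, a loose format check combined with a prefix filter for V5, V6, and V7 might look like the following. The helper names `is_fsa` and `in_study_area` are hypothetical and are not part of the notebook.

```python
import re

# Loose pattern for a forward sortation area: letter, digit, letter.
# Canada Post excludes a few letters; this sketch does not check that.
FSA_PATTERN = re.compile(r"[A-Z]\d[A-Z]")

# Prefixes identified from the postal code map discussed in the text.
STUDY_PREFIXES = ("V5", "V6", "V7")


def is_fsa(code: str) -> bool:
    """Return True if code looks like a three-character postal prefix."""
    return FSA_PATTERN.fullmatch(code.strip().upper()) is not None


def in_study_area(code: str) -> bool:
    """Return True if code is well formed and starts with V5, V6 or V7."""
    cleaned = code.strip().upper()
    return is_fsa(cleaned) and cleaned.startswith(STUDY_PREFIXES)


codes = ["V5K", "V6B", "V7Y", "V1A", "v6z", "VAN"]
selected = [c.upper() for c in codes if in_study_area(c)]
print(selected)
```

A prefix filter like this is broader than the explicit list of city codes, so it is best treated as a first pass before matching against that list.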

The raw data for the postal codes can be scraped from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V. The data includes all 192 postal codes for the British Columbia region, so the dataset will need to be narrowed down to include only the appropriate codes.  

Once a clean dataframe of postal codes and neighborhoods is created, web scraping can again be employed, this time to acquire the geolocation (latitude and longitude) for each postal code.  

After the geolocations have been added to the dataframe, API calls can be made to Foursquare to acquire venue data, which in turn will be appended to the dataframe.  
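
To make the Foursquare step more concrete, here is a minimal sketch of how a request URL for one location could be assembled. It assumes the legacy Foursquare v2 `venues/explore` endpoint with its `client_id`, `client_secret`, `v`, `ll`, `radius`, and `limit` query parameters. The function name `build_explore_url`, the credentials, the version date, the radius, and the limit are all placeholders chosen for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed endpoint: the legacy Foursquare v2 "explore" API.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20201101", radius=500, limit=100):
    """Build the request URL for venues near one point (no request is sent)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, YYYYMMDD (placeholder)
        "ll": f"{lat},{lng}",    # latitude,longitude
        "radius": radius,        # search radius in metres (placeholder)
        "limit": limit,          # maximum number of venues (placeholder)
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Example coordinates; the credentials are dummy strings.
url = build_explore_url(49.283199, -123.038055, "YOUR_ID", "YOUR_SECRET")
print(url)
```

Keeping URL construction separate from the network call makes it easy to inspect the request for each postal code before spending API quota.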

Once the dataframe contains all the relevant data, analysis and exploration can begin.

## 3. Code  
This section will show the code used to get, clean, and analyze the data, as well as generate the maps.

In [1]:
# import pandas and read every table on the Wikipedia page into a list of dataframes
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V'
dfs = pd.read_html(url)
print(len(dfs))

5


In [2]:
# create dataframe from correct table and display it
df1 = dfs[0]
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,V1AKimberley,V2APenticton,V3ALangley Township(Langley City),V4ASurreySouthwest,V5ABurnaby(Government Road / Lake City / SFU /...,V6AVancouver(Strathcona / Chinatown / Downtown...,V7ARichmondSouth,V8APowell River,V9AVictoria(Vic West / Esquimalt)Canadian Forc...
1,V1BVernonEast,V2BKamloopsNorthwest,V3BPort CoquitlamCentral,V4BWhite Rock,V5BBurnaby(Parkcrest-Aubrey / Ardingley-Sprott),V6BVancouver(NE Downtown / Gastown / Harbour C...,V7BRichmond(Sea Island / YVR),V8BSquamish,V9BVictoria(West Highlands / North Langford / ...
2,V1CCranbrook,V2CKamloopsCentral and Southeast,V3CPort CoquitlamSouth,V4CDeltaNortheast,V5CBurnaby(Burnaby Heights / Willingdon Height...,V6CVancouver(Waterfront / Coal Harbour / Canad...,V7CRichmondNorthwest,V8CKitimat,V9CVictoria(Colwood / South Langford / Metchosin)
3,V1ESalmon Arm,V2EKamloopsSouth and West,V3ECoquitlamNorth,V4EDeltaEast,V5EBurnaby(Lakeview-Mayfield / Richmond Park /...,V6EVancouver(SE West End / Davie Village),V7ERichmondSouthwest,V8EWhistler,V9EVictoria(East Highlands / NW Saanich)
4,V1GDawson Creek,V2GWilliams Lake,V3GAbbotsfordEast,V4GDeltaEast Central,V5GBurnaby(Cascade-Schou / Douglas-Gilpin),V6GVancouver(NW West End / Stanley Park),V7GNorth Vancouver (district municipality)Oute...,V8GTerrace,V9GLadysmith


In [4]:
# flatten every cell of the table into a single sorted column
import itertools as it
df2 = pd.DataFrame(sorted(it.chain(*df1.values)))
print(df2.shape)
df2.head()

(180, 1)


Unnamed: 0,0
0,V1AKimberley
1,V1BVernonEast
2,V1CCranbrook
3,V1ESalmon Arm
4,V1GDawson Creek
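
The `sorted(it.chain(*df1.values))` call works when every cell holds a string. If a scraped table contained empty cells, sorting would fail with a `TypeError` because missing values cannot be compared with strings. The following is a minimal sketch of a more defensive flatten, shown on a made-up two-column table rather than the real scrape; the `wide` frame and its deliberately empty cell are illustrative only.

```python
import pandas as pd

# Toy stand-in for the scraped table, with one deliberately empty cell.
wide = pd.DataFrame({
    0: ["V1AKimberley", "V1BVernonEast"],
    1: ["V2APenticton", None],
})

flat = (
    pd.Series(wide.to_numpy().ravel())   # flatten row by row into one column
    .dropna()                            # discard empty cells
    .sort_values(ignore_index=True)      # sort and renumber from 0
    .to_frame("post_code")               # name the column directly
)
print(flat)
```

Naming the column in the same chain also removes the need for a separate rename step.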


In [5]:
# rename postal code column to "post_code"
df3 = df2.rename(columns={0: 'post_code'})
df3

Unnamed: 0,post_code
0,V1AKimberley
1,V1BVernonEast
2,V1CCranbrook
3,V1ESalmon Arm
4,V1GDawson Creek
...,...
175,V9VNanaimoNorthwest
176,V9WCampbell RiverCentral
177,V9XNanaimo(Cedar)
178,V9YPort Alberni


In [6]:
# split post codes and neighborhood names into separate columns
df3['neighborhood'] = df3['post_code'].str.slice(start=3)
df3['post_code'] = df3['post_code'].str.slice(stop=3)
df3

Unnamed: 0,post_code,neighborhood
0,V1A,Kimberley
1,V1B,VernonEast
2,V1C,Cranbrook
3,V1E,Salmon Arm
4,V1G,Dawson Creek
...,...,...
175,V9V,NanaimoNorthwest
176,V9W,Campbell RiverCentral
177,V9X,Nanaimo(Cedar)
178,V9Y,Port Alberni


### Select Vancouver city post codes 
Vancouver city post codes include:
V6L V5R V6H V6G V6E V6C V6B V6A V5Z V5Y V5X V5W V6K V5T V5S V5P V6M V5N V5M V5L V5V V7Y V7X V6Z V6T V6S V6R V6P V6N V5K V6J
These need to be selected out of the dataframe, which currently includes all post codes for the Province of British Columbia.

In [35]:
# create a list of post codes in Vancouver city and keep only the matching rows
vancouver_city = 'V6L V5R V6H V6G V6E V6C V6B V6A V5Z V5Y V5X V5W V6K V5T V5S V5P V6M V5N V5M V5L V5V V7Y V7X V6Z V6T V6S V6R V6P V6N V5K V6J'.split()

df4 = df3[df3['post_code'].isin(vancouver_city)]
df4.head()

Unnamed: 0,post_code,neighborhood
87,V5K,Vancouver(North Hastings-Sunrise)
88,V5L,Vancouver(North Grandview-Woodland)
89,V5M,Vancouver(South Hastings-Sunrise / North Renfr...
90,V5N,Vancouver(South Grandview-Woodland / NE Kensin...
91,V5P,Vancouver(SE Kensington-Cedar Cottage / Victor...


In [36]:
# strip the leading "Vancouver(" (10 characters) and the trailing ")" so only the neighborhood names remain
df5 = df4.copy()
df5['neighborhood'] = df5['neighborhood'].str.slice(start = 10, stop = -1)
df5.head()

Unnamed: 0,post_code,neighborhood
87,V5K,North Hastings-Sunrise
88,V5L,North Grandview-Woodland
89,V5M,South Hastings-Sunrise / North Renfrew-Colling...
90,V5N,South Grandview-Woodland / NE Kensington-Cedar...
91,V5P,SE Kensington-Cedar Cottage / Victoria-Fraserview
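
The fixed offsets in `str.slice(start = 10, stop = -1)` rely on every value starting with the ten characters `Vancouver(` and ending with `)`. As a hedged alternative, the sketch below uses a regular expression to separate the city from the parenthesised neighborhood list, and falls back gracefully when an entry has no parentheses. The helper name `split_city_neighborhood` is hypothetical and not part of the notebook.

```python
import re

# City name, followed by an optional parenthesised neighborhood list.
ENTRY_PATTERN = re.compile(
    r"^(?P<city>[^()]+?)\s*(?:\((?P<neighborhood>.*)\))?$"
)


def split_city_neighborhood(text):
    """Split 'City(Neighborhoods)' into (city, neighborhood or None)."""
    match = ENTRY_PATTERN.match(text.strip())
    if match is None:
        # Unexpected layout: return the text unchanged rather than guess.
        return text, None
    return match.group("city"), match.group("neighborhood")


print(split_city_neighborhood("Vancouver(North Hastings-Sunrise)"))
print(split_city_neighborhood("Port Alberni"))
```

Applied to a column with `df4['neighborhood'].apply(split_city_neighborhood)`, this would yield one tuple per row, which can then be unpacked into separate city and neighborhood columns.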


### Scrape the geolocation (latitude and longitude) for each post code and append "Latitude" and "Longitude" columns 

In [37]:
import requests

In [38]:
# a single post code test
resp = requests.get('http://geogratis.gc.ca/services/geolocation/en/locate?q=V5K')
df_test = pd.json_normalize(resp.json())
print('V5K latitude = ' + str(df_test['geometry.coordinates'][0][1]))
print('V5K longitude = ' + str(df_test['geometry.coordinates'][0][0]))

V5K latitude = 49.283199
V5K longitude = -123.038055


In [41]:
# loop through the post codes, request the geolocation of each one,
# and collect the latitude and longitude values
lat = []
long = []
url_base = "http://geogratis.gc.ca/services/geolocation/en/locate?q="

for postal_code in df5["post_code"]:
    url = url_base + postal_code
    resp = requests.get(url)
    df = pd.json_normalize(resp.json())
    # coordinates are stored as [longitude, latitude]
    lat.append(df['geometry.coordinates'][0][1])
    long.append(df['geometry.coordinates'][0][0])

df5['Latitude'] = lat
df5['Longitude'] = long
df5.head()

Unnamed: 0,post_code,neighborhood,Latitude,Longitude
87,V5K,North Hastings-Sunrise,49.283199,-123.038055
88,V5L,North Grandview-Woodland,49.282318,-123.065453
89,V5M,South Hastings-Sunrise / North Renfrew-Colling...,49.261265,-123.038399
90,V5N,South Grandview-Woodland / NE Kensington-Cedar...,49.25679,-123.069305
91,V5P,SE Kensington-Cedar Cottage / Victoria-Fraserview,49.224976,-123.066719
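
The loop above assumes that every request succeeds and returns at least one match. A minimal sketch of a more defensive approach is to isolate the parsing step so that a missing or malformed result yields `None` values instead of an exception. The payload shape (a list of matches whose `geometry.coordinates` hold `[longitude, latitude]`) is inferred from the cells above; the helper name `extract_lat_lon` is hypothetical, and the sample payload is hand-written from the V5K values printed earlier, not a captured API response.

```python
def extract_lat_lon(payload):
    """Return (latitude, longitude) from the first match, or (None, None)."""
    if not payload:
        return None, None
    try:
        # Coordinates are assumed to be ordered [longitude, latitude].
        lon, lat = payload[0]["geometry"]["coordinates"][:2]
    except (KeyError, TypeError, ValueError):
        return None, None
    return lat, lon


# Hand-written sample mimicking the structure read in the notebook.
sample = [{"geometry": {"coordinates": [-123.038055, 49.283199]}}]

print(extract_lat_lon(sample))
print(extract_lat_lon([]))
```

Inside the loop, this could be combined with `resp.raise_for_status()` from `requests` so that HTTP errors are surfaced early, while post codes without a match simply produce empty coordinates.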


In [42]:
# install the Folium mapping library
!pip install folium

Collecting folium
  Downloading folium-0.11.0-py2.py3-none-any.whl (93 kB)
     |████████████████████████████████| 93 kB 3.0 MB/s  eta 0:00:01
Collecting branca>=0.3.0
  Downloading branca-0.4.1-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0


In [44]:
# import necessary dependencies
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # install geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#import requests # library to handle requests
#from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: \ 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
                                                                                                                    |failed

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - cffi -> python[version='2.7.*|3.5.*|3.6.*|3.6.9|3.6.9|3.6.9|3.6.9|>=3.6,<3.7.0a0|>=3.9,<3.10.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0|>=2.7,<2.8.0a0|>=3.5,<3.6.0a0|3.4.*',build='0_73_pypy|3_73_pypy|2_73_py

### Get the geolocation of Vancouver

In [46]:
address = 'Vancouver, BC'

# geocode the city name with Nominatim to get a center point for the map
geolocator = Nominatim(user_agent="vancouver_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Vancouver are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Vancouver are 49.2608724, -123.1139529.


### Render map of Vancouver with popup labels for each neighborhood

In [55]:
map_vancouver = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, post_code, neighborhood in zip(df5['Latitude'], df5['Longitude'], df5['post_code'], df5['neighborhood']):
    label = '{}, {}'.format(post_code, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_vancouver)  
    
map_vancouver