# Capstone Project - Relocation from New York City to Toronto
### For Applied Data Science Capstone Course by IBM/Coursera

## Table of contents
* [Introduction: Business Problem Description](#introduction)
* [Data Collection](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem Description <a name="introduction"></a>

This project is designed to resolve the following assumed business case:

A US business manager who is working for a large financial company and also living in Manhatten, New York city. Recently his company has asked him to relocate from New York to Toronto Canada to lead the efforts to set up a branch business there. He needs to decide where to set up the business office in Toronto. He also needs to select a new residential location for his family, preferrablely in a similar favorable location to his home in Manhatten in New York.

This business manager approached a specialized service company to help him to evaluate the location options in Toronto for both business office and his family home in Toronto. After signing the service contract, this service company starts to work on this project. This project could involve the following aspects:

1. First to evaluate and to understand the preferred characteristics from this US business manager on his current business office and family home in Manhatten in New York.
2. Second to collect and to evaluate information related to Toronto.
3. Compare the similarities and differences between New York and Toronto.
4. Based on US manager's preferred selection criterias to recommend location options on new business office and residential home in Toronto.

For each of above project aspects, we will conduct relevent data collection and data analysis to provide quantitative asscessment. We will work the following data science project steps:
1. Collect neighborhood information between New York and Toronto.
2. Collect relevant venue information that could be required to compare between New York and Toronto. 
3. Analyze the collected data to compare the similarities and differences between New York and Toronto.
4. Based on US business manager's proposed location selection preference criterias to recommend location options for both busines office and family home in Toronto.



## Data Collection <a name="data"></a>

#### Based on the above busienss problem descriptinon, we will collect the following data sets:  
1. Neighborhood and venue information for New York city, including:  
    (1) Neighborhood information for New York city.  
    (2) Neighborhood information for Manhatten, New York.  
    (3) Venues around existing business office in Manhatten, New York city.  
    (4) Venues around existing family home locations in Manhatten, New York city.      


2. Neighborhood and venue information for Toronto, including:  
    (1) Neighborhood information for Toronto.  
    (2) Neighborhood information for Toronto financial central location.  
    (3) Venue information for Toronto financial central location.  

## 1. Collect neighborhood and venue Information for New York city

### (1) Collect neighborhood Information for New York city

To get New York neighborhood information, we first download existing dataset newyork_data online. We convert the online dataset into pandas dataframe. The neighborhood information is then visualized on New York map.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.



First to input the existing dataset from web site

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [4]:
neighborhoods_data = newyork_data['features']


The next is to transform the data into a pandas dataframe.

In [5]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [6]:
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [7]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


Use geopy library to get the latitude and longitude values of New York City

In [8]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 40.7127281, -74.0060152.


In [9]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [10]:
df_newyork = neighborhoods
df_newyork.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


### (2) Collect neighborhood Information for Manhatten, New York

The Manhatten neighborhood data is a subset from New York neighborhood dataset. The data is visualized on a map.

In [11]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [12]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Manhattan are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Manhattan are 40.7900869, -73.9598295.


In [13]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

### (3) Collect venues around existing business office in Manhatten, New York city. 

Assume that the business office address is 100 Wall St New York, near New York Stock Exchange location.

#### We are using Foursqure API to collect venue information

In [14]:
CLIENT_ID = '0D33T0AVFXWBBXCZDHJ2IA32T5GG5IK1J5GYXBAIP14Z4KRF' # your Foursquare ID
CLIENT_SECRET = '0L2NKXGCDXVV3HLBJDOJDSNIOPCOODUIZOQQECZ4DC5F2ZIQ' # your Foursquare Secret
VERSION = '20180604'

In [15]:
address = '100 Wall St, New York, NY'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

40.7052203 -74.006799602293


In [16]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url 

'https://api.foursquare.com/v2/venues/explore?&client_id=0D33T0AVFXWBBXCZDHJ2IA32T5GG5IK1J5GYXBAIP14Z4KRF&client_secret=0L2NKXGCDXVV3HLBJDOJDSNIOPCOODUIZOQQECZ4DC5F2ZIQ&v=20180604&ll=40.7052203,-74.006799602293&radius=500&limit=100'

In [None]:
results = requests.get(url).json()
results

In [18]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [51]:
venues = results['response']['groups'][0]['items'] 

nearby_venues = json_normalize(venues) 

In [52]:
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

ManhattenOffice_venues = nearby_venues
ManhattenOffice_venues.head(20)

Unnamed: 0,name,categories,lat,lng
0,Crown Shy,Restaurant,40.706187,-74.00749
1,sweetgreen,Salad Place,40.705586,-74.008382
2,Black Fox Coffee Co.,Coffee Shop,40.706573,-74.008155
3,La Colombe Torrefaction,Coffee Shop,40.705899,-74.008421
4,East River Esplanade,Pedestrian Plaza,40.704847,-74.004593
5,SoulCycle FiDi,Cycle Studio,40.706904,-74.006717
6,Dig Inn,American Restaurant,40.706106,-74.00729
7,City Acres Market,Food Court,40.706261,-74.00773
8,Adel's best #1 Halal Food Cart,Falafel Restaurant,40.705609,-74.005599
9,Westville Wall Street,American Restaurant,40.70476,-74.006732


In [21]:
print('Around Manhatten Office, there are over {} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

Around Manhatten Office, there are over 100 venues were returned by Foursquare.


### (4) Collect venues around family home locations in Manhatten, New York city

Assume that the family home address is 63 Wall St New York, where is well known 63 Wall St Apartment building. 

In [22]:
address = '63 Wall St, New York, NY'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

40.70564705 -74.0087563616202


In [23]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)

def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [24]:
venues = results['response']['groups'][0]['items'] 
nearby_venues = json_normalize(venues) 

filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

ManhattenHome_venues = nearby_venues
ManhattenHome_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Crown Shy,Restaurant,40.706187,-74.00749
1,sweetgreen,Salad Place,40.705586,-74.008382
2,Black Fox Coffee Co.,Coffee Shop,40.706573,-74.008155
3,La Colombe Torrefaction,Coffee Shop,40.705899,-74.008421
4,East River Esplanade,Pedestrian Plaza,40.704847,-74.004593


In [25]:
print('Around Manhatten Family Home, there are over {} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

Around Manhatten Family Home, there are over 100 venues were returned by Foursquare.


## Output four collected New York datasets as local data files

In [26]:
neighborhoods.to_csv("New_York_neighborhood_data.csv")
manhattan_data.to_csv("Manhattan_neighborhood_data.csv")
ManhattenOffice_venues.to_csv("ManhattanOffice_venues_data.csv")
ManhattenHome_venues.to_csv("ManhattanHome_venues_data.csv")

## Collect neighborhood and venue information for Toronto

We will collect the following neighborhood and venue information for Toronto:  
    (1) Neighborhood information for Toronto.  
    (2) Neighborhood information for Toronto financial central location.   
    (3) Venue information for Toronto financial central location.

## (1) Collect neighborhood information for Toronto

The Toronto neighborhood data is scraped from a Wikipedia page. The data was structured into pandas dataframe

In [27]:
import lxml.html as lh

In [28]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

In [29]:
col=[]
i=0
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    col.append((name,[]))

In [30]:
for j in range(1,len(tr_elements)):
    T=tr_elements[j]
    
    if len(T)!=3:
        break
    i=0
    for t in T.iterchildren():
        data=t.text_content() 
        if i>0:
            try:
                data=int(data)
            except:
                pass
        col[i][1].append(data)
        i+=1

In [31]:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

In [32]:
df.columns = df.columns.str.strip()
df['Neighbourhood'] = df.Neighbourhood.str.replace('\n','')
df.drop(df.loc[df['Borough']=="Not assigned"].index, inplace=True)

In [33]:
df_group=df.groupby(['Postcode','Borough'],as_index=False)['Neighbourhood'].agg(','.join)

In [34]:
df_coordinate=pd.read_csv('Geospatial_Coordinates.csv')
df_coordinate.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
df_toronto = df_group.merge(df_coordinate,on='Postcode')

In [35]:
df_toronto.rename(columns={'Neighbourhood': 'Neighborhood'}, inplace=True)

In [36]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_toronto['Borough'].unique()),
        df_toronto.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


In [37]:
address = 'Toronto City, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto City are 43.7189883, -79.44157.


In [38]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [39]:
df_toronto = df_toronto.drop(['Postcode'], axis=1)

In [40]:
df_toronto.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,Scarborough,Woburn,43.770992,-79.216917
4,Scarborough,Cedarbrae,43.773136,-79.239476


In [41]:
df_toronto.shape

(103, 4)

## (2) Collect neighborhood information for Toronto financial central location  

The Toronto financial center is located in Central Toronto borough. The Central Toronto neighborhood data is a subset from Toronto neighborhood dataset. It is visually shown on a map.

In [53]:
TorontoCentral_data = df_toronto[df_toronto['Borough'] == 'Central Toronto'].reset_index(drop=True)
TorontoCentral_data.head(10)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Central Toronto,Lawrence Park,43.72802,-79.38879
1,Central Toronto,Davisville North,43.712751,-79.390197
2,Central Toronto,North Toronto West,43.715383,-79.405678
3,Central Toronto,Davisville,43.704324,-79.38879
4,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
5,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049
6,Central Toronto,Roselawn,43.711695,-79.416936
7,Central Toronto,"Forest Hill North,Forest Hill West",43.696948,-79.411307
8,Central Toronto,"The Annex,North Midtown,Yorkville",43.67271,-79.405678


In [43]:
address = 'Central Toronto, Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Central Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Central Toronto are 43.653963, -79.387207.


In [44]:
# create map of Toronto Central using latitude and longitude values
map_TorontoCentral = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(TorontoCentral_data['Latitude'], TorontoCentral_data['Longitude'], TorontoCentral_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_TorontoCentral)  
    
map_TorontoCentral

## (3) Collect venue information for Toronto financial central location 

In [45]:
print('The geograpical coordinate of Central Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Central Toronto are 43.653963, -79.387207.


In [46]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)

def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [54]:
venues = results['response']['groups'][0]['items'] 
nearby_venues = json_normalize(venues) 

filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

TorontoCentral_venues = nearby_venues
TorontoCentral_venues.head(20)

Unnamed: 0,name,categories,lat,lng
0,Crown Shy,Restaurant,40.706187,-74.00749
1,sweetgreen,Salad Place,40.705586,-74.008382
2,Black Fox Coffee Co.,Coffee Shop,40.706573,-74.008155
3,La Colombe Torrefaction,Coffee Shop,40.705899,-74.008421
4,East River Esplanade,Pedestrian Plaza,40.704847,-74.004593
5,SoulCycle FiDi,Cycle Studio,40.706904,-74.006717
6,Dig Inn,American Restaurant,40.706106,-74.00729
7,City Acres Market,Food Court,40.706261,-74.00773
8,Adel's best #1 Halal Food Cart,Falafel Restaurant,40.705609,-74.005599
9,Westville Wall Street,American Restaurant,40.70476,-74.006732


In [48]:
print('Around Central Toronto, there are over {} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

Around Central Toronto, there are over 100 venues were returned by Foursquare.


## Output three collected Toronto datasets as local data files

In [49]:
df_toronto.to_csv("Toronto_neighborhood_data.csv")
TorontoCentral_data.to_csv("TorontoCentral_neighborhood_data.csv")
TorontoCentral_venues.to_csv("TorontoCentral_venues_data.csv")

## A description of how the data will be used to solve the problem.

We will use the above collected datasets in following ways for the second half of this project that will be submitted as Capstone Week 2 project.  

For the New York and Manhatten neighborhood and venue datasets, we will analyze to identify the clustering characteristics. Based on such analysis, we will define the selection criterias for the business office and family home locations in Toronto.

For the Toronto neighborhood and venue datasets, we will first analyze the clustering characteristics and to compare with New York and Manhatten data. Based on such analysis and comparison, we will identify potential location options for new business office and family home in Toronto.

# This concludes Capstone Week 1 project for business description and data collection and Screen.