# Business Problem

Leo is an owner of the Hotpot restaurant chain in China. Recently, his daughter came to Toronto to pursue her master's degree, so Leo decides to expand his business to Canada and move to Toronto with her daughter to take good care of her. Since Leo is not very good at English, he'd like to live in a neighbourhood where many Chinese people gather. However, since he also needs to make a living, he didn't want to open his hotpot restaurant in the places where too many Chinese restaurants existed, because too many competitors will make the business to be hard-hitting.So we need to help him find the place where Chinese people like to gather but without too many Chinese restaurants.

# Data Description

To solving Leo's problem, we will use the Foursquare location data and postal-code list data from Wikipedia and Geospatial_Coordinates data in combination. We firstly scraped the postal-code list from URL: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. The list from Wikipedia contains Canadian Postal code with the beginning of M, and the corresponding Borough and Neighborhood. Then we combine the list table with Geospatial_Coordinates data which includes each Neighborhoods'longitude and latitude. After these processes have been done, we begin to explore the neighbourhoods in Toronto by using Foursquare location data. For instance, if Leo would like to live in the downtown Toronto, we first of all cluster the neighbourhoods by Downtown Toronto, and use Foursquare location data to build up the table which contains Downtown Toronto's Neighborhoods and the popular venues (bank, cafe, yoga studio, Indian restaurant,Chinese restaurant)in each of the neighbourhoods. Then according to the data frame, we can figure out the appropriate neighbourhood for Leo to live Where the Chinese like to gather, and develop a hotpot business where isn't many competitors exist.

# Importing necessary libraries

In [2]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import urllib.request
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json 
import requests 
from pandas.io.json import json_normalize 
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim
import folium


# Loading Data

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")
right_table=soup.find('table', class_='wikitable sortable')
right_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td>

# Setting up three columns: PostalCode, Borough, and Neighborhood

In [4]:
data = []
columns = []
table = soup.find(class_='wikitable')
for index, tr in enumerate(right_table.find_all('tr')):
    section = []
    for td in tr.find_all(['th','td']):
        section.append(td.text.rstrip())
    
    #First row of data is the header
    if (index == 0):
        columns = section
    else:
        data.append(section)

#convert list into Pandas DataFrame
df1 = pd.DataFrame(data = data,columns = columns)
df1.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


By looking at the data frame we set up, there are some missing values existed in this table. In order to assure the accuracy and consistency of our analysis, we need to clean up the missing values.

# Data Cleanup

In [5]:
#Ignore cells with a borough that is Not assigned.'
df1.drop(df1.index[df1['Borough'] == 'Not assigned'], inplace = True)
df1 = df1.reset_index(drop=True)
df1.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [6]:
df1.loc[df1['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df1['Borough']
df1.shape

(103, 3)

From nowon, we got the dataframe without missing data and duplicate values, let's combine it with geospatial_coordinates data.

In [7]:
import geocoder # import geocoder
data=pd.read_csv("/Users/iris/Desktop/Geospatial_Coordinates.csv")
data

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [8]:
df1['Latitude']=data['Latitude'].values
df1['Longitude']=data['Longitude'].values
df1

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.806686,-79.194353
1,M4A,North York,Victoria Village,43.784535,-79.160497
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.763573,-79.188711
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.770992,-79.216917
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.773136,-79.239476
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.744734,-79.239476
6,M1B,Scarborough,"Malvern, Rouge",43.727929,-79.262029
7,M3B,North York,Don Mills,43.711112,-79.284577
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.716316,-79.239476
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.692657,-79.264848


Since Leo's daughter is studying in the downtown Toronto, Leo prefer to live in the downtown toronto to take good care of his daughter. So let's start focusing on Downtown Toronto's neighborhood.

# Cluster Neighborhood

In [9]:
toronto_data = df1[df1['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.763573,-79.188711
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.773136,-79.239476
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.692657,-79.264848
3,M5C,Downtown Toronto,St. James Town,43.799525,-79.318389
4,M5E,Downtown Toronto,Berczy Park,43.75749,-79.374714


In [10]:
#Getting address of Downtown Toronto
address = 'Downtown Toronto, ON'
geolocator = Nominatim(user_agent="trt_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Downtown Toronto are 43.6563221, -79.3809161.


In [34]:
# create map of Downtown Toronto using latitude and longitude values
map_trt = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df1['Latitude'], df1['Longitude'], df1['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_trt)  
    
map_trt

# Exploring Neighborhoods in Downtown Toronto

In [36]:
neighborhood_latitude = df1.loc[0, 'Latitude'] 
neighborhood_longitude = df1.loc[0, 'Longitude'] 

neighborhood_name = df1.loc[0, 'Neighborhood']
LIMIT=100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, neighborhood_latitude, 
    neighborhood_longitude, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=A3OE4MF5PIMSSGP4WYSJ1IWTZCY54VSETX0XZMF0ZTPEYT3K&client_secret=IO40RBJTWRID4F2JFHWSHQSSK3VTNZJ0X11E4BBOR0BE4A0I&ll=43.6563221,-79.3809161&v=20180605&query=43.806686299999996&radius=-79.19435340000001&limit=500'

By using foursqure data, we could explore the nearby venues in downtown toronto to have a look at popular venues among each neighborhood.

In [37]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [38]:
trt_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Stn A PO Boxes
St. James Town, Cabbagetown
First Canadian Place, Underground city
Church and Wellesley


In [39]:
#check how many venues were returned for each neighborhood
trt_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",15,15,15,15,15,15
Central Bay Street,6,6,6,6,6,6
Christie,2,2,2,2,2,2
Church and Wellesley,7,7,7,7,7,7
"Commerce Court, Victoria Hotel",4,4,4,4,4,4
"First Canadian Place, Underground city",1,1,1,1,1,1
"Garden District, Ryerson",4,4,4,4,4,4
"Harbourfront East, Union Station, Toronto Islands",7,7,7,7,7,7
"Kensington Market, Chinatown, Grange Park",35,35,35,35,35,35
"Queen's Park, Ontario Provincial Government",9,9,9,9,9,9


# Analyzing Neighborhood

In [40]:
# one hot encoding
trt_onehot = pd.get_dummies(trt_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
trt_onehot['Neighborhood'] = trt_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [trt_onehot.columns[-1]] + list(trt_onehot.columns[:-1])
trt_onehot = trt_onehot[fixed_columns]
trt_grouped = trt_onehot.groupby('Neighborhood').mean().reset_index()
trt_grouped.shape

(18, 84)

# Checking out TOP 10 poplular venues in Downtown Toronto

In [41]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = trt_grouped['Neighborhood']

for ind in np.arange(trt_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(trt_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"CN Tower, King and Spadina, Railway Lands, Har...",Pizza Place,Brewery,Comic Shop,Farmers Market,Restaurant,Light Rail Station,Skate Park,Fast Food Restaurant,Burrito Place,Spa
1,Central Bay Street,Pharmacy,Grocery Store,Bank,Discount Store,Coffee Shop,Pizza Place,Fast Food Restaurant,Dessert Shop,Diner,Electronics Store
2,Christie,Food & Drink Shop,Park,Yoga Studio,Diner,Discount Store,Electronics Store,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Fish & Chips Shop
3,Church and Wellesley,Pizza Place,Discount Store,Coffee Shop,Intersection,Sandwich Place,Chinese Restaurant,Middle Eastern Restaurant,Fast Food Restaurant,Diner,Electronics Store
4,"Commerce Court, Victoria Hotel",Playground,Park,Trail,Tennis Court,Fish & Chips Shop,Dessert Shop,Diner,Discount Store,Electronics Store,Falafel Restaurant


In [43]:
trt_data = neighborhoods_venues_sorted[neighborhoods_venues_sorted['3rd Most Common Venue'] == 'Chinese'].reset_index(drop=True)
trt_data.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


According to the table above, both St.James Town and Church and Wellesley seem very suitable places for Leo to live, since both of them have a very common venue which is a Chinese restaurant. So it's very convenient for Leo and her daughter to have hometown food, also Chinese restaurant becomes popular venue could mean that there might be many Chinese people who are living in this area, so Leo can meet many new friends from China and it will help him adjust quickly to his surroundings.

# Explore the competitors in St.James Town.

Let's again assume that Leo choose to rent a condo in St.James Town. So we are going to  convert the St.James Town's address to its latitude and longitude coordinates.

In [193]:
address = 'St.James Town, Toronto, ON'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

43.6694032 -79.3727041


In [194]:
search_query = 'Chinese'
radius = 2000
print(search_query + ' .... OK!')

Chinese .... OK!


In [195]:
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/search?client_id=A3OE4MF5PIMSSGP4WYSJ1IWTZCY54VSETX0XZMF0ZTPEYT3K&client_secret=IO40RBJTWRID4F2JFHWSHQSSK3VTNZJ0X11E4BBOR0BE4A0I&ll=43.6694032,-79.3727041&v=20180605&query=Chinese&radius=2000&limit=100'

In [196]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f06798f0bf45922e622e228'},
 'response': {'venues': [{'id': '58ab6f3eea29b818ab66cb3e',
    'name': 'Wok & Roast Chinese BBQ',
    'location': {'address': '349 Broadview Ave',
     'crossStreet': 'Gerrard St E',
     'lat': 43.665067,
     'lng': -79.352298,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.665067,
       'lng': -79.352298}],
     'distance': 1712,
     'postalCode': 'M4M 2H1',
     'cc': 'CA',
     'city': 'Toronto',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['349 Broadview Ave (Gerrard St E)',
      'Toronto ON M4M 2H1',
      'Canada']},
    'categories': [{'id': '4bf58dd8d48988d145941735',
      'name': 'Chinese Restaurant',
      'pluralName': 'Chinese Restaurants',
      'shortName': 'Chinese',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/asian_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1594260023',
    'hasPerk': False},
   {'id'

In [197]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# tranform venues into a dataframe
dataframe = json_normalize(venues)
dataframe.shape

  """


(20, 19)

# Explore the competitors in Church and Wellesley

In [188]:
address1 = 'Church and Wellesley, Toronto, ON'

geolocator1 = Nominatim(user_agent="foursquare_agent")
location1 = geolocator.geocode(address)
latitude1 = location.latitude
longitude1 = location.longitude
print(latitude1, longitude1)

43.6655242 -79.3838011


In [189]:
search_query = 'Chinese'
radius = 2000
print(search_query + ' .... OK!')

Chinese .... OK!


In [190]:
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/search?client_id=A3OE4MF5PIMSSGP4WYSJ1IWTZCY54VSETX0XZMF0ZTPEYT3K&client_secret=IO40RBJTWRID4F2JFHWSHQSSK3VTNZJ0X11E4BBOR0BE4A0I&ll=43.6655242,-79.3838011&v=20180605&query=Chinese&radius=2000&limit=100'

In [191]:
results1 = requests.get(url).json()
results1

{'meta': {'code': 200, 'requestId': '5f06793b1dc55400f9e08ac6'},
 'response': {'venues': [{'id': '4b622bf1f964a5200b3a2ae3',
    'name': 'Chinese Traditional Buns',
    'location': {'address': '536 Dundas St. W',
     'crossStreet': 'at Spadina Ave',
     'lat': 43.652714116168525,
     'lng': -79.39900618361557,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.652714116168525,
       'lng': -79.39900618361557}],
     'distance': 1879,
     'postalCode': 'M5T 1H3',
     'cc': 'CA',
     'neighborhood': 'Kensington Market',
     'city': 'Toronto',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['536 Dundas St. W (at Spadina Ave)',
      'Toronto ON M5T 1H3',
      'Canada']},
    'categories': [{'id': '4bf58dd8d48988d145941735',
      'name': 'Chinese Restaurant',
      'pluralName': 'Chinese Restaurants',
      'shortName': 'Chinese',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/asian_',
       'suffix': '.png'},
      'prim

In [192]:
# assign relevant part of JSON to venues
venues1 = results['response']['venues']

# tranform venues into a dataframe
dataframe1 = json_normalize(venues)
dataframe1.shape

  """


(39, 19)

# Conclusion

By looking at the location data and foursquare data we explored, we could find out that the neighbourhood where Chinese people always hand out is St.James Town and Church and Wellesley since Chinese restaurant is the No.3 most popular venues in these areas, so both of them are very suitable places for Leo to live. However, since Leo needs to develop his own hotpot business, St.James Town seems like a better choice for him because if Leo lives in this neighbourhood and open his hotpot restaurant, he will get 20 competitors in about two kilometres, but if he lives in the Church and Wellesley, he will have 39 competitors, nearly double the number that St.James Towns has. In conclusion, we will suggest Leo move to St.James Town to begin his new life in Toronto.