# Capstone Project
# The Battle of Neighborhoods

## Introduction

- Toronto is a multicultural city with a growing population (a 4.3% increase between 2011 and 2016), which makes it attractive for new investment.
- Fast-growing areas in the city may have high potential for investors and entrepreneurs.
- Willowdale East is recognized as a fast-growing area, with a 62.3% change in population between 2001 and 2016. This project analyzes business opportunities in this neighborhood.
- An investment in a new venue in Willowdale East is evaluated using Foursquare venue data.

### Problem: Which venue categories promise the best business opportunities?
### Approach: A collaborative filtering approach is applied to find the neighborhoods most similar to Willowdale East and to identify opportunities by comparing it with them.

#### Import all the libraries

In [1]:
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes

import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
from math import sqrt

from sklearn.cluster import KMeans # k-means clustering

import json # library to handle JSON files
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

print('Libraries imported.')

Libraries imported.


#### Scrape Toronto data

In [2]:
# Scrape the postal-code table from a pinned Wikipedia revision
table = pd.read_html("https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969")
toronto_data = pd.DataFrame(table[0])

# Delete the rows where Borough is not assigned
toronto_data['Borough'] = toronto_data['Borough'].replace('Not assigned', np.nan)
toronto_data = toronto_data.dropna(subset=['Borough'], axis=0)

# Drop exact duplicate rows, keeping the original row order and index.
# The original groupby-and-join over 'Postal Code' discarded its result, so it is
# omitted; the row count below suggests one row per postal code already.
toronto_data = toronto_data[['Postal Code', 'Borough', 'Neighbourhood']].drop_duplicates()
toronto_data = toronto_data.rename(columns={"Neighbourhood": "Neighborhood"})


toronto_data.shape

(103, 3)

In [3]:
toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
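
The cleaning step can be checked offline on a small hand-made frame. The sketch below is an illustration only: the rows are invented (they are not the real Wikipedia table), and `clean_postal_table` is a hypothetical helper that also shows how neighbourhoods sharing a postal code could be joined if the source table listed them on separate rows.

```python
import pandas as pd

def clean_postal_table(raw):
    """Drop unassigned boroughs and join neighbourhoods that share a postal code."""
    df = raw[raw['Borough'] != 'Not assigned']
    merged = (df.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
                .agg(', '.join)          # one comma-separated string per postal code
                .reset_index())
    return merged.rename(columns={'Neighbourhood': 'Neighborhood'})

# Invented rows for illustration only
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M5A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Regent Park', 'Harbourfront'],
})

cleaned = clean_postal_table(toy)
print(cleaned)
```

Passing `sort=False` keeps the postal codes in their order of first appearance, which matches the unsorted order shown in the notebook output.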


#### Import geospatial data

In [4]:
!wget -O geo_data.csv http://cocl.us/Geospatial_data
df_geo=pd.read_csv('geo_data.csv')
df_geo.head()

--2021-03-19 17:21:10--  http://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 52.116.121.148, 52.116.127.82
Connecting to cocl.us (cocl.us)|52.116.121.148|:80... connected.
HTTP request sent, awaiting response... 308 Permanent Redirect
Location: https://cocl.us/Geospatial_data [following]
--2021-03-19 17:21:11--  https://cocl.us/Geospatial_data
Connecting to cocl.us (cocl.us)|52.116.121.148|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2021-03-19 17:21:12--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.29.197
Connecting to ibm.box.com (ibm.box.com)|107.152.29.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2021-03-19 17:21:12--  https://ibm.box.com/public/static/9afz

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### Merge Toronto data with geospatial data

In [5]:
toronto_data = toronto_data.merge(df_geo, on='Postal Code')
toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
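
An inner merge silently drops postal codes that have no coordinates. As a hedged sketch of how that could be checked, the example below uses a left merge with pandas' `validate` and `indicator` options on a toy frame; the `M9Z` row is invented so that one code goes unmatched.

```python
import pandas as pd

# Toy inputs; 'M9Z' / 'Made-up Place' is an invented row with no coordinates
neigh = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                      'Neighborhood': ['Parkwoods', 'Victoria Village', 'Made-up Place']})
geo = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                    'Latitude': [43.753259, 43.725882],
                    'Longitude': [-79.329656, -79.315572]})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column telling where each row came from
checked = neigh.merge(geo, on='Postal Code', how='left',
                      validate='one_to_one', indicator=True)

unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every neighborhood received coordinates.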




## Use the venue explore resource from Foursquare

In [4]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
ACCESS_TOKEN = '' # your Foursquare Access Token

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 
CLIENT_SECRET:
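
Hard-coding credentials in a shared notebook risks leaking them. A minimal sketch, assuming the keys are exported as environment variables before the notebook starts; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the `mask` helper are hypothetical, not something Foursquare prescribes.

```python
import os

# Hypothetical environment variable names; fall back to empty strings when unset
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

def mask(secret, visible=4):
    """Hide all but the last few characters so a value can be checked without printing it in full."""
    if len(secret) <= visible:
        return '*' * len(secret)
    return '*' * (len(secret) - visible) + secret[-visible:]

print('CLIENT_ID: ' + mask(CLIENT_ID))
print('CLIENT_SECRET: ' + mask(CLIENT_SECRET))
```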


In [7]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng,
            v['venue']['id'],
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['id'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude',
                  'Venue ID',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude',
                  'Category ID',
                  'Venue Category']
    
    return nearby_venues
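
The function above indexes straight into the response, so a venue without categories or a response without groups would raise an exception. The sketch below shows one defensive way to read the same fields. `parse_explore_items` is a hypothetical helper, and `sample` is a hand-built dictionary that mimics only the keys the notebook reads; it is not a recorded API response, and the second venue is invented.

```python
def parse_explore_items(payload):
    """Return (venue id, venue name, category id, category name) tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or []
        if not categories:
            continue  # skip venues without a category instead of raising IndexError
        rows.append((venue.get('id'), venue.get('name'),
                     categories[0]['id'], categories[0]['name']))
    return rows

# Hand-built payload with the same nesting the notebook relies on
sample = {'response': {'groups': [{'items': [
    {'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b', 'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'id': '4bf58dd8d48988d163941735', 'name': 'Park'}]}},
    {'venue': {'id': 'made-up-id', 'name': 'Uncategorized Venue',
               'location': {'lat': 0.0, 'lng': 0.0},
               'categories': []}},
]}]}}

rows = parse_explore_items(sample)
print(rows)
```

Skipping uncategorized venues is a design choice made for this sketch; they could equally be kept under a placeholder category.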

In [8]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [9]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Category ID,Venue Category
0,Parkwoods,43.753259,-79.329656,4e8d9dcdd5fbbbb6b3003c7b,Brookbanks Park,43.751976,-79.33214,4bf58dd8d48988d163941735,Park
1,Parkwoods,43.753259,-79.329656,4cb11e2075ebb60cd1c4caad,Variety Store,43.751974,-79.333114,4bf58dd8d48988d1f9941735,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,4c633acb86b6be9a61268e34,Victoria Village Arena,43.723481,-79.315635,4bf58dd8d48988d185941735,Hockey Arena
3,Victoria Village,43.725882,-79.315572,4f3ecce6e4b0587016b6f30d,Portugril,43.725819,-79.312785,4def73e84765ae376e57713a,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,4bbe904a85fbb713420d7167,Tim Hortons,43.725517,-79.313103,4bf58dd8d48988d1e0931735,Coffee Shop


In [10]:
toronto_venues.shape

(2127, 9)

## Data preprocessing

In [11]:
venues_by_category=toronto_venues.groupby(['Neighborhood','Category ID','Venue Category']).count().sort_values(by='Neighborhood').reset_index()
venues_by_category.rename(columns={'Venue ID':'Frequency'}, inplace=True)
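
The `count()` call above fills every remaining column with the same count, and one of them is then renamed to `Frequency`. As an alternative sketch, `size()` yields a single frequency column directly. The toy rows below are invented for illustration.

```python
import pandas as pd

# Invented venues: neighborhood A has two coffee shops and one park
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Category ID': ['c1', 'c1', 'c2', 'c1'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Coffee Shop'],
    'Venue ID': ['v1', 'v2', 'v3', 'v4'],
})

# size() counts rows per group; reset_index(name=...) names the resulting column
freq = (venues.groupby(['Neighborhood', 'Category ID', 'Venue Category'])
              .size()
              .reset_index(name='Frequency'))
print(freq)
```

With this form the later `drop(columns=...)` calls for the leftover count columns would not be needed.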

In [12]:
# Venue categories present in the target neighborhood, with their frequencies
inputNeigh = venues_by_category[venues_by_category['Neighborhood']=='Willowdale, Willowdale East']
inputNeigh = inputNeigh.drop(columns=['Neighborhood','Neighborhood Latitude','Neighborhood Longitude','Venue','Venue Latitude','Venue Longitude'])
inputNeigh

Unnamed: 0,Category ID,Venue Category,Frequency
1468,52e81612bcbc57f1066b7a0c,Bubble Tea Shop,1
1469,52dea92d3cf9994f4e043dbb,Discount Store,1
1470,4bf58dd8d48988d1fd941735,Shopping Mall,1
1471,4bf58dd8d48988d1fa931735,Hotel,1
1472,4bf58dd8d48988d1e0931735,Coffee Shop,2
1473,4bf58dd8d48988d1d2941735,Sushi Restaurant,2
1474,4bf58dd8d48988d1cc941735,Steakhouse,1
1475,4bf58dd8d48988d1ca941735,Pizza Place,2
1476,4bf58dd8d48988d1c9941735,Ice Cream Shop,1
1477,4bf58dd8d48988d1c5941735,Sandwich Place,2


In [13]:
# Keep only the categories that also occur in Willowdale East
venueSubset = venues_by_category[venues_by_category['Category ID'].isin(inputNeigh['Category ID'].tolist())]
venueSubset = venueSubset.drop(columns=['Neighborhood Latitude','Neighborhood Longitude','Venue','Venue Latitude','Venue Longitude'])
venueSubset.head()

Unnamed: 0,Neighborhood,Category ID,Venue Category,Frequency
0,Agincourt,4bf58dd8d48988d121941735,Lounge,1
4,"Alderwood, Long Branch",4bf58dd8d48988d1ca941735,Pizza Place,2
5,"Alderwood, Long Branch",4bf58dd8d48988d1c5941735,Sandwich Place,1
7,"Alderwood, Long Branch",4bf58dd8d48988d1e0931735,Coffee Shop,1
11,"Bathurst Manor, Wilson Heights, Downsview North",4bf58dd8d48988d1ca941735,Pizza Place,1


In [14]:
venueSubset.shape

(449, 4)

In [15]:
# Remove the target neighborhood itself, then group the remaining rows by neighborhood
venueSubset = venueSubset[venueSubset['Neighborhood'] != 'Willowdale, Willowdale East']
venueSubsetGroup = venueSubset.groupby('Neighborhood')

In [16]:
venueSubsetGroup.head()

Unnamed: 0,Neighborhood,Category ID,Venue Category,Frequency
0,Agincourt,4bf58dd8d48988d121941735,Lounge,1
4,"Alderwood, Long Branch",4bf58dd8d48988d1ca941735,Pizza Place,2
5,"Alderwood, Long Branch",4bf58dd8d48988d1c5941735,Sandwich Place,1
7,"Alderwood, Long Branch",4bf58dd8d48988d1e0931735,Coffee Shop,1
11,"Bathurst Manor, Wilson Heights, Downsview North",4bf58dd8d48988d1ca941735,Pizza Place,1
...,...,...,...,...
1465,"Wexford, Maryvale",4bf58dd8d48988d1fd941735,Shopping Mall,1
1496,"Willowdale, Willowdale West",4bf58dd8d48988d1e0931735,Coffee Shop,1
1497,"Willowdale, Willowdale West",4bf58dd8d48988d1ca941735,Pizza Place,1
1499,"Willowdale, Willowdale West",4bf58dd8d48988d118951735,Grocery Store,1


## Similarity of Neighborhoods to Willowdale East

In [17]:
#Store the Pearson correlation in a dictionary, where the key is the neighborhood name and the value is the coefficient
pearsonCorrelationDict = {}

#Sort the input neighborhood once so its categories line up with each sorted group
inputNeigh = inputNeigh.sort_values(by='Category ID')

#For every neighborhood group in our subset
for name, group in venueSubsetGroup:
    #Sort the current neighborhood group so the values aren't mixed up later on
    group = group.sort_values(by='Category ID')
    #Get the N for the formula
    nRatings = len(group)
    #Get the input frequencies for the categories that both neighborhoods have in common
    temp_df = inputNeigh[inputNeigh['Category ID'].isin(group['Category ID'].tolist())]
    #Store them in a list to facilitate the calculations
    tempRatingList = temp_df['Frequency'].tolist()
    #Put the current neighborhood's frequencies in a list as well
    tempGroupList = group['Frequency'].tolist()
    #Calculate the Pearson correlation between the two neighborhoods, called x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is nonzero, divide; otherwise record a correlation of 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
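
The loop computes r = Sxy / sqrt(Sxx * Syy), where Sxx = sum(x^2) - (sum(x))^2 / n, and Syy and Sxy are defined analogously. As a sanity check, the sketch below applies the same sums to two invented frequency vectors and compares the result with `numpy.corrcoef`. The helper name `pearson_from_sums` is introduced only for this illustration.

```python
from math import sqrt
import numpy as np

def pearson_from_sums(x, y):
    """Pearson correlation via the same sum-of-squares form used in the loop above."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i*j for i, j in zip(x, y)) - sum(x)*sum(y) / n
    if sxx == 0 or syy == 0:
        return 0  # a constant vector has no variance, so the coefficient is undefined
    return sxy / sqrt(sxx * syy)

# Invented category frequencies for two neighborhoods over four shared categories
target = [1, 2, 2, 1]
other = [1, 3, 2, 1]

r_manual = pearson_from_sums(target, other)
r_numpy = np.corrcoef(target, other)[0, 1]
print(r_manual, r_numpy)
```

The zero branch also explains the many `0` entries in the output that follows: a neighborhood that shares a single category with Willowdale East, or whose shared categories all have the same frequency, has zero variance and is therefore assigned a similarity of 0.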

In [18]:
pearsonCorrelationDict.items()

dict_items([('Agincourt', 0), ('Alderwood, Long Branch', 0), ('Bathurst Manor, Wilson Heights, Downsview North', -0.05976143046671949), ('Bayview Village', 0), ('Bedford Park, Lawrence Manor East', 0.3779644730092279), ('Berczy Park', 0.5196746370519365), ('Birch Cliff, Cliffside West', 0), ('Brockton, Parkdale Village, Exhibition Place', 0.6123724356957944), ('Business reply mail Processing Centre, South Central Letter Processing Plant Toronto', 0), ('CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport', 0), ('Canada Post Gateway Processing Centre', 0.30151134457776363), ('Cedarbrae', 0), ('Central Bay Street', 0.2828460777067467), ('Christie', -0.7777777777777778), ('Church and Wellesley', 0.1614116742309718), ("Clarks Corners, Tam O'Shanter, Sullivan", 0.49999999999999967), ('Commerce Court, Victoria Hotel', 0.4528907231300196), ('Davisville', 0.5400617248673215), ('Davisville North', 0), ('Del Ray, Mount Dennis, Keelsdale and S

In [19]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['Neighborhood'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,Neighborhood
0,0.0,Agincourt
1,0.0,"Alderwood, Long Branch"
2,-0.059761,"Bathurst Manor, Wilson Heights, Downsview North"
3,0.0,Bayview Village
4,0.377964,"Bedford Park, Lawrence Manor East"


In [20]:
pearsonDF.shape

(73, 2)

### Most similar neighborhoods

In [21]:
#Top 20 similar neighborhoods
topNeigh=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:20]
topNeigh

Unnamed: 0,similarityIndex,Neighborhood
34,1.0,"High Park, The Junction South"
61,0.92582,Studio District
57,0.826874,St. James Town
58,0.745356,"St. James Town, Cabbagetown"
63,0.656532,"The Annex, North Midtown, Yorkville"
53,0.620621,"Richmond, Adelaide, King"
7,0.612372,"Brockton, Parkdale Village, Exhibition Place"
52,0.581914,"Regent Park, Harbourfront"
17,0.540062,Davisville
39,0.522958,"Kensington Market, Chinatown, Grange Park"


In [22]:
topNeighFreq = topNeigh.merge(venues_by_category, on='Neighborhood', how='inner')
topNeighFreq.head()

Unnamed: 0,similarityIndex,Neighborhood,Category ID,Venue Category,Neighborhood Latitude,Neighborhood Longitude,Frequency,Venue,Venue Latitude,Venue Longitude
0,1.0,"High Park, The Junction South",52dea92d3cf9994f4e043dbb,Discount Store,1,1,1,1,1,1
1,1.0,"High Park, The Junction South",4bf58dd8d48988d16e941735,Fast Food Restaurant,1,1,1,1,1,1
2,1.0,"High Park, The Junction South",4d4ae6fc7a7b7dea34424761,Fried Chicken Joint,1,1,1,1,1,1
3,1.0,"High Park, The Junction South",4bf58dd8d48988d1f8941735,Furniture / Home Store,1,1,1,1,1,1
4,1.0,"High Park, The Junction South",4bf58dd8d48988d1f7941735,Flea Market,1,1,1,1,1,1


In [23]:
topNeighFreq.shape

(640, 10)

### Most common venue categories in the most similar Neighborhoods

In [24]:
#Sum the similarity index of the top neighborhoods for each venue category
tempTopNeighFreq = topNeighFreq.groupby(['Category ID','Venue Category'])[['similarityIndex']].sum()
tempTopNeighFreq.columns = ['sum_similarityIndex']
tempTopNeighFreq.sort_values(by='sum_similarityIndex', ascending=False, inplace=True)
tempTopNeighFreq.head(10)

Category ID,Venue Category,sum_similarityIndex
4bf58dd8d48988d16d941735,Café,10.866809
4bf58dd8d48988d1e0931735,Coffee Shop,9.366809
4bf58dd8d48988d16a941735,Bakery,8.670215
4bf58dd8d48988d110941735,Italian Restaurant,7.965109
4bf58dd8d48988d163941735,Park,7.715135
4bf58dd8d48988d1c4941735,Restaurant,7.261499
4bf58dd8d48988d111941735,Japanese Restaurant,6.647363
4bf58dd8d48988d149941735,Thai Restaurant,6.574352
4bf58dd8d48988d155941735,Gastropub,5.990489
4bf58dd8d48988d1ca941735,Pizza Place,5.981227


In [25]:
tempTopNeighFreq.shape

(183, 1)
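
The scoring step can be followed on a toy case. In the sketch below, two invented neighborhoods `N1` and `N2` have similarity indices 1.0 and 0.5, and each category's score is the sum of the similarity indices of the neighborhoods that contain it.

```python
import pandas as pd

# Invented similarity indices and category listings
top = pd.DataFrame({'Neighborhood': ['N1', 'N2'],
                    'similarityIndex': [1.0, 0.5]})
cats = pd.DataFrame({'Neighborhood': ['N1', 'N1', 'N2'],
                     'Venue Category': ['Café', 'Bakery', 'Café']})

# One row per (neighborhood, category); each row carries its neighborhood's similarity
scores = (top.merge(cats, on='Neighborhood')
             .groupby('Venue Category')['similarityIndex']
             .sum()
             .sort_values(ascending=False))
print(scores)
```

As in the notebook, the number of venues per category is not used as a weight here: a neighborhood contributes its similarity index once for every category it contains, regardless of how many venues of that category it has.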

In [26]:
tempTopNeighFreq.reset_index(inplace=True)

In [27]:
tempTopNeighFreq.columns

Index(['Category ID', 'Venue Category', 'sum_similarityIndex'], dtype='object')

### Exclude all the venue categories which already exist in Willowdale East

In [28]:
#Keep only the categories that Willowdale East does not have yet
df_Recommendations=tempTopNeighFreq[~tempTopNeighFreq['Category ID'].isin(inputNeigh['Category ID'].tolist())]

In [29]:
df_Recommendations.shape

(157, 3)

In [30]:
df_Recommendations.head(20)

Unnamed: 0,Category ID,Venue Category,sum_similarityIndex
2,4bf58dd8d48988d16a941735,Bakery,8.670215
3,4bf58dd8d48988d110941735,Italian Restaurant,7.965109
4,4bf58dd8d48988d163941735,Park,7.715135
7,4bf58dd8d48988d149941735,Thai Restaurant,6.574352
8,4bf58dd8d48988d155941735,Gastropub,5.990489
10,4bf58dd8d48988d143941735,Breakfast Spot,5.556232
11,4bf58dd8d48988d147941735,Diner,5.50084
12,4bf58dd8d48988d10f951735,Pharmacy,5.230383
14,4bf58dd8d48988d1e2931735,Art Gallery,4.920902
15,4bf58dd8d48988d1ce941735,Seafood Restaurant,4.804869


### Exclude from the recommendations the categories that already exist in the closest neighborhoods

In [31]:
#Collect the venue categories of the three closest neighborhoods, in this order
closest = ['Willowdale, Willowdale West', 'Willowdale, Newtonbrook', 'Bayview Village']
nextNeigh = pd.concat([venues_by_category[venues_by_category['Neighborhood']==n] for n in closest])

nextNeigh

Unnamed: 0,Neighborhood,Category ID,Venue Category,Neighborhood Latitude,Neighborhood Longitude,Frequency,Venue,Venue Latitude,Venue Longitude
1495,"Willowdale, Willowdale West",52f2ab2ebcbc57f1066b8b46,Supermarket,1,1,1,1,1,1
1496,"Willowdale, Willowdale West",4bf58dd8d48988d1e0931735,Coffee Shop,1,1,1,1,1,1
1497,"Willowdale, Willowdale West",4bf58dd8d48988d1ca941735,Pizza Place,1,1,1,1,1,1
1498,"Willowdale, Willowdale West",4bf58dd8d48988d10f951735,Pharmacy,1,1,1,1,1,1
1499,"Willowdale, Willowdale West",4bf58dd8d48988d118951735,Grocery Store,1,1,1,1,1,1
1467,"Willowdale, Newtonbrook",4bf58dd8d48988d163941735,Park,1,1,1,1,1,1
32,Bayview Village,4bf58dd8d48988d16d941735,Café,1,1,1,1,1,1
33,Bayview Village,4bf58dd8d48988d145941735,Chinese Restaurant,1,1,1,1,1,1
34,Bayview Village,4bf58dd8d48988d111941735,Japanese Restaurant,1,1,1,1,1,1
35,Bayview Village,4bf58dd8d48988d10a951735,Bank,1,1,1,1,1,1


In [32]:
df_Recommendations=df_Recommendations[~df_Recommendations['Category ID'].isin(nextNeigh['Category ID'].tolist())]
df_Recommendations.shape

(153, 3)

## Results

In [33]:
df_Recommendations.head(20)

Unnamed: 0,Category ID,Venue Category,sum_similarityIndex
2,4bf58dd8d48988d16a941735,Bakery,8.670215
3,4bf58dd8d48988d110941735,Italian Restaurant,7.965109
7,4bf58dd8d48988d149941735,Thai Restaurant,6.574352
8,4bf58dd8d48988d155941735,Gastropub,5.990489
10,4bf58dd8d48988d143941735,Breakfast Spot,5.556232
11,4bf58dd8d48988d147941735,Diner,5.50084
14,4bf58dd8d48988d1e2931735,Art Gallery,4.920902
15,4bf58dd8d48988d1ce941735,Seafood Restaurant,4.804869
16,4bf58dd8d48988d114951735,Bookstore,4.712067
17,4bf58dd8d48988d116941735,Bar,4.657645
