<h1 align=center><font size = 5>How to Move to a New City? - A Data Science Project using Location Intelligence </font></h1>
<h1 align=center><font size = 3>Aakash Vasudevan</font></h1>


## 1. Introduction
Moving cities is a major life change for most people. Leaving behind their life, community and relationships built over many years can seem overwhelming. The inertia is compounded by the uncertainty in the new environment and consequently, the time investment required for planning and acclimatization. 

Given the current neighborhood of a person looking to relocate, can we find the most “similar” neighborhoods in the new city? This project aims to employ location intelligence and clustering to answer this question. After all, a neighborhood with the same kinds of restaurants, amenities, shops and parks would have a “vibe” similar to the one we are used to in our home city. The more familiarity we can manufacture, the less daunting the move becomes.

As a case study for the project, we will consider Alex, a recent engineering graduate from the University of Alberta living in the Strathcona neighborhood in Edmonton, Alberta. Alex has accepted a job offer from a firm based in Calgary, Alberta and is looking to relocate to a Calgary neighborhood similar to Strathcona.



## 2. Data and Dataset Sources

The idea is to represent neighborhoods in terms of their defining features in a high-dimensional feature space and then cluster data points that are close together. Choosing the features that capture the essence of a neighborhood is an intricate and subtle problem in itself. For this exercise, we will broadly classify the features into those that represent the "people" in the neighborhood and those that represent the "places" in the neighborhood.

Here is a flowchart showing the breakdown of the feature set:

<img src = "Data_Flowchart.png" />



It must be noted that the choice of features to represent neighborhoods is subjective and must be tailored to the end user (Alex in our case). The above features are by no means *the* unique combination to represent a neighborhood but merely provide a good starting point to illustrate the methodology used in this project. 

Datasets for features under the "People" category can be obtained through a combination of Wikipedia and census data available online. Data for the "Places" category can be retrieved from the Foursquare database through their Places API. Finally, the latitude and longitude coordinates of the candidate neighborhoods can be obtained using the Python Geopy library.

### 2.1 Features under "People" Category

The features under the "People" category are Population Density, Median Age and Median Household Income.

The Population Density by neighborhood can be scraped from the Wikipedia page: https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Calgary

In [337]:
# Import Beautiful Soup package for web scraping
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Import pandas to store all data in dataframes
import pandas as pd

# Define the wikipedia url
url = 'https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Calgary'
html = urlopen(url,timeout = 10)

# Create Beautiful Soup object
soup = BeautifulSoup(html, 'html.parser')

# Find the table using class handle
table = soup.find("table",class_="wikitable sortable")

# Populate the neighborhood name and population density from the table one row at a time
c_name = []
c_pop_density = []

for row in table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 12:
        c_name.append(str(cells[0].find(string = True)))
        c_pop_density.append(str(cells[11].find(string = True)))

# Load a dictionary with the neighborhood names and population densities
d_1 = {'Neighborhood' : c_name, 'Population Density' : c_pop_density}

# Transfer the dictionary to a Data Frame
df_1 = pd.DataFrame(d_1)

# Display the Dataframe
df_1.head()


Unnamed: 0,Neighborhood,Population Density
0,Abbeydale,3480.6
1,Acadia,2744.9
2,Albert Park/Radisson Heights,2493.6
3,Altadore,3143.4
4,Alyth/Bonnybrook,4.2


The next data we need are the median income and age for each neighborhood in Calgary. Conveniently, this [website](https://great-news.ca/demographics/) publishes both. We will follow the same procedure as above to scrape the table and store it in a data frame.

In [336]:
# Define the url
url_1 = 'https://great-news.ca/demographics/'
html_1 = urlopen(url_1,timeout = 10)

# Create Beautiful Soup object
soup_1 = BeautifulSoup(html_1, 'html.parser')


# Find the table using the class handle
table_1 = soup_1.find("table",class_="tablepress tablepress-id-121")

# Populate the neighborhood name, median income and age from the table one row at a time 
comm = []
c_median_age = []
c_median_income = []

for row in table_1.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 8:
        comm.append(str(cells[0].find(string = True)))
        c_median_age.append(str(cells[3].find(string = True)))
        c_median_income.append(str(cells[2].find(string = True)))

# Load a dictionary with the neighborhood names, median age and income
d_2 = {'Neighborhood' : comm, 'Median Age' : c_median_age, 'Median Income' : c_median_income}

# Transfer to a dataframe and display
df_2 = pd.DataFrame(d_2)
df_2.head()

Unnamed: 0,Neighborhood,Median Age,Median Income
0,Abbeydale,34,"$55,345"
1,Acadia,42,"$46,089"
2,Albert Park / Radisson Heights,37,"$38,019"
3,Altadore,37,"$53,786"
4,Applewood Park,33,"$65,724"


We can now combine the data frames. 

In [152]:
print('Population Density Data Frame has {} rows'.format(df_1.shape[0]))
print('Median age and income Data Frame has {} rows'.format(df_2.shape[0]))

Population Density Data Frame has 257 rows
Median age and income Data Frame has 180 rows


We see that the two data frames don't have the same number of neighborhoods. However, as we will see later, we still have enough neighborhood samples to cover the entire city of Calgary. Therefore, we will proceed with merging the two data frames with an inner join, which drops the neighborhoods that do not appear in both.

In [338]:
df_combined = pd.merge(df_1, df_2, how = 'inner', on = ['Neighborhood'])
df_combined

Unnamed: 0,Neighborhood,Population Density,Median Age,Median Income
0,Abbeydale,3480.6,34,"$55,345"
1,Acadia,2744.9,42,"$46,089"
2,Altadore,3143.4,37,"$53,786"
3,Applewood Park,4061.3,33,"$65,724"
4,Arbour Lake,2462.7,41,"$70,590"
...,...,...,...,...
170,Willow Park,1537.9,45,"$63,588"
171,Windsor Park,3173.8,37,"$39,425"
172,Winston Heights/Mountview,1297,42,"$41,065"
173,Woodbine,2853.4,42,"$83,844"
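
Some of the rows lost in the inner join may be lost to spelling differences rather than to genuinely missing neighborhoods: the two sources write "Albert Park/Radisson Heights" and "Albert Park / Radisson Heights", and that neighborhood does not appear in the merged head above. The following is a minimal sketch, on a few toy rows copied from the tables above, of normalizing the names and auditing the join with `indicator=True`. The helper `normalize_name` and the frames `left` and `right` are illustrative assumptions, not part of the original notebook.

```python
import re
import pandas as pd

def normalize_name(name: str) -> str:
    """Hypothetical helper: remove spaces around '/' and collapse repeated whitespace."""
    name = re.sub(r"\s*/\s*", "/", name.strip())
    return re.sub(r"\s+", " ", name)

# Toy rows taken from the two scraped tables shown above
left = pd.DataFrame({
    "Neighborhood": ["Abbeydale", "Albert Park/Radisson Heights", "Alyth/Bonnybrook"],
    "Population Density": ["3480.6", "2493.6", "4.2"],
})
right = pd.DataFrame({
    "Neighborhood": ["Abbeydale", "Albert Park / Radisson Heights", "Applewood Park"],
    "Median Age": ["34", "37", "33"],
})

for frame in (left, right):
    frame["Neighborhood"] = frame["Neighborhood"].map(normalize_name)

# An outer merge with indicator=True marks each row as 'both', 'left_only' or 'right_only'
audit = pd.merge(left, right, how="outer", on="Neighborhood", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", ["Neighborhood", "_merge"]]
matched = audit.loc[audit["_merge"] == "both", "Neighborhood"].tolist()
print(unmatched)
```

Inspecting `unmatched` before switching to `how = 'inner'` shows which neighborhoods are about to be dropped and whether a simple spelling fix would recover them.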


### 2.2 Latitude and Longitude Coordinates of all Neighborhoods

We will use the **geopy** library to pass each neighborhood to the **Nominatim** object and obtain the corresponding latitude and longitude coordinates.

In [154]:
# Import the library
from geopy.geocoders import Nominatim


addresses = df_combined['Neighborhood'] # Load all neighborhoods
latitude = [] # Initialize
longitude = [] # Initialize

# Initialize object
geolocator = Nominatim(user_agent="ca_explorer")

# Obtain latitude and longitude coordinates
for address in addresses:
    try:
        location = geolocator.geocode(address+', Calgary AB')
        latitude.append(location.latitude)
        longitude.append(location.longitude)
    except:
        print('{} is not available'.format(address+', Calgary AB'))


CFB Currie, Calgary AB is not available


We see that the geolocator couldn't return the coordinates for the neighborhood *CFB Currie*. Let's take a closer look at this neighborhood.

In [132]:
df_combined[df_combined['Neighborhood'] == 'CFB Currie']

Unnamed: 0,Neighborhood,Population Density,Median Age,Median Income
25,CFB Currie,156.4,52,"$81,542"


We see that this neighborhood is at index 25. Since this is only a single data point, we can either drop it or manually enter the latitude and longitude coordinates from a Google search. We will elect to enter the coordinates from Google.

In [155]:
# The coordinates for CFB Currie, inserted at position 25 so the lists stay aligned with df_combined
latitude.insert(25,51.01808963769229)
longitude.insert(25,-114.1245750909354)
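
Patching a list at a hand-counted position is fragile, because every skipped neighborhood shifts all later coordinates by one. A minimal alternative sketch is to append a placeholder whenever a lookup fails, so the lists always line up with the input order, and then fill the gaps by name. `geocode_all` is a hypothetical helper, and `fake_geocode` is an offline stand-in for `geolocator.geocode`, written under the assumption that the real call returns `None` when nothing is found.

```python
from types import SimpleNamespace

def geocode_all(names, geocode, suffix=", Calgary AB"):
    """Return (latitudes, longitudes, missing) with exactly one entry per name.

    Names that cannot be geocoded get None placeholders, so both
    coordinate lists stay aligned with the input order.
    """
    latitudes, longitudes, missing = [], [], []
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            latitudes.append(None)
            longitudes.append(None)
            missing.append(name)
        else:
            latitudes.append(location.latitude)
            longitudes.append(location.longitude)
    return latitudes, longitudes, missing

# Offline stand-in for geolocator.geocode, using coordinates shown in this notebook
known = {
    "Abbeydale, Calgary AB": (51.058836, -113.929413),
    "Acadia, Calgary AB": (50.968655, -114.055587),
}

def fake_geocode(query):
    if query not in known:
        return None
    lat, lng = known[query]
    return SimpleNamespace(latitude=lat, longitude=lng)

names = ["Abbeydale", "CFB Currie", "Acadia"]
lats, lngs, missing = geocode_all(names, fake_geocode)

# Fill the gaps by name instead of by position
manual = {"CFB Currie": (51.01808963769229, -114.1245750909354)}
for name in missing:
    i = names.index(name)
    lats[i], lngs[i] = manual[name]
```

With this layout the coordinate lists can be assigned to the data frame directly, and `missing` documents which entries were filled by hand.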

Finally, let us add the latitude and longitude coordinates to the data frame and clean up the columns.

In [339]:
# Add the Latitude and Longitude coordinates to the data frame
df_combined['Latitude'] = latitude
df_combined['Longitude'] = longitude

# Re-arrange columns
cols = ['Neighborhood' , 'Latitude' , 'Longitude' , 'Median Age' , 'Median Income' , 'Population Density']
df_combined = df_combined[cols]

df_combined.head()


Unnamed: 0,Neighborhood,Latitude,Longitude,Median Age,Median Income,Population Density
0,Abbeydale,51.058836,-113.929413,34,"$55,345",3480.6
1,Acadia,50.968655,-114.055587,42,"$46,089",2744.9
2,Altadore,51.015104,-114.100756,37,"$53,786",3143.4
3,Applewood Park,51.044658,-113.928931,33,"$65,724",4061.3
4,Arbour Lake,51.136786,-114.202355,41,"$70,590",2462.7


It would also help to visualize the neighborhoods on a map as a sanity check. We can use the **folium** library to create the map.

In [158]:
import folium # map rendering library
calgary_latitude = 51.0447
calgary_longitude = -114.0719

# create map of Calgary using latitude and longitude values
map_calgary = folium.Map(location=[calgary_latitude, calgary_longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(df_combined['Latitude'], df_combined['Longitude'], df_combined['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_calgary)  
    
map_calgary

As we can see, we have a reasonably good spread of neighborhoods throughout the city with a few odd sparse and dense locations. Depending on the use case, it might be necessary to revisit the list of neighborhoods that were dropped when we merged the two mismatching data frames earlier. However, for our purposes, the above samples are sufficient. 

### 2.3 Features under "Places" Category

As mentioned before, all the features under the "Places" category can be obtained through the **Foursquare Places API**. We will be using the *venues/explore* endpoint to obtain a JSON response listing the venues, and their categories, in the vicinity of each neighborhood.

In [164]:
# Import all libraries
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

In [161]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (redacted)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret (redacted)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET:YOUR_CLIENT_SECRET


In [162]:
# Function to return a Data Frame of all venues retrieved from the Foursquare API near a given latitude and longitude location. Radius is set to 500m by default.

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
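

The list comprehension in `getNearbyVenues` assumes that every response contains `groups[0]['items']` and that every venue has at least one category. A minimal sketch of a more defensive parser follows. It runs on hand-written dictionaries that only mimic the fields the function reads; they are not real Foursquare responses, and `parse_items` is a hypothetical helper.

```python
def parse_items(payload, name, lat, lng):
    """Turn one venues/explore-style payload into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories", [])
        if not categories:
            continue  # skip venues without a category instead of raising IndexError
        location = venue.get("location", {})
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"),
                     categories[0]["name"]))
    return rows

# Hand-written payloads for illustration only
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Subway",
               "location": {"lat": 51.059239, "lng": -113.934423},
               "categories": [{"name": "Sandwich Place"}]}},
    {"venue": {"name": "Uncategorized place",
               "location": {"lat": 51.0, "lng": -113.9},
               "categories": []}},
]}]}}
empty = {"response": {"groups": []}}

rows = parse_items(sample, "Abbeydale", 51.058836, -113.929413)
no_rows = parse_items(empty, "Abbeydale", 51.058836, -113.929413)
```

An empty result then shows up as an empty list that the caller can log, rather than as an exception or a silently missing neighborhood.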

In [165]:
# Pass all neighborhoods in Calgary to the getNearbyVenues function and store the resulting dataframe containing all the venues within 500m of the neighborhood in Calgary_venues
Calgary_venues = getNearbyVenues(df_combined['Neighborhood'] , df_combined['Latitude'] , df_combined['Longitude'])

Abbeydale
Acadia
Altadore
Applewood Park
Arbour Lake
Aspen Woods
Auburn Bay
Banff Trail
Bankview
Bayview
Beddington Heights
Bel-Aire
Beltline
Bonavista Downs
Bowness
Braeside
Brentwood
Bridgeland/Riverside
Bridlewood
Britannia
Cambrian Heights
Canyon Meadows
Capitol Hill
Castleridge
Cedarbrae
CFB Currie
Chaparral
Charleswood
Chinook Park
Christie Park
Citadel
Cliff Bungalow
Coach Hill
Collingwood
Copperfield
Coral Springs
Cougar Ridge
Country Hills
Country Hills Village
Coventry Hills
Cranston
Crescent Heights
Crestmont
Dalhousie
Deer Ridge
Deer Run
Diamond Cove
Discovery Ridge
Dover
Eagle Ridge
Eau Claire
Edgemont
Elbow Park
Elboya
Erin Woods
Erlton
Evanston
Evergreen
Fairview
Falconridge
Forest Heights
Forest Lawn
Glamorgan
Glenbrook
Glendale
Greenview
Hamptons
Harvest Hills
Hawkwood
Haysboro
Hidden Valley
Highland Park
Highwood
Hillhurst
Hounsfield Heights/Briar Hill
Huntington Hills
Inglewood
Kelvin Grove
Killarney/Glengarry
Kincora
Kingsland
Lake Bonavista
Lakeview
Legacy
Lincoln 

In [166]:
Calgary_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Abbeydale,51.058836,-113.929413,Subway,51.059239,-113.934423,Sandwich Place
1,Abbeydale,51.058836,-113.929413,Mac's,51.059376,-113.934425,Convenience Store
2,Abbeydale,51.058836,-113.929413,roadside pub,51.059277,-113.934529,Wings Joint
3,Acadia,50.968655,-114.055587,Bow Valley Insurance,50.967936,-114.051084,Insurance Office
4,Acadia,50.968655,-114.055587,Highwest Electric Ltd,50.965847,-114.057257,Construction & Landscaping
...,...,...,...,...,...,...,...
1047,Winston Heights/Mountview,51.072303,-114.047588,Mount View School Age and Family Care Center,51.069705,-114.051977,College Classroom
1048,Woodlands,50.942435,-114.109359,3 Crowns,50.940765,-114.109430,Pub
1049,Woodlands,50.942435,-114.109359,Russian Store,50.941063,-114.109452,Food & Drink Shop
1050,Woodlands,50.942435,-114.109359,Woodpark Liquor,50.941202,-114.109502,Liquor Store


In [167]:
print('There are {} unique categories.'.format(len(Calgary_venues['Venue Category'].unique())))

There are 196 unique categories.


The results from the Foursquare API have been stored in the *Calgary_venues* dataframe. The data frame contains one row per venue, with the venue's category and the neighborhood it was retrieved for. There are a total of 196 unique categories returned from the Foursquare API.

We will now encode the venue categories using one-hot encoding. The result is a new dataframe with one row per venue and one column per category, holding a "1" in the column of that venue's category and "0" everywhere else.

In [168]:
# one hot encoding
Calgary_onehot = pd.get_dummies(Calgary_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Calgary_onehot['Neighborhood'] = Calgary_venues['Neighborhood']

# move neighborhood column to the first column
cols = Calgary_onehot.columns.tolist()
old_index = cols.index('Neighborhood')
cols.insert(0,cols.pop(old_index))

Calgary_onehot = Calgary_onehot[cols]

Calgary_onehot.shape

(1052, 196)

In [169]:
Calgary_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bank,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Abbeydale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Abbeydale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Abbeydale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Acadia,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Acadia,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [171]:
# Group by neighborhood and average
Calgary_grouped = Calgary_onehot.groupby('Neighborhood').mean().reset_index()
Calgary_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bank,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Abbeydale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0
1,Acadia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Altadore,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Applewood Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Arbour Lake,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The above data frame can be interpreted as the relative frequency of each venue category within a given neighborhood. For example, "Wings Joint" accounts for one of the three venues retrieved near Abbeydale, hence the value 0.333333.
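
The interpretation can be checked on a toy subset. The sketch below uses the three Abbeydale venues and the first two Acadia venues from the *Calgary_venues* output above (a subset chosen for illustration, not the full data) and shows that averaging the 0/1 indicator columns per neighborhood yields each category's share of that neighborhood's venues.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["Abbeydale", "Abbeydale", "Abbeydale", "Acadia", "Acadia"],
    "Venue Category": ["Sandwich Place", "Convenience Store", "Wings Joint",
                       "Insurance Office", "Construction & Landscaping"],
})

# One indicator column per category, one row per venue
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of 0/1 indicators is the share of a neighborhood's venues in each category
shares = onehot.groupby("Neighborhood").mean()
print(shares.loc["Abbeydale", "Wings Joint"])
```

Each row of `shares` sums to 1, which is why these columns are comparable across neighborhoods with very different venue counts.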

Let us transform this data frame to be a little more intuitive. Rather than the relative occurrence, we will list the five most common venue categories in proximity to each neighborhood.

In [174]:
# Function to return a set number of most common venues near a neighborhood
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [175]:
import numpy as np
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Calgary_grouped['Neighborhood']

for ind in np.arange(Calgary_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Calgary_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.shape


(159, 6)

In [176]:
neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Abbeydale,Wings Joint,Convenience Store,Sandwich Place,Yoga Studio,Food Court
1,Acadia,Insurance Office,Construction & Landscaping,Yoga Studio,Food Service,Golf Course
2,Altadore,Dog Run,Coffee Shop,Massage Studio,Greek Restaurant,Gourmet Shop
3,Applewood Park,Park,Liquor Store,Home Service,Yoga Studio,Food Service
4,Arbour Lake,Bus Station,Moving Target,Grocery Store,Lake,Residential Building (Apartment / Condo)


Comparing the shape of the above dataframe with our original data frame at the end of section 2.2, we note that 16 neighborhoods are missing. The cause is how *getNearbyVenues* handles an empty result set: the API did not return any venues for 16 neighborhoods, so these neighborhoods contributed no rows to *Calgary_venues* and consequently never appear in the output of the *groupby.mean* aggregation.

To ensure shape compatibility for implementing the clustering algorithm later on, we will drop these 16 neighborhoods from our original data frame.

In [192]:
# Drop any neighborhoods that did not retrieve any venues from the Foursquare API
ex = np.setxor1d(neighborhoods_venues_sorted['Neighborhood'] , df_combined['Neighborhood'])
i, = np.nonzero(np.isin(df_combined['Neighborhood'], ex))

df_combined.drop(i,axis = 0,inplace = True)
df_combined = df_combined.reset_index(drop = True) # reset_index returns a new frame, so assign it back
df_combined.shape # Ensure the no. of rows of the dataframe matches the venues dataframe

(159, 6)
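
Dropping is one option. An alternative, sketched below under the assumption that "no venues returned" can reasonably be encoded as a share of zero in every category, is to reindex the grouped frame against the full neighborhood list so that the missing neighborhoods come back as all-zero rows. The names `A`, `B`, `C` and the two category columns are made up for illustration.

```python
import pandas as pd

all_neighborhoods = ["A", "B", "C"]  # hypothetical names; "B" returned no venues
grouped = pd.DataFrame(
    {"Neighborhood": ["A", "C"], "Bar": [0.5, 0.0], "Park": [0.5, 1.0]}
)

completed = (
    grouped.set_index("Neighborhood")
    .reindex(all_neighborhoods, fill_value=0.0)  # add missing neighborhoods as all-zero rows
    .rename_axis("Neighborhood")
    .reset_index()
)
missing = sorted(set(all_neighborhoods) - set(grouped["Neighborhood"]))
```

Whether an all-zero row is a fair description of such a neighborhood is a modelling decision: it treats "Foursquare knows of nothing here" as information rather than as missing data.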

### 2.4 Final Dataset

In this section, we will compile the final data set that will be used in the learning algorithm. The data set will contain all the neighborhoods and features that represent the "People" and "Places" categories. 

In [374]:
Calgary_Neighborhood_Features = pd.merge(df_combined, Calgary_grouped, how = 'inner', on = 'Neighborhood')
Calgary_Neighborhood_Features.shape

(159, 201)

In [375]:
### Change 'Median Age', 'Median Income' and 'Population Density' columns to float type

Calgary_Neighborhood_Features['Median Age'] = Calgary_Neighborhood_Features['Median Age'].astype('float')

Temp = Calgary_Neighborhood_Features['Population Density'].replace(',','',regex = True)
Temp = Temp.astype('float')
Calgary_Neighborhood_Features['Population Density'] = Temp

Temp = Calgary_Neighborhood_Features['Median Income'].str.replace('$','',regex = False) # '$' is a regex anchor, so match it literally
Temp = Temp.str.replace(',','',regex = False)
Temp = Temp.astype('float')
Calgary_Neighborhood_Features['Median Income'] = Temp

Calgary_Neighborhood_Features.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Median Age,Median Income,Population Density,Accessories Store,American Restaurant,Art Gallery,Arts & Crafts Store,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Abbeydale,51.058836,-113.929413,34.0,55345.0,3480.6,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0
1,Acadia,50.968655,-114.055587,42.0,46089.0,2744.9,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Altadore,51.015104,-114.100756,37.0,53786.0,3143.4,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Applewood Park,51.044658,-113.928931,33.0,65724.0,4061.3,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Arbour Lake,51.136786,-114.202355,41.0,70590.0,2462.7,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Returning to our case study, since we are intending to employ a clustering algorithm to determine "similar" neighborhoods to Alex's current neighborhood, we need to add "Strathcona" as an extra data point to the above dataset. 

In [206]:
# Obtain the Median Age, Median Income and Population Density for Strathcona
Median_Income = 68403
Median_Age = 35
Population_Density = 5722.3

#Latitude and Longitude Coordinates
Str_lat = 53.522
Str_lng = -113.492


In [217]:
# Obtain the venues 
venues_list = []
radius = 500
# create the API request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    Str_lat,
    Str_lng,
    radius,
    LIMIT)

# make the GET request
results = requests.get(url).json()["response"]['groups'][0]['items']

# return only relevant information for each nearby venue
venues_list.append([(
    'Strathcona',
    Str_lat,
    Str_lng,
    v['venue']['name'],
    v['venue']['location']['lat'],
    v['venue']['location']['lng'],
    v['venue']['categories'][0]['name']) for v in results])

    

In [259]:
Str_nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
Str_nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

# One hot encoding by the Venue Category
Strathcona_onehot = pd.get_dummies(Str_nearby_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Strathcona_onehot['Neighborhood'] = Str_nearby_venues['Neighborhood']

# move neighborhood column to the first column
cols = Strathcona_onehot.columns.tolist()
old_index = cols.index('Neighborhood')
cols.insert(0,cols.pop(old_index))

Strathcona_onehot = Strathcona_onehot[cols]

# Group by and average
Strathcona_grouped = Strathcona_onehot.groupby('Neighborhood').mean().reset_index()

# Insert the "People" and "Location" category features
Strathcona_grouped['Latitude'] = Str_lat
Strathcona_grouped['Longitude'] = Str_lng
Strathcona_grouped['Median Age'] = Median_Age
Strathcona_grouped['Median Income'] = Median_Income
Strathcona_grouped['Population Density'] = Population_Density

Strathcona_grouped

Unnamed: 0,Neighborhood,Bar,Breakfast Spot,Café,Diner,Farmers Market,General Entertainment,Grocery Store,Indian Restaurant,Japanese Restaurant,...,Pizza Place,Public Art,Theater,Train Station,Vietnamese Restaurant,Latitude,Longitude,Median Age,Median Income,Population Density
0,Strathcona,0.041667,0.041667,0.125,0.041667,0.041667,0.041667,0.083333,0.041667,0.083333,...,0.041667,0.041667,0.083333,0.041667,0.041667,53.522,-113.492,35,68403,5722.3


In [396]:
# Append the Strathcona dataframe to the Calgary Neighborhood data frame
Final_Dataset = pd.concat([Calgary_Neighborhood_Features, Strathcona_grouped], ignore_index = True)

# Replace NaN with zero
Final_Dataset.fillna(0,inplace=True)

Final_Dataset

Unnamed: 0,Neighborhood,Latitude,Longitude,Median Age,Median Income,Population Density,Accessories Store,American Restaurant,Art Gallery,Arts & Crafts Store,...,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Farmers Market,General Entertainment,Jazz Club,Public Art
0,Abbeydale,51.058836,-113.929413,34.0,55345.0,3480.6,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.333333,0.0,0.0,0.000000,0.000000,0.000000,0.000000
1,Acadia,50.968655,-114.055587,42.0,46089.0,2744.9,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000
2,Altadore,51.015104,-114.100756,37.0,53786.0,3143.4,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000
3,Applewood Park,51.044658,-113.928931,33.0,65724.0,4061.3,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000
4,Arbour Lake,51.136786,-114.202355,41.0,70590.0,2462.7,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155,Willow Park,50.960293,-114.054645,45.0,63588.0,1537.9,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000
156,Windsor Park,51.006165,-114.076187,37.0,39425.0,3173.8,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000
157,Winston Heights/Mountview,51.072303,-114.047588,42.0,41065.0,1297.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000
158,Woodlands,50.942435,-114.109359,40.0,71234.0,2214.6,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000


Finally, we see that the Median Age, Median Income and Population Density features are on a much larger scale than the venue features, which all lie between 0 and 1. This can distort the K-means clusters, because features with large variance dominate the Euclidean distance between data points. To overcome this, we will standardize every column to zero mean and unit variance.

In [399]:
from sklearn.preprocessing import StandardScaler


X = Final_Dataset.values[:,3:]
X = np.nan_to_num(X)
cluster_dataset = StandardScaler().fit_transform(X)
cluster_dataset

array([[-0.88641374, -0.50623409,  0.56080048, ..., -0.07930516,
        -0.07930516, -0.07930516],
       [ 0.47894076, -0.85267489,  0.02377804, ..., -0.07930516,
        -0.07930516, -0.07930516],
       [-0.3744058 , -0.56458556,  0.31466216, ..., -0.07930516,
        -0.07930516, -0.07930516],
       ...,
       [ 0.47894076, -1.04071708, -1.03311307, ..., -0.07930516,
        -0.07930516, -0.07930516],
       [ 0.13760213,  0.08847181, -0.36331316, ..., -0.07930516,
        -0.07930516, -0.07930516],
       [-0.71574443, -0.01748907,  2.19712398, ..., 12.60952021,
        12.60952021, 12.60952021]])
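
To see why the scaling matters, here is a small pure-Python sketch using two columns (Median Income and the "Wings Joint" share) for three rows of the table above. It standardizes each column with the population standard deviation, which is my understanding of the default behavior of `StandardScaler`, and compares the Euclidean distance between Abbeydale and Altadore before and after. The helper `zscore_columns` is hypothetical.

```python
import math
from statistics import mean, pstdev

# (Median Income, "Wings Joint" share) for three rows of the table above
rows = {
    "Abbeydale": (55345.0, 0.333333),
    "Acadia": (46089.0, 0.0),
    "Altadore": (53786.0, 0.0),
}

def zscore_columns(table):
    """Standardize each column to zero mean and unit (population) variance."""
    columns = list(zip(*table.values()))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return {
        key: tuple((x - m) / s for x, (m, s) in zip(values, stats))
        for key, values in table.items()
    }

scaled = zscore_columns(rows)

raw_gap = math.dist(rows["Abbeydale"], rows["Altadore"])
scaled_gap = math.dist(scaled["Abbeydale"], scaled["Altadore"])
print(raw_gap, scaled_gap)
```

On the raw values the distance is essentially the income difference of 1,559 dollars, and the venue share is invisible. After standardization both columns contribute on a comparable scale.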

*Finally*, we are ready to proceed with the clustering.

## 3. Methodology

### 3.1 Naive Clustering

The general strategy is to apply a clustering algorithm (whether K-means or DBSCAN) to segment the neighborhoods and identify those in Calgary that fall in the same cluster as Strathcona. We will first naively apply the clustering algorithm to the dataset *cluster_dataset* from the last section. When visualizing the clusters, we will note some patterns suggesting that the clusters are not quite what we intended.

In [444]:
# Import library
from sklearn.cluster import KMeans 

# set number of clusters
kclusters = 10

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, init = 'k-means++', n_init = 12, random_state=0).fit(cluster_dataset)

# check cluster labels generated for each row in the dataframe
labels = kmeans.labels_

In [451]:
Final_Dataset['k-Labels'] = labels
Final_Dataset.tail()

Unnamed: 0,Neighborhood,Latitude,Longitude,Median Age,Median Income,Population Density,Accessories Store,American Restaurant,Art Gallery,Arts & Crafts Store,...,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Farmers Market,General Entertainment,Jazz Club,Public Art,k-Labels
155,Willow Park,50.960293,-114.054645,45.0,63588.0,1537.9,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
156,Windsor Park,51.006165,-114.076187,37.0,39425.0,3173.8,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
157,Winston Heights/Mountview,51.072303,-114.047588,42.0,41065.0,1297.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
158,Woodlands,50.942435,-114.109359,40.0,71234.0,2214.6,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
159,Strathcona,53.522,-113.492,35.0,68403.0,5722.3,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.041667,0.041667,0.041667,0.041667,2


In [452]:
calgary_latitude = 51.0447
calgary_longitude = -114.0719

# create map of Calgary using latitude and longitude values
map_clusters = folium.Map(location=[calgary_latitude, calgary_longitude], zoom_start=11)


# add markers to the map for the neighborhoods that share Strathcona's cluster (label 2)
for lat, lon, poi, cluster in zip(Final_Dataset['Latitude'], Final_Dataset['Longitude'], Final_Dataset['Neighborhood'], Final_Dataset['k-Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    if cluster == 2:
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='blue',
            fill_opacity=0.7).add_to(map_clusters)

map_clusters

The K-means clustering seems to indicate that most neighborhoods in Calgary are pretty similar to the Strathcona neighborhood in Edmonton. However, intuitively, we know that this cannot be true. As a concrete example, Signal Hill, located near the western boundary of the city, has a reputation for being a quiet residential neighborhood. Strathcona, on the other hand, is known for its busy pubs, restaurants and night clubs. How do we explain this discrepancy?
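
Before looking for an explanation, it helps to put a number on "most neighborhoods". The sketch below counts cluster sizes and the fraction of other rows that share the target row's label. The label list is made up for illustration; in the notebook it would be `list(kmeans.labels_)`, with Strathcona as the last row. `cluster_share` is a hypothetical helper.

```python
from collections import Counter

def cluster_share(labels, target_index=-1):
    """Return (cluster sizes, fraction of the other rows sharing the target row's label)."""
    sizes = Counter(labels)
    target_label = labels[target_index]
    others = len(labels) - 1
    same = sizes[target_label] - 1  # exclude the target row itself
    return sizes, same / others

# Made-up labels for illustration; the last entry plays the role of Strathcona
labels = [2, 2, 5, 2, 2, 7, 2, 2]
sizes, share = cluster_share(labels)
print(sizes, share)
```

A share close to 1 together with several single-member clusters would suggest that the algorithm is isolating outliers rather than finding meaningful groups.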


## 3.2 The Curse of Dimensionality

As it turns out, clustering in high dimensions is tricky. As the number of dimensions increases, the distances between pairs of sample points tend to become nearly equal: the gap between the nearest and the farthest neighbor shrinks relative to the distances themselves. Our traditional "distance" metric therefore loses its ability to separate similar samples from dissimilar ones, and almost every data point looks about equally far from every other when the feature space dimensionality is very high. This can be verified to some extent by playing with the epsilon parameter in the DBSCAN algorithm and visualizing the effect (or lack thereof) on the labels.
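
A small experiment can make this effect concrete. The sketch below does not use the neighborhood data; it draws uniformly random points (an assumption made purely for illustration) and measures how much farther the farthest point is from a query point than the nearest one, relative to the nearest distance. The function name `relative_contrast`, the sample size and the list of dimensions are all illustrative choices.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """Return (d_max - d_min) / d_min for the Euclidean distances
    from one random query point to n_points uniform random points."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrasts = {d: relative_contrast(200, d, rng) for d in (2, 10, 100, 1000)}

for n_dims, contrast in contrasts.items():
    print(f"{n_dims:>5} dimensions: relative contrast = {contrast:.2f}")
```

The printed contrast should fall sharply as the number of dimensions grows, which is the behavior that makes cluster assignments in the full feature space hard to trust.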

The most meaningful fix in our case would be to reduce the feature space dimensionality. Alex has kindly narrowed down his most important "Places" features to the following:
1. Bar
2. Train Station
3. Café
4. Grocery Store
5. Indian Restaurant



Let's build a new reduced dataset with only these features for the Places category.

In [464]:
Filtered_Dataset = Final_Dataset[['Neighborhood','Latitude','Longitude','Median Age','Median Income','Population Density','Bar','Train Station','Café','Grocery Store','Indian Restaurant']]

In [465]:
Filtered_Dataset

Unnamed: 0,Neighborhood,Latitude,Longitude,Median Age,Median Income,Population Density,Bar,Train Station,Café,Grocery Store,Indian Restaurant
0,Abbeydale,51.058836,-113.929413,34.0,55345.0,3480.6,0.000000,0.000000,0.000,0.000000,0.000000
1,Acadia,50.968655,-114.055587,42.0,46089.0,2744.9,0.000000,0.000000,0.000,0.000000,0.000000
2,Altadore,51.015104,-114.100756,37.0,53786.0,3143.4,0.000000,0.000000,0.000,0.000000,0.000000
3,Applewood Park,51.044658,-113.928931,33.0,65724.0,4061.3,0.000000,0.000000,0.000,0.000000,0.000000
4,Arbour Lake,51.136786,-114.202355,41.0,70590.0,2462.7,0.000000,0.000000,0.000,0.166667,0.000000
...,...,...,...,...,...,...,...,...,...,...,...
155,Willow Park,50.960293,-114.054645,45.0,63588.0,1537.9,0.000000,0.000000,0.000,0.200000,0.000000
156,Windsor Park,51.006165,-114.076187,37.0,39425.0,3173.8,0.000000,0.000000,0.000,0.000000,0.000000
157,Winston Heights/Mountview,51.072303,-114.047588,42.0,41065.0,1297.0,0.000000,0.000000,0.000,0.000000,0.000000
158,Woodlands,50.942435,-114.109359,40.0,71234.0,2214.6,0.000000,0.000000,0.000,0.000000,0.000000


## 3.3 Clustering on Reduced Feature Set Dimensions

In [466]:
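# Normalize the data points as before
# Columns 0-2 are Neighborhood, Latitude and Longitude; the features start at column 3
X = Filtered_Dataset.iloc[:, 3:].to_numpy(dtype=float)
X = np.nan_to_num(X)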
cluster_filter_dataset = StandardScaler().fit_transform(X)
cluster_filter_dataset

array([[-0.88641374, -0.50623409,  0.56080048, ..., -0.30803938,
        -0.38147216, -0.15671419],
       [ 0.47894076, -0.85267489,  0.02377804, ..., -0.30803938,
        -0.38147216, -0.15671419],
       [-0.3744058 , -0.56458556,  0.31466216, ..., -0.30803938,
        -0.38147216, -0.15671419],
       ...,
       [ 0.47894076, -1.04071708, -1.03311307, ..., -0.30803938,
        -0.38147216, -0.15671419],
       [ 0.13760213,  0.08847181, -0.36331316, ..., -0.30803938,
        -0.38147216, -0.15671419],
       [-0.71574443, -0.01748907,  2.19712398, ...,  1.96481883,
         1.10521037,  0.71803493]])

In [472]:
# set number of clusters
kclusters = 10

# run k-means clustering
kmeans_filter = KMeans(n_clusters=kclusters, init = 'k-means++', n_init = 12, random_state=0).fit(cluster_filter_dataset)

# check cluster labels generated for each row in the dataframe
filtered_labels = kmeans_filter.labels_

In [510]:
# Work on an explicit copy so the new column is not assigned to a slice of Final_Dataset
Filtered_Dataset = Filtered_Dataset.copy()
Filtered_Dataset['k-Labels'] = filtered_labels
Filtered_Dataset.tail()


Unnamed: 0,Neighborhood,Latitude,Longitude,Median Age,Median Income,Population Density,Bar,Train Station,Café,Grocery Store,Indian Restaurant,k-Labels
155,Willow Park,50.960293,-114.054645,45.0,63588.0,1537.9,0.0,0.0,0.0,0.2,0.0,2
156,Windsor Park,51.006165,-114.076187,37.0,39425.0,3173.8,0.0,0.0,0.0,0.0,0.0,8
157,Winston Heights/Mountview,51.072303,-114.047588,42.0,41065.0,1297.0,0.0,0.0,0.0,0.0,0.0,1
158,Woodlands,50.942435,-114.109359,40.0,71234.0,2214.6,0.0,0.0,0.0,0.0,0.0,1
159,Strathcona,53.522,-113.492,35.0,68403.0,5722.3,0.041667,0.041667,0.125,0.083333,0.041667,3


In [474]:
calgary_latitude = 51.0447
calgary_longitude = -114.0719

# create map of Calgary using latitude and longitude values
map_clusters = folium.Map(location=[calgary_latitude, calgary_longitude], zoom_start=11)


# add markers only for the neighborhoods that share Strathcona's cluster (label 3)
for lat, lon, poi, cluster in zip(Filtered_Dataset['Latitude'], Filtered_Dataset['Longitude'], Filtered_Dataset['Neighborhood'], Filtered_Dataset['k-Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    if cluster == 3:
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='blue',
            fill_opacity=0.7).add_to(map_clusters)

map_clusters

In [509]:
Filtered_Dataset[Filtered_Dataset['k-Labels'] == 3]

Unnamed: 0,Neighborhood,Latitude,Longitude,Median Age,Median Income,Population Density,Bar,Train Station,Café,Grocery Store,Indian Restaurant,k-Labels
8,Bankview,51.033887,-114.099518,32.0,32474.0,7458.6,0.0,0.0,0.0,0.142857,0.0,3
12,Beltline,51.040498,-114.072593,33.0,33901.0,6786.6,0.053571,0.0,0.017857,0.0,0.0,3
29,Cliff Bungalow,51.034436,-114.073833,32.0,35576.0,4840.0,0.030303,0.0,0.030303,0.030303,0.030303,3
78,Lower Mount Royal,51.036645,-114.087139,34.0,35570.0,10600.0,0.0,0.0,0.0,0.02,0.0,3
91,Mission,51.031758,-114.06672,34.0,37040.0,8650.0,0.0,0.0,0.027778,0.027778,0.027778,3
159,Strathcona,53.522,-113.492,35.0,68403.0,5722.3,0.041667,0.041667,0.125,0.083333,0.041667,3


## 4. Discussion

Let us use our intuition and domain knowledge to validate the results. The neighborhoods classified as similar to Strathcona are all in lively locations just outside Calgary's downtown core. This is a promising start, since Strathcona is known for its busy bars and nightlife, as mentioned earlier. These neighborhoods all have lots of bars, cafés, grocery stores and Indian restaurants within walking distance. Moreover, since many companies have offices downtown, young professionals tend to live in these neighborhoods, as evidenced by the median ages of 32 to 34. Finally, as someone who has lived in both the Strathcona and Lower Mount Royal neighborhoods, I can confirm that the ambience and "vibe" are indeed quite similar.


The clustering has yielded five candidate neighborhoods that Alex can consider for his relocation. The next step is to assess these neighborhoods against his other preferences and produce a ranked list. For example, Alex may prefer to be as close to work as possible to minimize travel time in the winter. Or his priority may be to minimize rental costs, since a significant chunk of his salary must be allocated to student loan repayment. If there are multiple competing considerations, a weighted average metric could be employed to produce the ranking.
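
One possible way to build such a weighted average is sketched below. The commute times, rents and weights are made-up placeholders, not data collected for this project, and the names `candidates`, `weights`, `min_max` and `rank_candidates` are hypothetical. Each criterion is rescaled to the 0 to 1 range so that minutes and dollars become comparable, and a lower score is better.

```python
# Hypothetical inputs: every number below is a placeholder for illustration only.
candidates = {
    "Bankview":          {"commute_min": 25, "rent": 1300},
    "Beltline":          {"commute_min": 10, "rent": 1700},
    "Cliff Bungalow":    {"commute_min": 15, "rent": 1500},
    "Lower Mount Royal": {"commute_min": 20, "rent": 1400},
    "Mission":           {"commute_min": 15, "rent": 1600},
}

# Assumed priorities: rent matters somewhat more than commute time.
weights = {"commute_min": 0.4, "rent": 0.6}


def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def rank_candidates(candidates, weights):
    """Return (name, score) pairs sorted from best (lowest) to worst."""
    names = list(candidates)
    scores = {name: 0.0 for name in names}
    for criterion, weight in weights.items():
        scaled = min_max([candidates[name][criterion] for name in names])
        for name, value in zip(names, scaled):
            scores[name] += weight * value
    return sorted(scores.items(), key=lambda item: item[1])


ranking = rank_candidates(candidates, weights)
for name, score in ranking:
    print(f"{name:<18} {score:.3f}")
```

Changing the weights is how Alex would express a different trade-off, for example giving the commute a weight of 0.8 if winter travel time is the main concern.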


## 5. Results

The K-means clustering algorithm was successfully employed on the reduced feature data set to obtain a reasonable cluster of neighborhoods that are similar to Strathcona. Obviously, the methodology can be extended to any combination of cities and with more intricate feature selection, perhaps even across countries. 

To avoid the dimensionality issue, it is recommended that the feature set be restricted to 10 or fewer variables. This requires careful consideration in choosing the subset of features that is most meaningful for the use case. As with most data science projects, some iteration might be required to arrive at a reasonable combination of features and clusters.



## 6. References and Data Sources

Strathcona Edmonton - Population Density
https://en.wikipedia.org/wiki/Strathcona,_Edmonton

List of Neighborhoods in Calgary and Population Density
https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Calgary

Median Age and Household Income for Calgary Neighborhoods
https://great-news.ca/demographics/

Strathcona Edmonton - Median Age and Income
https://public.tableau.com/profile/city.of.edmonton#!/vizhome/2019EdmontonMunicipalCensus/2019EdmontonMunicipalCensusNeighbourhood

Location Information - Foursquare API
https://developer.foursquare.com/docs/places-api/