# Visualizing Toronto Neighborhood Data and Clusters

In this notebook, we analyze each neighborhood in our Toronto dataset and attach to it the venue categories that recur most often nearby, as reported by the Foursquare API.


## Goals:
* Map all neighborhoods 
* Organize all neighborhoods by number/count of venue categories
* Use said data and a K-Means algorithm to organize neighborhoods into 4 groups
* Map those 4 groups


## Part 1, Map all neighborhoods

In [1]:
# import libraries

import pandas as pd 
import numpy as np 

import json 
import foursquare

import matplotlib.cm as cm 
import matplotlib.colors as colors 
import folium

from sklearn.cluster import KMeans

In [2]:
# import toronto data with latitudes and longitudes 

codes_df = pd.read_csv('toronto_geo_data.csv')
codes_df.drop(columns=['Unnamed: 0'], inplace=True)
codes_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494


In [3]:
codes_df.shape

(103, 5)

In [4]:
# Create folium map

toronto_ll = [43.6532, -79.3832]
m = folium.Map(location=toronto_ll, zoom_start=11)

# Add markers
for lat, lng, borough, neigh in zip(codes_df['Latitude'], codes_df['Longitude'], codes_df['Borough'], codes_df['Neighborhood']):
    label = f"<h3>{neigh}</h3> <p>{borough}</p>"
    label = folium.Popup(label, parse_html=False)
    folium.vector_layers.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color="blue",
        fill=True,
        fill_color="blue",
        fill_opacity=0.5,
        parse_html=True).add_to(m)

m

## Part 2, Organize neighborhoods by venues

In [5]:
# Get client_id and client_secret

with open('credentials.json') as cred_file:
    credentials = json.load(cred_file)

CLIENT_ID = credentials['client_id']
CLIENT_SECRET = credentials['client_secret']

# Initialize Foursquare client
client = foursquare.Foursquare(client_id=CLIENT_ID, client_secret=CLIENT_SECRET, version="20200501")


In [6]:
# Create Function to get nearest businesses by lat/lng

def get_nearest(lat, lng):
    DISTANCE = 500 # meters
    LIMIT = 30

    venue_list = []
    res = client.venues.explore(params={"ll": f"{lat}, {lng}", "radius":DISTANCE, "limit":LIMIT})
    
    for venue in res['groups'][0]['items']:
        venue_list.append({
            'name': venue['venue']['name'],
            'lat' : venue['venue']['location']['lat'],
            'lng' : venue['venue']['location']['lng'],
            'category' : venue['venue']['categories'][0]['name']
        })
    return venue_list
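
`get_nearest` indexes `venue['categories'][0]` directly, which would raise an `IndexError` if a venue came back with an empty category list. The sketch below shows one defensive way to parse a response of the same shape. The response layout is taken from the function above; `parse_explore_items` and `mock_response` are hypothetical names and data for illustration, not part of the Foursquare client.

```python
def parse_explore_items(res):
    """Flatten an explore-style response into a list of venue dicts.

    Assumes the layout used by get_nearest:
    res['groups'][0]['items'][i]['venue'] -> name, location, categories.
    """
    venue_rows = []
    groups = res.get("groups") or []
    items = groups[0].get("items", []) if groups else []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        venue_rows.append({
            "name": venue["name"],
            "lat": venue["location"]["lat"],
            "lng": venue["location"]["lng"],
            # Fall back to None when a venue has no category
            "category": categories[0]["name"] if categories else None,
        })
    return venue_rows


# Hypothetical response, made up for this example
mock_response = {
    "groups": [{
        "items": [
            {"venue": {"name": "Example Cafe",
                       "location": {"lat": 43.65, "lng": -79.38},
                       "categories": [{"name": "Café"}]}},
            {"venue": {"name": "Unlabelled Spot",
                       "location": {"lat": 43.66, "lng": -79.39},
                       "categories": []}},
        ]
    }]
}

parsed = parse_explore_items(mock_response)
print(parsed)
```

Rows with a `None` category could then be dropped or relabelled before building the DataFrame.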

In [8]:
# Iterate over whole list of neighborhoods

venues_obj = {
    "neighborhood" : [],
    "neighborhood_lat" : [],
    "neighborhood_lng" : [],
    "venue" : [],
    "venue_lat" : [],
    "venue_lng" : [],
    "venue_cat" : []
}

for neigh, lat, lng in zip(codes_df['Neighborhood'], codes_df['Latitude'], codes_df['Longitude']):
    venues = get_nearest(lat, lng)

    print(f"Retrieving venues for {neigh}")
    for v in venues:
        venues_obj['neighborhood'].append(neigh)
        venues_obj['neighborhood_lat'].append(lat)
        venues_obj['neighborhood_lng'].append(lng)
        venues_obj['venue'].append(v['name'])
        venues_obj['venue_lat'].append(v['lat'])
        venues_obj['venue_lng'].append(v['lng'])
        venues_obj['venue_cat'].append(v['category'])

Retrieving venues for Parkwoods
Retrieving venues for Victoria Village
Retrieving venues for Regent Park / Harbourfront
Retrieving venues for Lawrence Manor / Lawrence Heights
Retrieving venues for Queen's Park / Ontario Provincial Government
Retrieving venues for Islington Avenue
Retrieving venues for Malvern / Rouge
Retrieving venues for Don Mills
Retrieving venues for Parkview Hill / Woodbine Gardens
Retrieving venues for Garden District / Ryerson
Retrieving venues for Glencairn
Retrieving venues for West Deane Park / Princess Gardens / Martin Grove / Islington / Cloverdale
Retrieving venues for Rouge Hill / Port Union / Highland Creek
Retrieving venues for Don Mills
Retrieving venues for Woodbine Heights
Retrieving venues for St. James Town
Retrieving venues for Humewood-Cedarvale
Retrieving venues for Eringate / Bloordale Gardens / Old Burnhamthorpe / Markland Wood
Retrieving venues for Guildwood / Morningside / West Hill
Retrieving venues for The Beaches
Retrieving venues for Ber

In [9]:
# Create DataFrame

venues_df = pd.DataFrame(venues_obj)

In [10]:
# Explore

venues_df.head()
len(venues_df)
len(venues_df['venue_cat'].unique())

231

In [11]:
# Reorganize into weighted venue categories, organized by neighborhood

weighted_venues_df = pd.get_dummies(venues_df[['venue_cat']], prefix="", prefix_sep="")
weighted_venues_df['neighborhood'] = venues_df['neighborhood']

tempcol = [weighted_venues_df.columns[-1]] + list(weighted_venues_df.columns[:-1])
weighted_venues_df = weighted_venues_df[tempcol]

# Average the one-hot columns to get each category's relative frequency per neighborhood

weighted_venues_df = weighted_venues_df.groupby('neighborhood').mean().reset_index()
weighted_venues_df


Unnamed: 0,neighborhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
1,Alderwood / Long Branch,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
2,Bathurst Manor / Wilson Heights / Downsview North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.055556,0.000000,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
4,Bedford Park / Lawrence Manor East,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,Willowdale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.027778,0.0,0.0,0.0,0.0,0.0
91,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
92,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.111111,0.000000,0.0,0.0,0.0,0.0,0.0
93,York Mills / Silver Hills,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
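
To make the weighting step concrete, here is a minimal sketch on made-up data. One-hot encoding the category column and averaging per neighborhood gives the share of that neighborhood's venues that fall in each category, so every row sums to 1. The `toy_venues` frame is hypothetical.

```python
import pandas as pd

# Hypothetical venues: neighborhood A has four, neighborhood B has two
toy_venues = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "A", "B", "B"],
    "venue_cat": ["Park", "Park", "Café", "Bank", "Café", "Café"],
})

# One column per category, 1.0 where the venue belongs to it
one_hot = pd.get_dummies(toy_venues["venue_cat"], dtype=float)
one_hot.insert(0, "neighborhood", toy_venues["neighborhood"])

# The mean of a 0/1 column is the fraction of rows equal to 1
toy_freq = one_hot.groupby("neighborhood").mean().reset_index()
print(toy_freq)
```

Here neighborhood A ends up with Park at 0.5 and Bank and Café at 0.25 each, while neighborhood B is Café at 1.0.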


In [20]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

# Get top venues per neighborhood

num_top_venues = 10

columns = ['neighborhood']
for ind in np.arange(num_top_venues):
    columns.append(f"No. {ind + 1} Most Common Venue")

# Sort Dataframe 

neighborhoods_by_venue_df = pd.DataFrame(columns = columns)
neighborhoods_by_venue_df['neighborhood'] = weighted_venues_df['neighborhood']

for ind in np.arange(weighted_venues_df.shape[0]):
    neighborhoods_by_venue_df.iloc[ind, 1:] = return_most_common_venues(weighted_venues_df.iloc[ind, :], num_top_venues)

neighborhoods_by_venue_df


Unnamed: 0,neighborhood,No. 1 Most Common Venue,No. 2 Most Common Venue,No. 3 Most Common Venue,No. 4 Most Common Venue,No. 5 Most Common Venue,No. 6 Most Common Venue,No. 7 Most Common Venue,No. 8 Most Common Venue,No. 9 Most Common Venue,No. 10 Most Common Venue
0,Agincourt,Latin American Restaurant,Breakfast Spot,Skating Rink,Clothing Store,Lounge,Yoga Studio,Dim Sum Restaurant,Event Space,Electronics Store,Eastern European Restaurant
1,Alderwood / Long Branch,Pizza Place,Gym,Pool,Coffee Shop,Skating Rink,Pharmacy,Pub,Athletics & Sports,Sandwich Place,Department Store
2,Bathurst Manor / Wilson Heights / Downsview North,Coffee Shop,Bank,Shopping Mall,Pharmacy,Ice Cream Shop,Middle Eastern Restaurant,Restaurant,Fried Chicken Joint,Diner,Deli / Bodega
3,Bayview Village,Café,Japanese Restaurant,Bank,Chinese Restaurant,Dim Sum Restaurant,Falafel Restaurant,Event Space,Electronics Store,Eastern European Restaurant,Drugstore
4,Bedford Park / Lawrence Manor East,Coffee Shop,Restaurant,Sandwich Place,Italian Restaurant,Sushi Restaurant,Pharmacy,Pizza Place,Pub,Café,Butcher
...,...,...,...,...,...,...,...,...,...,...,...
90,Willowdale,Coffee Shop,Ramen Restaurant,Pizza Place,Café,Grocery Store,Sandwich Place,Japanese Restaurant,Discount Store,Ice Cream Shop,Electronics Store
91,Woburn,Coffee Shop,Indian Restaurant,Korean Restaurant,Yoga Studio,Falafel Restaurant,Event Space,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop
92,Woodbine Heights,Dance Studio,Park,Spa,Skating Rink,Pharmacy,Video Store,Beer Store,Curling Ice,Cosmetics Shop,Distribution Center
93,York Mills / Silver Hills,Cafeteria,Yoga Studio,Falafel Restaurant,Event Space,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run,Distribution Center
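
One caveat about the table above: `return_most_common_venues` always returns ten names, even when a neighborhood has fewer than ten categories with a non-zero frequency. Bayview Village, for instance, has only four non-zero categories in the frequency printout later in this notebook, so its remaining slots are filled with categories whose frequency is 0. The sketch below is one possible variant that keeps only categories that actually occur. `top_categories` and `toy_row` are hypothetical and not part of the notebook.

```python
import pandas as pd

def top_categories(freq_row, n):
    """Return up to n category names with non-zero frequency, highest first.

    Ties are returned in whatever order the sort leaves them.
    """
    nonzero = freq_row[freq_row > 0]
    return nonzero.sort_values(ascending=False).index[:n].tolist()

# Hypothetical frequency row for a single neighborhood
toy_row = pd.Series({"Bank": 0.25, "Café": 0.5, "Park": 0.25, "Zoo": 0.0})

top_toy = top_categories(toy_row, 3)
print(top_toy)
```

With this variant the result can be shorter than `n`, so it would need padding (for example with `None`) before being written into a fixed-width table.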


## Part 3, Cluster neighborhoods by Venues

### Note:

We want our data to be clustered purely on how often each type of venue occurs.  Therefore, we will drop the "neighborhood" column before we start training.  We are using the dataframe of weighted values, not the ranked "No. X Most Common Venue" table printed above.
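
As a small illustration of this idea, the sketch below clusters a made-up frequency table. The name column is dropped so that only the numeric frequencies reach KMeans, and the resulting labels come back in the same row order, which is what lets us attach them to the neighborhoods afterwards. The `toy_freq` values are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical frequencies: A and B are park-heavy, C and D are coffee-heavy
toy_freq = pd.DataFrame({
    "neighborhood": ["A", "B", "C", "D"],
    "Park": [0.9, 0.8, 0.0, 0.1],
    "Coffee Shop": [0.1, 0.2, 1.0, 0.9],
})

# Only the numeric columns are used as features
toy_features = toy_freq.drop(columns=["neighborhood"])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_features)

# labels_ follows the row order of toy_features
toy_labels = dict(zip(toy_freq["neighborhood"], toy_km.labels_))
print(toy_labels)
```

The label numbers themselves are arbitrary. Only which neighborhoods share a label is meaningful.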

In [16]:
# Create Model 
n_clusters = 4

# Use the columns keyword; passing the axis positionally is not supported in recent pandas
cluster_data_df = weighted_venues_df.drop(columns=['neighborhood'])

km_cluster = KMeans(n_clusters=n_clusters, random_state=0)
km_cluster.fit(cluster_data_df)

km_cluster

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [22]:
# Add cluster labels to neighborhoods_by_venue_df

# Drop any existing label column first so this cell can be re-run safely
if 'Cluster Label' in neighborhoods_by_venue_df.columns:
    neighborhoods_by_venue_df.drop(columns=['Cluster Label'], inplace=True)
neighborhoods_by_venue_df.insert(0, "Cluster Label", km_cluster.labels_)

# Work on a copy so codes_df itself is left unchanged
merged_df = codes_df.copy()

# Remember, in codes_df, "neighborhood" is "Neighborhood"
merged_df['neighborhood'] = merged_df['Neighborhood']
merged_df.drop(columns=['Neighborhood'], inplace=True)

merged_df = pd.merge(merged_df, neighborhoods_by_venue_df, on="neighborhood")

In [23]:

merged_df.head()

Unnamed: 0,PostalCode,Borough,Latitude,Longitude,neighborhood,Cluster Label,No. 1 Most Common Venue,No. 2 Most Common Venue,No. 3 Most Common Venue,No. 4 Most Common Venue,No. 5 Most Common Venue,No. 6 Most Common Venue,No. 7 Most Common Venue,No. 8 Most Common Venue,No. 9 Most Common Venue,No. 10 Most Common Venue
0,M3A,North York,43.753259,-79.329656,Parkwoods,0,Park,Convenience Store,Food & Drink Shop,Yoga Studio,Dessert Shop,Event Space,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop
1,M4A,North York,43.725882,-79.315572,Victoria Village,1,Portuguese Restaurant,French Restaurant,Intersection,Pizza Place,Coffee Shop,Hockey Arena,Yoga Studio,Dessert Shop,Dim Sum Restaurant,Diner
2,M5A,Downtown Toronto,43.65426,-79.360636,Regent Park / Harbourfront,1,Coffee Shop,Park,Breakfast Spot,Theater,Bakery,Performing Arts Venue,Restaurant,Café,Pub,Chocolate Shop
3,M6A,North York,43.718518,-79.464763,Lawrence Manor / Lawrence Heights,1,Clothing Store,Furniture / Home Store,Coffee Shop,Event Space,Miscellaneous Shop,Accessories Store,Boutique,Women's Store,Vietnamese Restaurant,Dim Sum Restaurant
4,M7A,Downtown Toronto,43.662301,-79.389494,Queen's Park / Ontario Provincial Government,1,Coffee Shop,Sushi Restaurant,Yoga Studio,Mexican Restaurant,Beer Bar,Burger Joint,Sandwich Place,Burrito Place,Café,Creperie


### Note:

The purpose of the above exercises was to do the following:

1. Find out what venues were most common/characteristic of each neighborhood (thus all the weighting and the creating of a DF with the "No. X Most Common")
2. Use the weights of those common/characteristic features to put each neighborhood into one of four clusters, via KMeans
3. Add the "No. X Most Common" Data and the cluster assignment to our original neighborhoods dataframe.
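
A caveat on step 3: `codes_df` has 103 rows, while the weighted table only reaches index 93, so some neighborhoods returned no venues. The default inner merge silently drops those rows. The sketch below, on made-up data, shows one way to see which neighborhoods would be lost, using a left merge with `indicator=True`. The `toy_codes` and `toy_clusters` frames are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: neighborhood B has no venue data
toy_codes = pd.DataFrame({
    "neighborhood": ["A", "B", "C"],
    "Latitude": [43.70, 43.71, 43.72],
    "Longitude": [-79.40, -79.41, -79.42],
})
toy_clusters = pd.DataFrame({
    "neighborhood": ["A", "C"],
    "Cluster Label": [0, 1],
})

# how="left" keeps every row of toy_codes; _merge records where each row matched
checked = toy_codes.merge(toy_clusters, on="neighborhood", how="left", indicator=True)

missing = checked.loc[checked["_merge"] == "left_only", "neighborhood"].tolist()
print(missing)
```

Rows listed in `missing` have no cluster label, so they would need to be dropped or handled separately before mapping.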

Now we can redraw the neighborhoods map, coloring each point by cluster.

## Part 4, Map the clusters

In [51]:
# Create the map and add markers

cluster_m = folium.Map(location=toronto_ll, zoom_start=11)

# Named cluster_colors so the list does not shadow the matplotlib.colors import
cluster_colors = ["red", "green", "blue", "black", "pink"]

# Add markers
for lat, lng, borough, neigh, cid in zip(merged_df['Latitude'], merged_df['Longitude'], merged_df['Borough'], merged_df['neighborhood'], merged_df['Cluster Label']):
    label = f"<h3>{neigh}</h3> <p>{borough}</p> <p>Group {cid + 1}</p>"
    label = folium.Popup(label, parse_html=False)
    folium.vector_layers.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color=cluster_colors[cid],
        fill=True,
        fill_color=cluster_colors[cid],
        fill_opacity=0.5,
        parse_html=True).add_to(cluster_m)

cluster_m

## Further Conclusions

Try to understand what distinguishes each cluster.  Remember, we are using an unsupervised algorithm, so the clusters come without labels and we don't **really** know what each group represents until we inspect it.
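
A quick first check is how many neighborhoods landed in each cluster, since a cluster with a single member describes one neighborhood rather than a shared pattern. The sketch below counts members per label. The `toy_labels` array is hypothetical; with the real model it would be `km_cluster.labels_`.

```python
import numpy as np

# Hypothetical labels for eight neighborhoods and four clusters
toy_labels = np.array([1, 1, 0, 1, 2, 1, 3, 1])

# Number of neighborhoods assigned to each cluster id
sizes = np.bincount(toy_labels, minlength=4)

# Cluster ids that contain exactly one neighborhood
singletons = np.flatnonzero(sizes == 1).tolist()

print(sizes.tolist())
print(singletons)
```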

In [33]:
# This loop will get the venues with the highest frequency by neighborhood

for neighborhood in weighted_venues_df['neighborhood']:
    print(f"---- {neighborhood} ----")
    temp = weighted_venues_df[weighted_venues_df['neighborhood'] == neighborhood].T.reset_index()
    temp.columns=["venue", "freq"]
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(5))
    print('\n')

       Gym   0.1
2        Skating Rink   0.1
3      Sandwich Place   0.1
4  Athletics & Sports   0.1


---- Bathurst Manor / Wilson Heights / Downsview North ----
           venue  freq
0           Bank  0.11
1    Coffee Shop  0.11
2    Bridal Shop  0.06
3  Shopping Mall  0.06
4     Restaurant  0.06


---- Bayview Village ----
                 venue  freq
0                 Bank  0.25
1   Chinese Restaurant  0.25
2  Japanese Restaurant  0.25
3                 Café  0.25
4    Accessories Store  0.00


---- Bedford Park / Lawrence Manor East ----
                     venue  freq
0              Coffee Shop  0.08
1               Restaurant  0.08
2           Sandwich Place  0.08
3       Italian Restaurant  0.08
4  Comfort Food Restaurant  0.04


---- Berczy Park ----
                venue  freq
0        Cocktail Bar  0.07
1         Coffee Shop  0.07
2  Seafood Restaurant  0.07
3            Beer Bar  0.07
4              Museum  0.03


---- Birch Cliff / Cliffside West ----
                   

In [50]:
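# Get the mean of each venue frequency, grouped by cluster

# weighted_venues_df has no "cid" column, so attach the KMeans labels to the
# numeric frequency columns first (the labels are in the same row order)
weighted_clustered_df = cluster_data_df.assign(cid=km_cluster.labels_).groupby("cid").mean().reset_index()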

for cid in weighted_clustered_df['cid']:
    print(f"---- Cluster {cid + 1} ----")
    temp = weighted_clustered_df[weighted_clustered_df['cid'] == cid].T.reset_index()
    temp.columns=["venue", "freq"]
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(5))
    print('\n')

---- Cluster 1 ----
                        venue  freq
0                        Park  0.38
1           Convenience Store  0.13
2                  Playground  0.06
3                       Trail  0.06
4  Construction & Landscaping  0.06


---- Cluster 2 ----
                  venue  freq
0           Coffee Shop  0.06
1           Pizza Place  0.04
2                  Café  0.04
3  Fast Food Restaurant  0.03
4        Sandwich Place  0.03


---- Cluster 3 ----
               venue  freq
0          Cafeteria   1.0
1  Accessories Store   0.0
2      Movie Theater   0.0
3     Massage Studio   0.0
4     Medical Center   0.0


---- Cluster 4 ----
                 venue  freq
0  Filipino Restaurant   1.0
1        Movie Theater   0.0
2    Martial Arts Dojo   0.0
3       Massage Studio   0.0
4       Medical Center   0.0




## Notes

It looks as though we have some highly localized clusters: Clusters 3 and 4 are each dominated by a single venue category (Cafeteria and Filipino Restaurant, each with a frequency of 1.0).  We may consider any of the following:

1. Widening the radius of the venue search (at the cost of potential overlaps between neighborhoods)
2. Increasing the number of clusters (at the risk of overfitting)
3. Removing neighborhoods farther from the center of Toronto (where venues are more scarce)
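
For option 2, one common way to compare cluster counts is the silhouette score, which is higher when points sit close to their own cluster and far from the others. This notebook does not compute it. The sketch below is only an illustration on synthetic data with three well-separated groups, and the same loop could be pointed at `cluster_data_df` instead of `toy_points`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three tight groups around hand-picked centers
toy_points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit KMeans for several values of k and record the silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_points)
    scores[k] = silhouette_score(toy_points, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real venue data the scores are likely to be much closer together, so they are best read as a rough guide rather than a definitive answer.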