# Battle of the Neighborhoods
## Venue recommendation by Subway Station in Montreal

### Table of Contents

##### Setup:
    A. Problem
    B. Background
    C. Data
##### Report:
    1. Introduction
    2. Data
    3. Methodology
    4. Results
    5. Discussion
    6. Conclusion
##### Appendix:
    a. Code

## **Setup**

### **A. Problem:**

Every summer, the influx of tourists for the festival season creates and incredible gridlock in the streets of Montreal. Much to the dismay of locals, very few people use the amazing public transport system to get from place to place, opting instead to drive around. I believe this is due to the lack of awareness of all the amazing restaurants, bars, theatres, and other points of interest that can easily be reached by subway in a matter of minutes.

### **B. Background:**

As a Montreal local, I have seen it every summer. Droves of tourists clutter the streets of the city, as they try to drive a few blocks in rush-hour traffic. It always boggles the mind that they would opt to spend 20 minutes in downtown traffic, instead of jumping on the subway for a few stops and getting to their destination in relative tranquility (the odd tipsy university student notwithstanding). Figuring it’s probably due to their lack of knowledge, I have decided to help them out by clustering and comparing the top venues near each of our subway stations. 

### **C. Data:**

We will be working mainly with two datasets for this project. 

First and foremost, we need geolocation coordinates for Montreal’s 68 subway (or Metro) stations. The source for these coordinates is the City of Montreal’s Open Data Portal (http://donnees.ville.montreal.qc.ca/dataset). More specifically we will be using their data set on “STM Bus and Subway lines (http://donnees.ville.montreal.qc.ca/dataset/stm-traces-des-lignes-de-bus-et-de-metro). Now unfortunately, they only make it available as a large .SHP Shapefile, so we are gong to have to do a lot of cleanup to make it workable.

Our second dataset will be the venues information queried from Foursquare using the geolocation coordinates obtained above. Instead of focusing on quantity (i.e. concentration of venues in a location), we will be focusing on quality (i.e. what are the top venues in a location). We are, after all, trying to convince tourists to use our world-class public transport system instead of contributing to the summer gridlock – and what better way than to guide them to the best Metro stations, with the best venues?


## **Report**

### **1. Introduction**

Montreal is an international destination for tourism. Every summer, tens of thousands of tourists from every corner of the world flock to the city for its renowned music, theatre, comedy, and arts festivals. Being quite unfamiliar with the city, most tend to rent a car for the duration of their stay, thus exacerbating the city’s already terrible traffic problems. As a result, what started off as a pleasant holiday in one of Canada’s most scenic cities turns into a stressful experience in bumper-to-bumper traffic. Little do they know that, instead of spending hours in traffic, they could hop on the city’s well-developed subway system & get to their destination in record time. With this project, we aim to educate and inform these newcomers in the fine art of getting around the city via public transport.

### **2. Data**

In order to address this problem, we will need to gather up data from a few data sources. 

First and foremost, we need to identify Montreal's 68 subway stations, and obtain their geolocation coordinates. This is easier said than done, as the loca traansit company operating Montreal's subway stations seems to be quite stingy with their data. Instead, we have to turn to the City of Montreal's Open Data Portal, which keeps a database of all the bus and subway stops on the island. We can retrieve that data from the Open Access portal, in the form of a .shp shapefile, which will need some processing before we get any useful data. The goal here, is to obtain a list of all the subway stations and their longitude and latitude coordinates.

Afterwards, we need to obtain a list of the top / featured venues for each specific area. Looking througn the Foursquare API documentation, we notice that we can append our requests to only return the "top picks" for a set of coordinates. Furthermore, we can also request that the data returned is sorted by popularity. We will be taking advantage of both these options to obtain the data we are looking for. Foursquare also tracks "venues" that might not be relevant for our purposes of helping out the tourists. Such venues include: colleges, hospitals, office buildings, etc. Seeing as we are not interested in some venue categories, we will certainly have to find a way to filter these out of the dataframe.

### **3. Methodology**

First things first, we have to wrangle the data obtained from Montreal's Open Data Portal into a format we can work with. As mentioned before, the dataset obtained from the city contains every single subway and bus stop on the island, along with a plethora of information we are absolutely not interested in. Since we are working with a .shp shapefile, we have to use GeoPandas to get the information into a dataframe. While mostly behaving the same was a Pandas, it dose have some issues parsing string objects with regex in the dataframe, so eventually we are going to want to move the data back into a Pandas dataframe. Starting off with the Geopandas dataframe, we can drop the columns we are not interested in. Following that, we have to re-encode the geolocation coordinates to something we can pass to folium and Foursquare. By default, the geolocation information is provided under the NAD83 MTM8 format, and we want to re-encode it to WGS84. This can eaisly be accomplished within GeoPandas. Following that, we can easily extract the informaiton we want using some simple regex espressions.

As for the Foursquare data, as mentioned above, we are going to be querying Top Picks locations, and have them sorted by popularity. This should return us with a dataframe of the top 5 or 10 best venues near each location. Aftwrwards, we are going to be dropping some venues compiled by Foursquare, that aren't exactly tourist destinations. Such venues could include schools, universities, playgorunds, office building a more (the full list of dropped venues is available in the code - obviously). Once we have the venues list, we are going to apply cluster the subway stations using k-means clustering and hopefully end up with 3 clearly defined clusters, based on the kind of venues in the area.

### **4. Results**

### **5. Discussion**

As we can observe from our results, we have succesfully clustered the different subway stations into three vaguely distinc clusters. The first cluster (0 - yellow) contains stations that are mostly in residential areas. As such, most of the venues that surround the subway station are corner stores, coffee shops, and small fast-food restaurant and stores. These stations would offer limited interest to tourists, as the venues are aimed at supplying the surrounding residential area, rather than catering to tourists. The second cluster (1 - red) contains subway stations that are of interest to toursits. Most of the venues in this cluster offer outdoor activities, theatres, concert venues, and other (mostly) non-restaurant venues. The third cluster (2 - blue) has caught the remaining stations, and is comprised of venues such as restaurants, and stores that might have some interests for tourists.

### **6. Conclusion**

In this project, we have identified a problem that could be solved through Data Science, and some applied clustering algorithms. We cleaned up some data retreived from various sources and processed it until it could be used to identify the top venues near each of Montreal's subway stations. Hopefully, it can act as a reference point for tourists to the city. Although it fails at direct people to specific venues, it does give an overview of just how many places can be reached by public transport in as little as 15-20 minutes rather than contributing to the traffic on the roads. We were able to cluster the different stations based on the kind of venues in their immediate surrounding. 

## **Appendix**

### **a. Code**

#### Start by importing the libraries we need for this project.

In [1]:
import pandas as pd
import geopandas as gpd
!pip install matplotlib -U
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import folium
from folium import plugins
import seaborn as sns
!pip install descartes -U
import descartes
import re
import requests

Requirement already up-to-date: matplotlib in c:\users\alexi\anaconda3\lib\site-packages (3.2.1)



Bad key "text.kerning_factor" on line 4 in
C:\Users\alexi\anaconda3\lib\site-packages\matplotlib\mpl-data\stylelib\_classic_test_patch.mplstyle.
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.1.3/matplotlibrc.template
or from the matplotlib source distribution


Requirement already up-to-date: descartes in c:\users\alexi\anaconda3\lib\site-packages (1.1.0)


#### Because we're working with a Shapefile, we are using a geoPandas geoDataFrame. We will eventually revert back to a Pandas DataFrame for convenience's sake.

In [2]:
df = gpd.read_file('stm_arrets_sig.shp')
df.head()

Unnamed: 0,stop_id,stop_code,stop_name,stop_url,wheelchair,route_id,loc_type,service_id,geometry
0,43-01,10118,Station Angrignon,,2,,2,20M,POINT (296677.562 5034048.338)
1,43,10118,Station Angrignon,http://www.stm.info/fr/infos/reseaux/metro/ang...,2,1.0,0,20M,POINT (296733.669 5034064.602)
2,42-01,10120,Station Monk - Édicule Nord,,2,,2,20M,POINT (297515.753 5034601.626)
3,42-02,10120,Station Monk - Édicule Sud,,2,,2,20M,POINT (297496.004 5034568.310)
4,42,10120,Station Monk,http://www.stm.info/fr/infos/reseaux/metro/monk,2,1.0,0,20M,POINT (297506.817 5034585.078)


#### Let's do a little bit of cleanup. Having had a look-through the file, I noticed we could drop all of the rows with "None" for route_id as they are all duplicates. We're also going to drop some useless columns.

In [3]:
df.replace(r'None', np.nan, regex=True, inplace = True)
df.dropna(axis = 0, how = "any", inplace = True)
df.drop(['stop_id', 'stop_code', 'wheelchair', 'loc_type', 'service_id'], axis = 1, inplace = True)

#### Every subway station, and bus stop has an associated URL (it's used to look up subway and bus schedules). Let's create a new dataframe containing only those entries where the URL contains the word "metro" (as this indicates this is a subway station).

In [4]:
df_metro = df[df['stop_url'].str.contains('.*metro.*')]

#### Let's re-project the geometry data to a coordinate system we are more familiar with (and something that folium will work with without complaining too much).

In [5]:
df_metro = df_metro.to_crs(epsg='4326')

#### Now let's convert the geoDataFrame to a regular DataFrame, and cast the "geometry" column to string so that we may parse it with regex. We have to do this because we started with a geoDataFrame created from a shapefile.

In [6]:
df_metro = pd.DataFrame(df_metro)
df_metro['geometry'] = df_metro['geometry'].astype('str')
print(df_metro.dtypes)
df_metro.head()

stop_name    object
stop_url     object
route_id     object
geometry     object
dtype: object


Unnamed: 0,stop_name,stop_url,route_id,geometry
1,Station Angrignon,http://www.stm.info/fr/infos/reseaux/metro/ang...,1,POINT (-73.60311799999998 45.44646599999288)
4,Station Monk,http://www.stm.info/fr/infos/reseaux/metro/monk,1,POINT (-73.593242 45.45115799999289)
6,Station Jolicoeur,http://www.stm.info/fr/infos/reseaux/metro/jol...,1,POINT (-73.58169099999999 45.45700999999288)
9,Station Verdun,http://www.stm.info/fr/infos/reseaux/metro/verdun,1,POINT (-73.57202099999999 45.45944099999288)
12,Station De l'Église,http://www.stm.info/fr/infos/reseaux/metro/de-...,1,POINT (-73.56707400000001 45.46189399999288)


Now, let's parse the clunky database, and extract all of the useful information into a new dataframe. We will take this opportunity to clean up the geolocation coordinates and make something more usable.

In [7]:
df_metro_geo = pd.DataFrame()
df_metro_geo['stop'] = ''
df_metro_geo['lat'] = ''
df_metro_geo['lon'] = ''

In [8]:
for name, geometry in zip('df_metro.stop_name', 'df_metro.geometry'):
    df_metro_geo.stop = df_metro.stop_name
    df_metro_geo.lon = df_metro.geometry.str.extract(pat = r"(-[0-9][0-9].[0-9]*)")
    df_metro_geo.lat = df_metro.geometry.str.extract(pat = r"[0-9]\s([0-9][0-9].[0-9]*)")

In [9]:
df_metro_geo.head()

Unnamed: 0,stop,lat,lon
1,Station Angrignon,45.44646599999288,-73.60311799999998
4,Station Monk,45.45115799999289,-73.593242
6,Station Jolicoeur,45.45700999999288,-73.58169099999999
9,Station Verdun,45.45944099999288,-73.57202099999999
12,Station De l'Église,45.46189399999288,-73.567074


#### Finally, let's draw a map of Montreal, and use our newly cleaned up geolocation coordinates to mark all of the subway stations.

In [10]:
mtl_map = folium.Map(location = [45.52, -73.62], zoom_start = 12, tiles = 'stamenterrain')

for row in df_metro_geo.itertuples():
    mtl_map.add_child(folium.CircleMarker(location = [row.lat, row.lon],
                                         radius = 5,
                                         fill = True,
                                         fill_color = 'red',
                                         fill_opacity = 0.7,
                                         popup = row.stop))

mtl_map

#### Ok, now let's make some queries to Foursquare using our subway stations coordinates. Couple of things to note - we're going to limit the number of venues to 10 per location, and have Foursquare return only the 5 most popular venues (passing "sortByPopularity = 1, and topPicks).

In [11]:
import config as cfg

CLIENT_ID = cfg.client_id
CLIENT_SECRET = cfg.client_secret
VERSION = "20200426"

In [12]:
radius = 1000
LIMIT = 10

venues = []

for stop, lat, lon in zip(df_metro_geo['stop'], df_metro_geo['lat'], df_metro_geo['lon']):
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&sortByPopularity=1&section=topPicks".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        lon,
        radius, 
        LIMIT)

    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in results:
        venues.append((
            stop,
            lat, 
            lon, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

#### Dump the returned venues into a Pandas DataFrame, & add some column headers. Note that some locations have less than 5 venues. That's normal as these are periphery stations where there's little other than the subway station and a park.

In [13]:
venues_df = pd.DataFrame(venues)
venues_df.columns = ['stop', 'Lat', 'Lon', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

venues_df.head()

Unnamed: 0,stop,Lat,Lon,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Station Angrignon,45.44646599999288,-73.60311799999998,Carrefour Angrignon,45.44795,-73.615155,Shopping Mall
1,Station Angrignon,45.44646599999288,-73.60311799999998,Parc Angrignon,45.443001,-73.603334,Park
2,Station Angrignon,45.44646599999288,-73.60311799999998,allô! mon coco,45.448993,-73.609534,Breakfast Spot
3,Station Angrignon,45.44646599999288,-73.60311799999998,Dilallo Burger,45.450364,-73.598175,Deli / Bodega
4,Station Angrignon,45.44646599999288,-73.60311799999998,Sports Experts,45.44628,-73.614556,Sporting Goods Shop


#### Now, let's have a look at the caregories we picked up.

In [14]:
print(venues_df['VenueCategory'].unique())

['Shopping Mall' 'Park' 'Breakfast Spot' 'Deli / Bodega'
 'Sporting Goods Shop' 'Discount Store' 'Furniture / Home Store'
 'Smoke Shop' 'Liquor Store' 'Clothing Store' 'Comedy Club' 'Restaurant'
 'Automotive Shop' 'Food & Drink Shop' 'Pizza Place' 'BBQ Joint'
 'Chinese Restaurant' 'Beer Store' 'Beer Bar' 'Grocery Store' 'Trail'
 'Café' 'Convenience Store' 'Italian Restaurant'
 'Middle Eastern Restaurant' 'Fast Food Restaurant' 'Market' 'Canal'
 'Bakery' 'Gym' 'Cheese Shop' 'Office' 'Movie Theater' 'Department Store'
 'Gourmet Shop' 'Supermarket' 'Japanese Restaurant' 'Burger Joint'
 'Museum' 'Bagel Shop' 'Bookstore' 'Church' 'Monument / Landmark' 'Plaza'
 'Skating Rink' 'Performing Arts Venue' 'Hotel' 'Indie Movie Theater'
 'Record Shop' 'Hostel' 'Pub' 'Gay Bar' 'Asian Restaurant' 'Coffee Shop'
 'Gastropub' 'Hot Dog Joint' 'Fish Market' 'Health Food Store'
 'Recreation Center' 'Bar' 'Farmers Market' 'French Restaurant'
 'Sandwich Place' 'College Gym' 'Vegetarian / Vegan Restaurant'
 'P

#### We notice there's some distinctly non-touristy venues. Would someone visiting Montreal really care about residential buildings? Or discount stores? Let's remove those from our dataframe.

In [15]:
venues_df = venues_df[~venues_df['VenueCategory'].isin(['Playground', 'Grocery Store', 'Convenience Store', 'Supermarket', 'Office', 'Electronics Store', 'College Gym', 'Automotive Shop', 'Auto Dealership', 'Building', 'Coworking Space', 'Pharmacy', 'College Cafeteria', 'Metro Station', 'Clothing Store', 'Furniture / Home Store', 'Department Store', 'Gym / Fitness Center', 'Residential Building (Apartment / Condo)', 'Discount Store'])]

In [16]:
print(venues_df['VenueCategory'].unique())

['Shopping Mall' 'Park' 'Breakfast Spot' 'Deli / Bodega'
 'Sporting Goods Shop' 'Smoke Shop' 'Liquor Store' 'Comedy Club'
 'Restaurant' 'Food & Drink Shop' 'Pizza Place' 'BBQ Joint'
 'Chinese Restaurant' 'Beer Store' 'Beer Bar' 'Trail' 'Café'
 'Italian Restaurant' 'Middle Eastern Restaurant' 'Fast Food Restaurant'
 'Market' 'Canal' 'Bakery' 'Gym' 'Cheese Shop' 'Movie Theater'
 'Gourmet Shop' 'Japanese Restaurant' 'Burger Joint' 'Museum' 'Bagel Shop'
 'Bookstore' 'Church' 'Monument / Landmark' 'Plaza' 'Skating Rink'
 'Performing Arts Venue' 'Hotel' 'Indie Movie Theater' 'Record Shop'
 'Hostel' 'Pub' 'Gay Bar' 'Asian Restaurant' 'Coffee Shop' 'Gastropub'
 'Hot Dog Joint' 'Fish Market' 'Health Food Store' 'Recreation Center'
 'Bar' 'Farmers Market' 'French Restaurant' 'Sandwich Place'
 'Vegetarian / Vegan Restaurant' 'Portuguese Restaurant' 'Ice Cream Shop'
 'Poutine Place' 'Athletics & Sports' 'Garden' 'Sports Club'
 'Greek Restaurant' 'Planetarium' 'Golf Course' 'Thai Restaurant'
 'Donu

In [17]:
venues_df.head()

Unnamed: 0,stop,Lat,Lon,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Station Angrignon,45.44646599999288,-73.60311799999998,Carrefour Angrignon,45.44795,-73.615155,Shopping Mall
1,Station Angrignon,45.44646599999288,-73.60311799999998,Parc Angrignon,45.443001,-73.603334,Park
2,Station Angrignon,45.44646599999288,-73.60311799999998,allô! mon coco,45.448993,-73.609534,Breakfast Spot
3,Station Angrignon,45.44646599999288,-73.60311799999998,Dilallo Burger,45.450364,-73.598175,Deli / Bodega
4,Station Angrignon,45.44646599999288,-73.60311799999998,Sports Experts,45.44628,-73.614556,Sporting Goods Shop


#### Get dummy variables.

In [18]:
venues_oh = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

venues_oh.head()

Unnamed: 0,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Beer Bar,...,Sushi Restaurant,Thai Restaurant,Theme Park Ride / Attraction,Thrift / Vintage Store,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
venues_oh.insert(0, 'stop', venues_df['stop'])

central_venues = venues_oh.groupby(["stop"]).mean().reset_index()

In [20]:
central_venues.head()

Unnamed: 0,stop,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,...,Sushi Restaurant,Thai Restaurant,Theme Park Ride / Attraction,Thrift / Vintage Store,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Yoga Studio
0,Station Acadie,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,...,0.111111,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.111111,0.0
1,Station Angrignon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Station Assomption,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Station Atwater,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Station Beaubien,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0


In [21]:
areaColumns = ['stop']
freqColumns = []
for ind in np.arange(5):
    freqColumns.append('Top {}'.format(ind+1))

columns = areaColumns+freqColumns

top5_venues = pd.DataFrame(columns=columns)
top5_venues['stop'] = central_venues['stop']

top5_venues.head()

Unnamed: 0,stop,Top 1,Top 2,Top 3,Top 4,Top 5
0,Station Acadie,,,,,
1,Station Angrignon,,,,,
2,Station Assomption,,,,,
3,Station Atwater,,,,,
4,Station Beaubien,,,,,


#### Get top 5 venues for each location.

In [22]:
for ind in np.arange(central_venues.shape[0]):
    row_categories = central_venues.iloc[ind, :].iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    top5_venues.iloc[ind, 1:] = row_categories_sorted.index.values[0:5]

In [23]:
top5_venues.head()

Unnamed: 0,stop,Top 1,Top 2,Top 3,Top 4,Top 5
0,Station Acadie,Greek Restaurant,Sushi Restaurant,Bakery,Sandwich Place,North Indian Restaurant
1,Station Angrignon,Shopping Mall,Deli / Bodega,Park,Smoke Shop,Liquor Store
2,Station Assomption,Pizza Place,Italian Restaurant,Greek Restaurant,Golf Course,Planetarium
3,Station Atwater,Pizza Place,Burger Joint,Gourmet Shop,Bagel Shop,Museum
4,Station Beaubien,Beer Store,Poutine Place,Trail,Bakery,Farmers Market


#### Import and run k-means clustering on the venues.

In [24]:
from sklearn.cluster import KMeans

In [25]:
cluster_df = central_venues.drop(["stop"], axis = 1)

kmeans = KMeans(n_clusters=3).fit(cluster_df)

In [26]:
cluster_df = df_metro_geo.copy()
cluster_df = pd.merge(cluster_df, central_venues, on=['stop'], how='inner')
cluster_df.head()

Unnamed: 0,stop,lat,lon,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,Bank,...,Sushi Restaurant,Thai Restaurant,Theme Park Ride / Attraction,Thrift / Vintage Store,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Yoga Studio
0,Station Angrignon,45.44646599999288,-73.60311799999998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Station Monk,45.45115799999289,-73.593242,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Station Jolicoeur,45.45700999999288,-73.58169099999999,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Station Verdun,45.45944099999288,-73.57202099999999,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0
4,Station De l'Église,45.46189399999288,-73.567074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0


#### Add in the cluster labels at the begining of the dataframe.

In [27]:
cluster_df["Cluster_labels"] = kmeans.labels_
cluster_df = cluster_df.join(top5_venues.set_index("stop"), on="stop")
cluster_df.head()

Unnamed: 0,stop,lat,lon,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,Bank,...,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Yoga Studio,Cluster_labels,Top 1,Top 2,Top 3,Top 4,Top 5
0,Station Angrignon,45.44646599999288,-73.60311799999998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1,Shopping Mall,Deli / Bodega,Park,Smoke Shop,Liquor Store
1,Station Monk,45.45115799999289,-73.593242,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1,Breakfast Spot,Comedy Club,Deli / Bodega,Park,Dessert Shop
2,Station Jolicoeur,45.45700999999288,-73.58169099999999,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,Breakfast Spot,Park,Food & Drink Shop,Chinese Restaurant,Restaurant
3,Station Verdun,45.45944099999288,-73.57202099999999,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,Park,Breakfast Spot,Trail,Beer Bar,Beer Store
4,Station De l'Église,45.46189399999288,-73.567074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1,Breakfast Spot,Café,Italian Restaurant,Beer Store,Beer Bar


In [28]:
cluster_df.sort_values(["Cluster_labels"], inplace=True)

mid = cluster_df['Cluster_labels']
cluster_df.drop(labels=['Cluster_labels'], axis=1,inplace = True)
cluster_df.insert(0, 'Cluster_labels', mid)
cluster_df = cluster_df[['Cluster_labels', 'stop', 'lat', 'lon', 'Top 1', 'Top 2', 'Top 3', 'Top 4', 'Top 5']]

cluster_df = cluster_df.reset_index(drop=True)
print(cluster_df.shape)
cluster_df.head()

(72, 9)


Unnamed: 0,Cluster_labels,stop,lat,lon,Top 1,Top 2,Top 3,Top 4,Top 5
0,0,Station Peel,45.50087899999283,-73.57471499999998,Plaza,Bookstore,Japanese Restaurant,Monument / Landmark,Movie Theater
1,0,Station Snowdon 2,45.48543299999285,-73.62773,Portuguese Restaurant,Gym,Pizza Place,Deli / Bodega,Bakery
2,0,Station Berri-UQAM 2,45.51521199999282,-73.561051,Beer Bar,Gay Bar,Hostel,Record Shop,Pub
3,0,Station Cartier,45.55999699999278,-73.68230999999997,Park,Coffee Shop,Video Store,Hockey Arena,Sushi Restaurant
4,0,Station Sherbrooke,45.51905999999282,-73.56921699999998,Park,Portuguese Restaurant,Poutine Place,Deli / Bodega,Record Shop


#### Let's define some colors for these clusters.

In [29]:
color_list = cluster_df["Cluster_labels"]
color_df = pd.DataFrame(color_list)
color_df.rename(columns = {'Cluster_labels':'colors'}, inplace = True)

In [30]:
color_df["colors"] = color_df["colors"].replace(0, 'yellow')
color_df["colors"] = color_df["colors"].replace(1, 'red')
color_df["colors"] = color_df["colors"].replace(2, 'blue')
color_df["colors"] = color_df["colors"].replace(3, 'green')
color_df["colors"] = color_df["colors"].replace(4, 'purple')
cluster_df.insert(0, 'colors', color_df)

#### And now, let's have a look at a map of these clusters.

In [31]:
mtl_venue_map = folium.Map(location = [45.52, -73.62], zoom_start = 12, tiles = 'stamenterrain')

for row in cluster_df.itertuples():
    mtl_venue_map.add_child(folium.CircleMarker(location = [row.lat, row.lon],
                                  color = row.colors,
                                  fill = True,
                                  fill_color = row.colors,
                                  fill_opacity = 0.5,
                                  popup = [("stop:", row.stop), ("Cluster:", row.Cluster_labels)]))

mtl_venue_map

In [32]:
cluster_df.loc[cluster_df['Cluster_labels'] == 0, cluster_df.columns[[1] + [2] + list(range(5, cluster_df.shape[1]))]]

Unnamed: 0,Cluster_labels,stop,Top 1,Top 2,Top 3,Top 4,Top 5
0,0,Station Peel,Plaza,Bookstore,Japanese Restaurant,Monument / Landmark,Movie Theater
1,0,Station Snowdon 2,Portuguese Restaurant,Gym,Pizza Place,Deli / Bodega,Bakery
2,0,Station Berri-UQAM 2,Beer Bar,Gay Bar,Hostel,Record Shop,Pub
3,0,Station Cartier,Park,Coffee Shop,Video Store,Hockey Arena,Sushi Restaurant
4,0,Station Sherbrooke,Park,Portuguese Restaurant,Poutine Place,Deli / Bodega,Record Shop
5,0,Station Saint-Laurent,Plaza,Indie Movie Theater,Pizza Place,Performing Arts Venue,Movie Theater
6,0,Station Jean-Drapeau,French Restaurant,Historic Site,Deli / Bodega,Racetrack,Park
7,0,Station LongueuilUniversité-de-Sherbrooke,Theme Park Ride / Attraction,Shopping Mall,Bakery,Restaurant,Middle Eastern Restaurant
8,0,Station Viau,Athletics & Sports,Bakery,Park,Farmers Market,Sports Club
9,0,Station Pie-IX,Athletics & Sports,Bakery,Performing Arts Venue,Park,Farmers Market


In [33]:
cluster_df.loc[cluster_df['Cluster_labels'] == 1, cluster_df.columns[[1] + [2] + list(range(5, cluster_df.shape[1]))]]

Unnamed: 0,Cluster_labels,stop,Top 1,Top 2,Top 3,Top 4,Top 5
16,1,Station Place-d'Armes,Plaza,Beer Bar,Performing Arts Venue,Cocktail Bar,Church
17,1,Station Square-VictoriaOACI,Church,Historic Site,Bookstore,Performing Arts Venue,Movie Theater
18,1,Station Vendôme,Coffee Shop,Bakery,Park,Pub,Café
19,1,Station Mont-Royal,Park,Bakery,Portuguese Restaurant,Poutine Place,Bagel Shop
20,1,Station Angrignon,Shopping Mall,Deli / Bodega,Park,Smoke Shop,Liquor Store
21,1,Station Henri-Bourassa,Park,Deli / Bodega,Dog Run,Bakery,Organic Grocery
22,1,Station Rosemont,Café,Bakery,Poutine Place,Bagel Shop,Park
23,1,Station Jean-Talon 5,Bakery,Dessert Shop,Athletics & Sports,Park,Farmers Market
24,1,Station Jean-Talon 2,Bakery,Dessert Shop,Athletics & Sports,Park,Farmers Market
25,1,Station De la Concorde,Concert Hall,Breakfast Spot,Coffee Shop,Bakery,Food Court


In [34]:
cluster_df.loc[cluster_df['Cluster_labels'] == 2, cluster_df.columns[[1] + [2] + list(range(5, cluster_df.shape[1]))]]

Unnamed: 0,Cluster_labels,stop,Top 1,Top 2,Top 3,Top 4,Top 5
58,2,Station Outremont,Coffee Shop,Bakery,Sandwich Place,North Indian Restaurant,Park
59,2,Station Guy-Concordia,Movie Theater,Plaza,Monument / Landmark,Gym,Church
60,2,Station Jarry,Portuguese Restaurant,Bakery,Park,Fast Food Restaurant,Brewery
61,2,Station Montmorency,Concert Hall,Bookstore,Record Shop,Shopping Mall,Coffee Shop
62,2,Station Sauvé,Pizza Place,Burger Joint,Park,Organic Grocery,Baseball Field
63,2,Station Crémazie,Vietnamese Restaurant,Brewery,Athletics & Sports,Park,Fast Food Restaurant
64,2,Station Beaubien,Beer Store,Poutine Place,Trail,Bakery,Farmers Market
65,2,Station Champ-de-Mars,Plaza,Historic Site,Café,Gym,Movie Theater
66,2,Station Radisson,Pizza Place,Shopping Mall,Pet Store,Restaurant,Arts & Crafts Store
67,2,Station Bonaventure,Plaza,Burger Joint,Gym,Park,Church
