# Battle of the Neighborhoods
## Venue recommendation by Subway Station in Montreal

### Table of Contents

##### Setup:
    A. Problem
    B. Background
    C. Data
##### Report:
    1. Introduction
    2. Data
    3. Methodology
    4. Results
    5. Discussion
    6. Conclusion
##### Appendix:
    a. Code
    b. Data Sets
    c. Miscellaneous

## **Setup**

### **A. Problem:**

Every summer, the influx of tourists for the festival season creates incredible gridlock in the streets of Montreal. Much to the dismay of locals, very few visitors use the amazing public transport system to get from place to place, opting instead to drive around. I believe this is due to a lack of awareness of all the restaurants, bars, theatres, and other points of interest that can easily be reached by subway in a matter of minutes.

### **B. Background:**

As a Montreal local, I have seen it every summer. Droves of tourists clutter the streets of the city, as they try to drive a few blocks in rush-hour traffic. It always boggles the mind that they would opt to spend 20 minutes in downtown traffic, instead of jumping on the subway for a few stops and getting to their destination in relative tranquility (the odd tipsy university student notwithstanding). Figuring it’s probably due to their lack of knowledge, I have decided to help them out by clustering and comparing the top venues near each of our subway stations. 

### **C. Data:**

We will be working mainly with two datasets for this project. 

First and foremost, we need geolocation coordinates for Montreal’s 68 subway (or Metro) stations. The source for these coordinates is the City of Montreal’s Open Data Portal (http://donnees.ville.montreal.qc.ca/dataset). More specifically, we will be using their dataset on “STM Bus and Subway lines” (http://donnees.ville.montreal.qc.ca/dataset/stm-traces-des-lignes-de-bus-et-de-metro). Unfortunately, it is only available as a large Shapefile (.shp), so we are going to have to do a lot of cleanup to make it workable.

Our second dataset will be the venue information queried from Foursquare using the geolocation coordinates obtained above. Instead of focusing on quantity (i.e. concentration of venues in a location), we will be focusing on quality (i.e. what are the top venues in a location). We are, after all, trying to convince tourists to use our world-class public transport system instead of contributing to the summer gridlock, and what better way than to guide them to the best Metro stations, with the best venues?
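
As a rough sketch of what such a query might look like, the snippet below builds a request URL for Foursquare's legacy v2 `venues/explore` endpoint. The choice of endpoint, the placeholder credentials, the version date, and the radius and limit values are all assumptions made for illustration; the query actually used in this project may differ.

```python
from urllib.parse import urlencode

# Assumed endpoint: Foursquare's legacy v2 "explore" API.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lon, client_id, client_secret,
                      version="20200401", radius=500, limit=10):
    """Return a venue-exploration URL centred on one station (hypothetical helper)."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (assumed value)
        "ll": f"{lat},{lon}",            # "latitude,longitude"
        "radius": radius,                # search radius in metres (assumed value)
        "limit": limit,                  # max venues returned (assumed value)
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


url = build_explore_url(45.446466, -73.603118, "YOUR_ID", "YOUR_SECRET")
print(url)
```

The helper only assembles the URL; sending the request and parsing the response are left out on purpose.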


## **Appendix**

### **Code**

#### Start by importing the libraries we need for this project.

In [1]:
!pip install matplotlib -U
!pip install descartes -U
import pandas as pd
import geopandas as gpd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import folium
from folium import plugins
import seaborn as sns
import descartes
import re



#### Because we're working with a Shapefile, we read it into a GeoPandas GeoDataFrame. We will eventually convert it to a regular pandas DataFrame for convenience's sake.

In [2]:
df = gpd.read_file('stm_arrets_sig.shp')
df.head()

| | stop_id | stop_code | stop_name | stop_url | wheelchair | route_id | loc_type | service_id | geometry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 43-01 | 10118 | Station Angrignon | | 2 | | 2 | 20M | POINT (296677.562 5034048.338) |
| 1 | 43 | 10118 | Station Angrignon | http://www.stm.info/fr/infos/reseaux/metro/ang... | 2 | 1.0 | 0 | 20M | POINT (296733.669 5034064.602) |
| 2 | 42-01 | 10120 | Station Monk - Édicule Nord | | 2 | | 2 | 20M | POINT (297515.753 5034601.626) |
| 3 | 42-02 | 10120 | Station Monk - Édicule Sud | | 2 | | 2 | 20M | POINT (297496.004 5034568.310) |
| 4 | 42 | 10120 | Station Monk | http://www.stm.info/fr/infos/reseaux/metro/monk | 2 | 1.0 | 0 | 20M | POINT (297506.817 5034585.078) |


#### Let's do a little bit of cleanup. Having looked through the file, I noticed we can drop all of the rows with "None" for route_id, as they are all duplicates. Note that `how = "any"` drops a row if any of its columns is missing, not just route_id. We're also going to drop the columns we don't need.

In [3]:
df.replace(r'None', np.nan, regex=True, inplace = True)
df.dropna(axis = 0, how = "any", inplace = True)
df.drop(['stop_id', 'stop_code', 'wheelchair', 'loc_type', 'service_id'], axis = 1, inplace = True)
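
To illustrate the cleanup step, here is a small sketch on a toy table. The rows are made up to mirror the shape of the real data (only the Station Monk URL is taken from the output above). Whether the Shapefile stores its gaps as real missing values or as the literal string "None" is an assumption either way, so the sketch handles both cases.

```python
import numpy as np
import pandas as pd

# Toy rows: one entrance with real missing values, one proper station,
# and one made-up stop whose gaps are stored as the string "None".
toy = pd.DataFrame({
    "stop_name": ["Station Monk - Édicule Nord", "Station Monk", "Some bus stop"],
    "stop_url": [None, "http://www.stm.info/fr/infos/reseaux/metro/monk", "None"],
    "route_id": [np.nan, "1", "None"],
})

# Turn literal "None" strings into real missing values,
# then drop every row that has at least one missing value.
cleaned = toy.replace("None", np.nan).dropna(axis=0, how="any")
print(cleaned["stop_name"].tolist())
```

pandas treats both `None` and `NaN` as missing, so `dropna` removes the first row even without the `replace` step.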

#### Every subway station and bus stop has an associated URL (used to look up subway and bus schedules). Let's create a new dataframe containing only those entries whose URL contains the word "metro", which indicates a subway station.

In [4]:
# A plain substring match is enough; .copy() gives an independent frame to modify later.
df_metro = df[df['stop_url'].str.contains('metro')].copy()
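
A minimal sketch of the same filter on a toy table, for cases where some URLs might still be missing. The bus-stop rows and the `example.com` URL are made up for illustration.

```python
import pandas as pd

stops = pd.DataFrame({
    "stop_name": ["Station Monk", "Hypothetical bus stop", "Stop without URL"],
    "stop_url": [
        "http://www.stm.info/fr/infos/reseaux/metro/monk",
        "http://example.com/bus/000",  # made-up URL
        None,
    ],
})

# regex=False treats "metro" as a literal substring;
# na=False makes rows with a missing URL count as non-matches.
is_metro = stops["stop_url"].str.contains("metro", regex=False, na=False)
metro_only = stops[is_metro].copy()
print(metro_only["stop_name"].tolist())
```

Without `na=False`, a missing URL would produce a missing value in the mask, and boolean indexing with such a mask raises an error.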

#### Let's re-project the geometry data to a coordinate system we are more familiar with, WGS 84 latitude and longitude (EPSG:4326), which is also what folium expects.

In [None]:
df_metro = df_metro.to_crs(epsg=4326)
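
For a sense of what the reprojection does to a single point, here is a minimal sketch using `pyproj` directly. It assumes the Shapefile's native coordinate system is NAD83 / MTM zone 8 (EPSG:32188). That is consistent with the coordinate values in the raw table but is not stated in the source; inspecting `df.crs` would confirm it.

```python
from pyproj import Transformer

# Assumption: source CRS is NAD83 / MTM zone 8 (EPSG:32188).
# always_xy=True keeps the order (easting, northing) -> (lon, lat).
to_wgs84 = Transformer.from_crs("EPSG:32188", "EPSG:4326", always_xy=True)

# Projected coordinates of Station Angrignon from the raw table.
lon, lat = to_wgs84.transform(296733.669, 5034064.602)
print(round(lon, 4), round(lat, 4))
```

If the assumption holds, the result should land close to the longitude and latitude that `to_crs` produces for the same station.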

#### Now let's convert the GeoDataFrame to a regular DataFrame, and cast the "geometry" column to string so that we can parse it with regex. This is needed because the geometry column of a GeoDataFrame holds point objects rather than text.

In [5]:
df_metro = pd.DataFrame(df_metro)
df_metro['geometry'] = df_metro['geometry'].astype('str')
print(df_metro.dtypes)
df_metro.head()

stop_name    object
stop_url     object
route_id     object
geometry     object
dtype: object


| | stop_name | stop_url | route_id | geometry |
|---|---|---|---|---|
| 1 | Station Angrignon | http://www.stm.info/fr/infos/reseaux/metro/ang... | 1 | POINT (-73.60311799999998 45.44646599999288) |
| 4 | Station Monk | http://www.stm.info/fr/infos/reseaux/metro/monk | 1 | POINT (-73.593242 45.45115799999289) |
| 6 | Station Jolicoeur | http://www.stm.info/fr/infos/reseaux/metro/jol... | 1 | POINT (-73.58169099999999 45.45700999999288) |
| 9 | Station Verdun | http://www.stm.info/fr/infos/reseaux/metro/verdun | 1 | POINT (-73.57202099999999 45.45944099999288) |
| 12 | Station De l'Église | http://www.stm.info/fr/infos/reseaux/metro/de-... | 1 | POINT (-73.56707400000001 45.46189399999288) |


#### Now, let's parse the clunky dataframe and extract all of the useful information into a new one. We will take this opportunity to clean up the geolocation coordinates and turn them into something more usable.

In [6]:
df_metro_geo = pd.DataFrame()
df_metro_geo['stop'] = ''
df_metro_geo['lat'] = ''
df_metro_geo['lon'] = ''

In [7]:
# Column-wise assignment; no explicit loop is needed.
df_metro_geo['stop'] = df_metro['stop_name']
# The geometry strings read "POINT (lon lat)": longitude first, then latitude.
df_metro_geo['lon'] = df_metro['geometry'].str.extract(r"(-\d+\.\d*)", expand=False)
df_metro_geo['lat'] = df_metro['geometry'].str.extract(r"\s(\d+\.\d*)\)", expand=False)
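
The extraction can be checked on a single string with the standard `re` module. This sketch assumes every geometry is a string of the form `POINT (lon lat)`, longitude first; the helper name `parse_point` is made up for illustration.

```python
import re

# Two signed decimal numbers inside "POINT (...)": longitude, then latitude.
POINT_RE = re.compile(r"POINT \((-?\d+\.?\d*) (-?\d+\.?\d*)\)")


def parse_point(wkt):
    """Return (lat, lon) as floats from a point string (hypothetical helper)."""
    match = POINT_RE.fullmatch(wkt.strip())
    if match is None:
        raise ValueError(f"not a point string: {wkt!r}")
    lon, lat = map(float, match.groups())
    return lat, lon


print(parse_point("POINT (-73.593242 45.45115799999289)"))
```

Unlike the column-wise extraction, this version returns floats and fails loudly on a string it does not recognise.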

In [8]:
df_metro_geo.head()

| | stop | lat | lon |
|---|---|---|---|
| 1 | Station Angrignon | 45.44646599999288 | -73.60311799999998 |
| 4 | Station Monk | 45.45115799999289 | -73.593242 |
| 6 | Station Jolicoeur | 45.45700999999288 | -73.58169099999999 |
| 9 | Station Verdun | 45.45944099999288 | -73.57202099999999 |
| 12 | Station De l'Église | 45.46189399999288 | -73.567074 |
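
The same stop/lat/lon table can also be built without any string parsing, since a GeoSeries of points exposes its coordinates through the `x` and `y` attributes. The sketch below uses a small stand-in GeoDataFrame with two stations (coordinates rounded from the output above; variable names are illustrative). In the notebook, this would be applied to `df_metro` before it is converted to a plain DataFrame.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Stand-in for the reprojected df_metro (EPSG:4326, so x = lon, y = lat).
gdf = gpd.GeoDataFrame(
    {"stop_name": ["Station Angrignon", "Station Monk"]},
    geometry=[Point(-73.603118, 45.446466), Point(-73.593242, 45.451158)],
    crs="EPSG:4326",
)

stations = pd.DataFrame({
    "stop": gdf["stop_name"],
    "lat": gdf.geometry.y,  # floats, not strings
    "lon": gdf.geometry.x,
})
print(stations)
```

Keeping the coordinates as floats also avoids relying on string-to-number conversion later on.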


#### Finally, let's draw a map of Montreal, and use our newly cleaned-up geolocation coordinates to mark all of the subway stations.

In [9]:
mtl_map = folium.Map(location = [45.52, -73.62], zoom_start = 12, tiles = 'stamenterrain')

for row in df_metro_geo.itertuples():
    mtl_map.add_child(folium.CircleMarker(location = [row.lat, row.lon],
                                         radius = 5,
                                         fill = True,
                                         fill_color = 'red',
                                         fill_opacity = 0.7,
                                         popup = row.stop))

mtl_map