# Neighbourhoods in Toronto 3

## Importing postcode data from Wikipedia

### Task outline

Use your Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

[See the course website]

To create the above dataframe:

* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
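
The three cleaning rules above can also be expressed directly in pandas. The following is only a minimal sketch on hand-made rows (the values are illustrative, not scraped), and it is an alternative to the loop-based approach used later in this notebook, not a description of it:

```python
import pandas as pd

# Toy rows mimicking the Wikipedia table; values are illustrative only
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Rule 1: drop rows whose borough is not assigned
kept = raw[raw['Borough'] != 'Not assigned'].copy()

# Rule 2: an unassigned neighborhood takes the borough's name
unassigned = kept['Neighborhood'] == 'Not assigned'
kept.loc[unassigned, 'Neighborhood'] = kept.loc[unassigned, 'Borough']

# Rule 3: merge duplicate postal codes, joining neighborhoods with a comma
tidy = (kept.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
            .agg(', '.join)
            .reset_index())
print(tidy)
print(tidy.shape)
```

Grouping on both `PostalCode` and `Borough` assumes each postal code belongs to exactly one borough, which is the same assumption the task description makes.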

#### Update
For goodness' sake!  I spent *hours* figuring out exactly how to extract the tags from the table on the current Wikipedia page.  Then I saw that I can use a previous version of the page if I choose; someone suggested https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050, which makes the whole thing trivial.
*Unimpressed*.

### Coding

In [1]:
#url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050.'

I will be using the lxml parser.  It's probably installed on your system, but if not, uncomment the next cell:

In [2]:
#!pip3 install lxml

In [3]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
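# json_normalize lives at the top level of pandas since 1.0;
# the old pandas.io.json import path is deprecated
from pandas import json_normalize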
import matplotlib as mpl
import numpy as np
print('Loading url... ', end='')
html = requests.get(url).text
print('done.\nParsing markup...', end='')
parsed = BeautifulSoup(html, 'lxml')
print('done.')

Loading url... done.
Parsing markup...done.


In [4]:
print('Extracting information... ', end='')

# Find the table (there's only one so 'find' is good enough)
table = parsed.find('table',{'class':'wikitable sortable'})

# Make a collection of separate rows
rows = table.find_all('tr')

# Lists to hold all the data we want
pcodes = []
boroughs = []
neighbourhoods = []

for row in rows[1:]:
    # Each row has three cells
    postcode, borough, neighbourhood = row.find_all('td')
    
    # Get rid of the bumf in each tag
    postcode = postcode.string
    borough = borough.string
    # Some neighbourhoods come out with newlines attached
    # Sometimes they are singletons but [0] still works as
    # these are not *actually* strs
    neighbourhood = str(list(neighbourhood.strings)[0]).rstrip()
    
    #Skip the row if there is no borough
    if (borough != 'Not assigned'):
        #Assign neighbourhood the borough name if none is assigned
        if (neighbourhood == 'Not assigned'):
            neighbourhood = borough
        pcodes.append(postcode)
        boroughs.append(borough)
        neighbourhoods.append(neighbourhood)
print('done.')

Extracting information... done.


Now that we have the data in three lists, it is time to loop over each postcode, extracting the borough and a list of all its neighbourhoods.  My structure is two dicts, each keyed by ```postcode```, whose values are ```borough``` and ```[neighbourhoods]``` respectively.
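
Looking up a label in a pandas index that contains duplicates returns a different type depending on how often the label occurs, which is why the next cell needs two code paths. A small sketch on illustrative rows (not the scraped data):

```python
import pandas as pd

# Illustrative rows only: M5A appears twice, M3A once
codes = pd.DataFrame({'neighbourhood': ['Parkwoods', 'Harbourfront', 'Regent Park']},
                     index=['M3A', 'M5A', 'M5A'])

single = codes['neighbourhood']['M3A']    # label occurs once -> plain str
multiple = codes['neighbourhood']['M5A']  # label is repeated -> Series

print(type(single).__name__, type(multiple).__name__)

# Selecting with a list of labels always yields a Series,
# so a single code path would also be possible
as_list = codes['neighbourhood'].loc[['M3A']].to_list()
print(as_list)
```

Selecting with `.loc[[code]]` is shown only as one possible way to avoid the special case; the cell below keeps the try/except approach.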

In [5]:
# This cell is not idempotent as I decided to reuse variable names.
# Don't like it?  Sue me.
codes = pd.DataFrame({'borough':boroughs, 'neighbourhood':neighbourhoods},
                     index = pcodes)
postcodes = list(dict.fromkeys(pcodes).keys())
neighbourhoods={} #I'll have one df where these are single strings
hoodlists={}  #And one where they are lists
boroughs = {}
postalcodes = {}
for code in postcodes:
    postalcodes[code] = code
    # .to_list() fails on a singleton neighbourhood (the lookup returns a
    # plain str, not a Series), so catch the exception and handle it as a
    # single neighbourhood.  It's a little bit cleaner than a further if.
    try: #Multiple neighbourhoods need Series -> list and one borough name
        hoodlists[code] = codes['neighbourhood'][code].to_list()
        neighbourhoods[code] = ', '.join(hoodlists[code])
        #The borough names will all be the same, so just choose the first
        boroughs[code] = codes['borough'][code].to_list()[0]
    except AttributeError: #A single neighbourhood needs item->[item] and the borough name
        hoodlists[code] = [codes['neighbourhood'][code]]
        neighbourhoods[code] = hoodlists[code][0]
        #Unlike above, here there will only be a single borough name
        boroughs[code] = codes['borough'][code]

#This is my final answer to Q1
final = pd.DataFrame({'PostalCode': postalcodes,
                      'Borough':boroughs,
                      'Neighbourhood':neighbourhoods})
#I may not need a list of neighbourhoods but it took no additional effort
final_lists = pd.DataFrame({'PostalCode': postalcodes, 
                          'Borough':boroughs,
                          'Neighbourhood':hoodlists})

In [6]:
# If your screen is not wide enough, just reduce this
pd.set_option('max_colwidth', 150)

In [7]:
# Reset the index to match the question format
display = final.reset_index(drop = True, inplace = False)
display

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Downtown Toronto | Queen's Park |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business Reply Mail Processing Centre 969 Eastern |
| 101 | M8Y | Etobicoke | Humber Bay, King's Mill Park, Kingsway Park South East, Mimico NE, Old Mill South, The Queensway East, Royal York South East, Sunnylea |


In [8]:
print(f'The data frame has {final.shape[0]} rows.')

The data frame has 103 rows.


## Adding geolocation data
### Outline
Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data.

Use the Geocoder package or the csv file to create the following dataframe:

[See the course website]

Clearly I will use the csv file - it is much easier.

### Coding

So I will load the geospatial data, which contains postcodes and lat/lon data.  Indexing by the postal code should allow me to easily join the two tables.
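
As a small illustration of why indexing by postal code makes the join easy, here is a sketch with two tiny frames (the coordinates are copied from outputs in this notebook, and the frame names are made up for the example):

```python
import pandas as pd

# Illustrative frames; note the rows are in a different order in each
places = pd.DataFrame({'Borough': ['North York', 'North York']},
                      index=['M3A', 'M4A'])
coords = pd.DataFrame({'Latitude': [43.725882, 43.753259],
                       'Longitude': [-79.315572, -79.329656]},
                      index=pd.Index(['M4A', 'M3A'], name='Postal Code'))

# join() aligns on the index labels, so row order in either frame is irrelevant
merged = places.join(coords, how='inner')
print(merged)
```

With `how='inner'`, any postal code missing from either frame is silently dropped, so comparing row counts afterwards is a cheap sanity check.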

In [9]:
# Where is the geospatial data - quick glance shows it is just a plain CSV
# with a header and no index.
url = 'http://cocl.us/Geospatial_data'

In [10]:
geo = pd.read_csv(url)
geo.set_index('Postal Code', inplace = True )
print(geo.shape)
geo.head(3)

(103, 2)


| Postal Code | Latitude | Longitude |
|---|---|---|
| M1B | 43.806686 | -79.194353 |
| M1C | 43.784535 | -79.160497 |
| M1E | 43.763573 | -79.188711 |


Nice - it looks like we have exactly the same number of entries.  This might actually work!

In [11]:
full = final.join(geo, how='inner')
full_lists = final_lists.join(geo, how='inner')
if full.shape[0] == geo.shape[0]:
    print("Successfully merged locations with lat/long data.")
    print(full.head())
else:
    print("***ERROR*** We did not match all the rows")

Successfully merged locations with lat/long data.
    PostalCode           Borough                     Neighbourhood   Latitude  \
M3A        M3A        North York                         Parkwoods  43.753259   
M4A        M4A        North York                  Victoria Village  43.725882   
M5A        M5A  Downtown Toronto                      Harbourfront  43.654260   
M6A        M6A        North York  Lawrence Heights, Lawrence Manor  43.718518   
M7A        M7A  Downtown Toronto                      Queen's Park  43.662301   

     Longitude  
M3A -79.329656  
M4A -79.315572  
M5A -79.360636  
M6A -79.464763  
M7A -79.389494  


In [12]:
# As before drop the index to have the same form as that in the question
# Reusing another variable;-)
display = full.reset_index(drop = True, inplace = False)
display

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.654260 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park | 43.662301 | -79.389494 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 |
| 100 | M7Y | East Toronto | Business Reply Mail Processing Centre 969 Eastern | 43.662744 | -79.321558 |
| 101 | M8Y | Etobicoke | Humber Bay, King's Mill Park, Kingsway Park South East, Mimico NE, Old Mill South, The Queensway East, Royal York South East, Sunnylea | 43.636258 | -79.498509 |


## Analysis
### Outline
Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:
* to add enough Markdown cells to explain what you decided to do and to report any observations you make.
* to generate maps to visualize your neighborhoods and how they cluster together. 

### Initial thoughts
* Need to get info about attractions & most popular attractions in each location.
* Initially try k-means but use DBSCAN if it looks/feels very non-radial.
* When clustering try **without** location data!!!
* Write generally enough so that I can use either all the data or just 'Toronto'.
* Leave time to get good-looking maps
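
One way to keep the analysis general enough for either all the data or just 'Toronto' is a small switchable filter. This is only a sketch: `select_boroughs` is a hypothetical helper, and the rows are an illustrative subset of the boroughs seen earlier:

```python
import pandas as pd

# Illustrative subset of the boroughs seen earlier in this notebook
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M7Y', 'M8X'],
    'Borough': ['North York', 'Downtown Toronto', 'East Toronto', 'Etobicoke'],
})

def select_boroughs(df, toronto_only=False):
    """Return all rows, or only those whose borough name contains 'Toronto'."""
    if toronto_only:
        return df[df['Borough'].str.contains('Toronto')]
    return df

print(select_boroughs(sample, toronto_only=True))
```

Passing the filtered frame to the later steps instead of the full one would be the only change needed to restrict the analysis.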

### Initial mapping

In [13]:
# Find out the 'centre' of Toronto & draw a map
import geocoder
import folium
g = geocoder.osm('Toronto, Canada')
lat, lng = g.latlng
#The centre of Toronto is too far South so increase latitude by 0.07 degrees
amap = folium.Map(location=[lat+0.07,lng], zoom_start=11)

#Pull data out of 'full'
# (loop variables are named pc_lat/pc_lng so they do not overwrite the
# city-centre lat/lng, which map_clusters uses later)
for postal_code, borough, neighbourhood, pc_lat, pc_lng in \
    zip(full.PostalCode,
        full.Borough,
        full.Neighbourhood,
        full.Latitude,
        full.Longitude):
        
    label = f'{postal_code}, {borough}'
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([pc_lat, pc_lng],
                        radius=5,
                        popup=label,
                        color='blue',
                        fill=False,
                        fill_color='#3388ee',
                        fill_opacity=0.7).add_to(amap)
amap

These are mostly a couple of km apart.  So I'll look for things within 1000m.
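
The spacing can be checked numerically rather than by eye. Below is a minimal sketch using the haversine formula on two of the centroids printed above; `haversine_km` is a helper written for this example, and the mean Earth radius of 6371 km is the usual approximation:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# Centroids of M3A and M4A as printed in the merged table above
d = haversine_km(43.753259, -79.329656, 43.725882, -79.315572)
print(f'M3A to M4A: {d:.2f} km')
```

Applying the same function to every pair of neighbouring centroids would give a more systematic basis for choosing the search radius.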

### Retrieving Foursquare data


In [14]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (never publish the real value)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (never publish the real value)
VERSION = '20180605' # Foursquare API version

#### Try initially for one neighbourhood

In [15]:
# Just use the first neighbourhood in our list
neighbourhood_name = full.loc['M3A','Neighbourhood']
neighbourhood_lat = full.loc['M3A','Latitude']
neighbourhood_lon = full.loc['M3A','Longitude']
radius = 1000
limit = 100
fs_base = 'https://api.foursquare.com/v2/venues/explore?'
fs_creds = f'client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}'
fs_query = f'&ll={neighbourhood_lat},{neighbourhood_lon}&radius={radius}&limit={limit}'
url = fs_base+fs_creds+fs_query
print(url)

https://api.foursquare.com/v2/venues/explore?client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180605&ll=43.7532586,-79.3296565&radius=1000&limit=100


Well, that should work.  Let's request it!
Then pull out the main list, 'items', and flatten everything in it.
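
To show what "flatten" means here, the sketch below runs `pd.json_normalize` on a hand-written stand-in for the 'items' list. The two venues are invented for illustration and are not real API output; only the nesting mirrors the keys used in this notebook:

```python
import pandas as pd

# Hand-written stand-in for response['groups'][0]['items']; not real API output
items = [
    {'venue': {'name': 'Example Cafe',
               'categories': [{'name': 'Cafe'}],
               'location': {'lat': 43.75, 'lng': -79.33}}},
    {'venue': {'name': 'Example Park',
               'categories': [],
               'location': {'lat': 43.76, 'lng': -79.32}}},
]

# Nested dicts become dotted column names; lists are left as they are
flat = pd.json_normalize(items)
print(sorted(flat.columns))
```

Because lists are not expanded, the `venue.categories` column still holds a list per row, which is why a separate step is needed to pull out the category name.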

In [16]:
venues = json_normalize(requests.get(url).json()['response']['groups'][0]['items'])

Now strip the irrelevancies out of ```venues```, keeping only the name, category and location data.  Then take the first category name from each row's category list and use it as that row's category.

In [17]:
# clean columns
#nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

def get_category(row):
    categories_list = row['venue.categories']
    if not(len(categories_list)):
        return None
    return categories_list[0]['name']

keepcols = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
venues = venues.loc[:, keepcols]
venues['venue.categories'] = venues.apply(get_category, axis=1)
# I prefer 'category', 'lat' and 'lon' to the default names
venues.columns = ['name', 'category', 'lat', 'lon']

Well that seemed to work, so I'll make a function for it for a given postcode, lat and lon and keep every venue in the same dataframe.
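
Before wrapping the request in a function, note that the query string can be assembled from a dict instead of by hand, which avoids mistakes with separators and escaping. This is a sketch only: `explore_url` is a hypothetical helper and the credentials shown are placeholders:

```python
from urllib.parse import urlencode

def explore_url(lat, lon, client_id, client_secret,
                version='20180605', radius=1000, limit=100):
    """Build a venues/explore URL; explore_url is a hypothetical helper."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lon}',
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-escapes the values, so the comma in 'll' becomes %2C
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

print(explore_url(43.7532586, -79.3296565, 'MY_ID', 'MY_SECRET'))
```

Alternatively, `requests.get` accepts the same dict through its `params` argument and does the encoding itself.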

In [18]:
def get_venues(postcodes, latitudes, longitudes, radius=1000, limit=100):
    venues_list = []
    print('Retrieving data for postcodes:')
    for pc, lat, lon in zip(postcodes, latitudes, longitudes):
        print(pc+'... ',end='')
        fs_query = f'&ll={lat},{lon}&radius={radius}&limit={limit}'
        url=fs_base+fs_creds+fs_query

        venues = requests.get(url).json()['response']['groups'][0]['items']
        
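        venues_list.append([[pc, lat, lon,
                           venue['venue']['name'],
                           venue['venue']['location']['lat'],
                           venue['venue']['location']['lng'],
                           # Guard against venues with no category, as get_category does
                           venue['venue']['categories'][0]['name']
                               if venue['venue']['categories'] else None]
                            for venue in venues])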
        print('done.\t', end='')
        
    # Flatten the per-postcode lists into a single list of rows
    records = [item for venue_list in venues_list for item in venue_list]
    dataframe = pd.DataFrame(records,
                             columns=['PostalCode','Latitude','Longitude', 'Venue',
                                      'Venue Latitude', 'Venue Longitude', 'Category'])
    return dataframe


full_list=get_venues(full['PostalCode'], full['Latitude'], full['Longitude'])

Retrieving data for postcodes:
M3A... done.	M4A... done.	M5A... done.	M6A... done.	M7A... done.	M9A... done.	M1B... done.	M3B... done.	M4B... done.	M5B... done.	M6B... done.	M9B... done.	M1C... done.	M3C... done.	M4C... done.	M5C... done.	M6C... done.	M9C... done.	M1E... done.	M4E... done.	M5E... done.	M6E... done.	M1G... done.	M4G... done.	M5G... done.	M6G... done.	M1H... done.	M2H... done.	M3H... done.	M4H... done.	M5H... done.	M6H... done.	M1J... done.	M2J... done.	M3J... done.	M4J... done.	M5J... done.	M6J... done.	M1K... done.	M2K... done.	M3K... done.	M4K... done.	M5K... done.	M6K... done.	M1L... done.	M2L... done.	M3L... done.	M4L... done.	M5L... done.	M6L... done.	M9L... done.	M1M... done.	M2M... done.	M3M... done.	M4M... done.	M5M... done.	M6M... done.	M9M... done.	M1N... done.	M2N... done.	M3N... done.	M4N... done.	M5N... done.	M6N... done.	M9N... done.	M1P... done.	M2P... done.	M4P... done.	M5P... done.	M6P... done.	M9P... done.	M1R... done.	M2R... done.	M4R... done.	M5R... 

Since we need quantitative data we'll do a one-hot encoding on categories.  Carefully consider the metric to use for similarities - Cosine distance?  Hamming?  Minkowski?
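
As a toy illustration of the encoding step, the sketch below one-hot encodes three invented venues and averages per postcode, so each row ends up holding category frequencies:

```python
import pandas as pd

# Illustrative venues: two in M1B, one in M1C
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1C'],
    'Category': ['Cafe', 'Park', 'Cafe'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy['Category'], dtype=float)

# Averaging per postcode turns the indicators into category frequencies
freq = pd.concat([toy[['PostalCode']], onehot], axis=1).groupby('PostalCode').mean()
print(freq)
```

Each row of `freq` sums to 1, so postcodes with many venues and postcodes with few are put on the same scale before any distance metric is applied.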

In [19]:
#First grab a list of all categories
categories = full_list.Category.unique()
# Get dummy variables
dummies = pd.get_dummies(full_list['Category'])
# and concatenate them in
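# numeric_only=True skips the text columns (Venue, Category), which newer
# pandas versions refuse to average
encoded = pd.concat([full_list,dummies], axis=1)\
            .groupby(by='PostalCode').mean(numeric_only=True)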
# should we drop the 'average' venue location?  Not yet.
# I might be interested to see how far the centre of all cool spots is from the centre of the postal code.

### Clustering

In [20]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

standardised = pd.DataFrame(StandardScaler().fit_transform(encoded),
                            columns = encoded.columns)

for k in range(2, 8):
    # n_jobs was removed from KMeans in scikit-learn 1.0, so it is omitted here
    kmeans = KMeans(n_clusters=k, random_state=0
                   ).fit(standardised)
    print(f'{k} clusters with counts:\
        {list(pd.Series(kmeans.labels_).value_counts())}.')
kmeans.labels_

2 clusters with counts:        [84, 18].
3 clusters with counts:        [80, 13, 9].
4 clusters with counts:        [45, 35, 16, 6].
5 clusters with counts:        [70, 13, 9, 7, 3].
6 clusters with counts:        [57, 27, 13, 2, 2, 1].
7 clusters with counts:        [72, 13, 7, 5, 3, 1, 1].


array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 5, 2,
       5, 2, 5, 2, 2, 2, 2, 2, 2, 0, 4, 4, 0, 0, 4, 4, 4, 2, 2, 2, 5, 5,
       5, 2, 4, 4, 2, 2, 2, 5, 5, 2, 3, 5, 1, 2, 2, 2, 5, 5, 2, 5, 2, 2,
       2, 5, 2, 2, 2, 2, 6, 2, 1, 2, 2, 2, 2, 2], dtype=int32)

In [21]:
g = geocoder.osm('Toronto, Canada')
lat, lng = g.latlng

4 clusters seems appropriate.  Let's visualise them.

In [22]:
import math
def map_clusters(postcodes, neighbourhoods, latitudes,
                 longitudes, cluster_labels):
    k = len(cluster_labels.unique())
    amap = folium.Map(location=[lat+0.07,lng], zoom_start=11)
    
    # set color scheme for the clusters
    x = np.arange(k)
    ys = [i + x + (i*x)**2 for i in range(k)]
    colors_array = mpl.cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [mpl.colors.rgb2hex(i) for i in colors_array]
    
    invalid=0
    for pc, nei, lt, ln, lab in zip(postcodes, neighbourhoods, latitudes,\
                                      longitudes, cluster_labels):
        if not math.isnan(lab):
            folium.CircleMarker([lt,ln],
                            radius = 5,
                            popup = f'{nei}\nCluster:{int(lab)}',
                            color=rainbow[int(lab)-1],
                            fill=True,
                            fill_color=rainbow[int(lab)-1],
                            fill_opacity = 0.6).add_to(amap)
    return amap

In [23]:
# n_jobs was removed from KMeans in scikit-learn 1.0, so it is omitted here
kmeans = KMeans(n_clusters=4, random_state=0).fit(standardised)
#Add cluster labels to full
full['labels'] = pd.Series(kmeans.labels_, index = encoded.index)
map_clusters(full.PostalCode, full.Neighbourhood,
             full.Latitude, full.Longitude, full.labels)

Nevertheless, the real problem is that, even after standardization, location data is being used.  Those downtown areas are always going to be close together!
They *may* still be similar to each other but in ways other than location.

In [24]:
rs = 42
for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, random_state=rs, n_init=100)\
        .fit(standardised.drop(['Latitude',
                                'Longitude',
                                'Venue Latitude',
                                'Venue Longitude'],
                               axis = 1))

    print(f'{k} clusters with counts: \
          {list(pd.Series(kmeans.labels_).value_counts())}.')

2 clusters with counts:           [84, 18].
3 clusters with counts:           [86, 11, 5].
4 clusters with counts:           [76, 13, 10, 3].
5 clusters with counts:           [73, 18, 8, 2, 1].
6 clusters with counts:           [47, 39, 10, 3, 2, 1].
7 clusters with counts:           [47, 38, 8, 3, 3, 2, 1].


Running this repeatedly usually led to k=3 (and occasionally k=4) giving the most useful clustering numbers.  More than this led to lots of 1- or 2-node clusters which probably means the algorithm is settling on points that are not really clusters but k-means **will** put them somewhere.  For example look at the 7 clusters - it looks like 2 or 3 real clusters and 4 k-means artefacts.
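
The "lots of 1- or 2-node clusters" observation can be tabulated from the counts printed above. This is a rough sketch, not a formal model-selection method: `tiny_clusters` is a helper written for this example and the size threshold of 2 is an assumption:

```python
# Cluster sizes copied from the printed output above (location columns dropped)
sizes_by_k = {
    2: [84, 18],
    3: [86, 11, 5],
    4: [76, 13, 10, 3],
    5: [73, 18, 8, 2, 1],
    6: [47, 39, 10, 3, 2, 1],
    7: [47, 38, 8, 3, 3, 2, 1],
}

def tiny_clusters(sizes, max_size=2):
    """Count clusters holding at most max_size points (threshold is an assumption)."""
    return sum(1 for n in sizes if n <= max_size)

for k, sizes in sizes_by_k.items():
    print(k, tiny_clusters(sizes))
```

On these counts the first clusters of one or two points appear at k=5, which fits with settling on 3 or 4 clusters.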

Let's try 4 clusters and visualise them.

In [25]:
kmeans = KMeans(n_clusters=4, random_state=rs, n_init=100)\
    .fit(standardised.drop(['Latitude',
                            'Longitude',
                            'Venue Latitude',
                            'Venue Longitude'],
                           axis = 1))
#Add cluster labels to full
full['labels'] = pd.Series(kmeans.labels_, index = encoded.index)
map_clusters(full.PostalCode, full.Neighbourhood,
             full.Latitude, full.Longitude, full.labels)

Analysis of clusters:

#### Cluster 0 (red)
These are mostly more central areas of the city just outside the downtown area - **the inner city**.

#### Cluster 1 (purple)
These are the outer areas of the city - **the suburbs**.

#### Cluster 2 (turquoise)
These are areas around the University - **the student area**.

#### Cluster 3 (light green)
This is the central area - **downtown** Toronto.
(They did turn out to be similar after all.)