# Segmenting and Clustering Neighborhoods in Toronto
## P2P notebook
### Applied Data Science Capstone - Week 3

Source data: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

The goal is to scrape the table of postal codes on that page and transform it into a pandas dataframe.

# First part

In [3]:
import pandas as pd
import numpy as np
import requests

from bs4 import BeautifulSoup
print("All is initialized...")

All is initialized...


The dataframe must meet the following requirements:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
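
The rules above can be tried out on a few hand-written rows before touching the real page. The sketch below is only an illustration: the toy rows are made up for the example, and the column names follow the scraped table (`Postcode`, `Neighbourhood`) rather than the spelling used in the requirements.

```python
import pandas as pd

# Toy rows standing in for the scraped Wikipedia table (illustrative values only).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Rule 1: keep only rows with an assigned borough.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighbourhood takes the borough's name.
missing = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]

# Rule 3: merge rows sharing a postal code into one comma-separated row.
cleaned = (
    assigned.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(cleaned)
print(cleaned.shape)
```

Here `cleaned.shape[0]` gives the number of rows that the last requirement asks for.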

__Fetch the source data:__

In [4]:
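from io import StringIO  # newer pandas versions expect a file-like object in read_html

source_data = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
source_data_lxml = BeautifulSoup(source_data.content, 'lxml')
#print(source_data_lxml)

# the first table on the page holds the postal codes
toronto_table = source_data_lxml.find_all('table')[0]
toronto_df = pd.read_html(StringIO(str(toronto_table)))[0]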

#toronto_df
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


__Clean the data:__

Remove the rows where the 'Borough' column is 'Not assigned':

In [7]:
toronto_df= toronto_df[toronto_df['Borough'] != 'Not assigned']
toronto_df.head()
#toronto_df

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


Group by 'Postcode' and 'Borough', joining the neighbourhoods that share a postal code with commas:

In [13]:
toronto_df_grouped = toronto_df.groupby(['Postcode','Borough'], sort=False).agg(lambda x: ', '.join(x))
toronto_df_grouped.reset_index(level=['Postcode', 'Borough'], inplace=True)
#toronto_df_grouped
toronto_df_grouped.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park


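Where a neighbourhood is still 'Not assigned', replace it with the name of its borough:

In [18]:
# rows whose neighbourhood is missing take the borough name instead
not_assigned = toronto_df_grouped['Neighbourhood'] == 'Not assigned'
toronto_df_grouped.loc[not_assigned, 'Neighbourhood'] = toronto_df_grouped.loc[not_assigned, 'Borough']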
toronto_df_final=toronto_df_grouped
#toronto_df_final
toronto_df_final.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park


# Second part
### Take longitude/latitude using geocoder

install geocoder:

In [20]:
! pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
     |████████████████████████████████| 102kB 18.0MB/s eta 0:00:01
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [35]:
# import the geocoder library
import geocoder

Get the latitude and longitude of each postal code:

In [36]:
toronto_loc=toronto_df_final
#print(geocoder.arcgis('M3A, Toronto, Ontario').latlng[0])
#print(geocoder.arcgis('M3A, Toronto, Ontario').latlng[1])
    

for enum, postcode in enumerate(toronto_loc['Postcode']):
    loc=geocoder.arcgis(f'{postcode}, Toronto, Ontario')
    toronto_loc.at[enum, 'Latitude'] = loc.latlng[0]
    toronto_loc.at[enum, 'Longitude'] = loc.latlng[1]

#toronto_loc    
toronto_loc.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75242,-79.329242
1,M4A,North York,Victoria Village,43.7306,-79.313265
2,M5A,Downtown Toronto,Harbourfront,43.650295,-79.359166
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.72327,-79.451286
4,M7A,Downtown Toronto,Queen's Park,43.66115,-79.391715
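
A geocoding service can occasionally return no coordinates for a query, in which case indexing `latlng` would fail. A minimal sketch of a retry wrapper follows. It assumes that a failed lookup yields `None` or an empty list, and the helper name `geocode_with_retry` and the fake `flaky_lookup` are hypothetical, introduced only so the example runs without network access.

```python
import time

def geocode_with_retry(lookup, query, attempts=3, pause=0.0):
    """Call lookup(query) until it yields a (lat, lng) pair or attempts run out."""
    for _ in range(attempts):
        latlng = lookup(query)
        if latlng:  # covers both None and an empty list
            return latlng[0], latlng[1]
        time.sleep(pause)
    return None, None

# Stand-in for a real geocoder call: fails once, then succeeds.
calls = {"n": 0}

def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] == 1 else [43.75242, -79.329242]

lat, lng = geocode_with_retry(flaky_lookup, "M3A, Toronto, Ontario")
print(lat, lng)
```

With geocoder installed, `lookup` could be something like `lambda q: geocoder.arcgis(q).latlng`, and a small `pause` would keep the requests polite.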


# Third part

Explore and cluster the neighborhoods in Toronto

Install and import the map library:

In [38]:
!conda install -c conda-forge folium=0.5.0 --yes # install folium; skip this line if it is already installed
import folium # map rendering library

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    branca-0.4.0               |             py_0          26 KB  conda-forge
    certifi-2019.11.28         |   py36h9f0ad1d_1         149 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    altair-4.0.1               |             py_0         575 KB  conda-forge
    ------------------------------------------------------------
                       

In [44]:
!conda install -c conda-forge geopy --yes # install geopy; skip this line if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          92 KB

The following NEW packages will be INSTALLED:

    geographiclib: 1.50-py_0   conda-forge
    geopy:         1.21.0-py_0 conda-forge


Downloading and Extracting Packages
geographiclib-1.50   | 34 KB     | ##################################### | 100% 
geopy-1.21.0         | 58 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [46]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


Create a map of Toronto:

In [48]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_loc['Latitude'], toronto_loc['Longitude'], toronto_loc['Borough'], toronto_loc['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Define Foursquare Credentials and Version
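
The credentials themselves are not shown in this notebook. One way to keep them out of the code is to read them from environment variables, as in the sketch below. The variable names (`FOURSQUARE_CLIENT_ID` and so on), the helper `load_foursquare_credentials`, and the default version date are all assumptions made for illustration.

```python
import os

def load_foursquare_credentials(env=None):
    """Read the Foursquare settings from environment-style variables."""
    env = os.environ if env is None else env
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set")
    # The version is a date string (YYYYMMDD); this default is only a placeholder.
    version = env.get("FOURSQUARE_VERSION", "20180605")
    return client_id, client_secret, version

# Demonstration with a plain dict instead of the real environment.
CLIENT_ID, CLIENT_SECRET, VERSION = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
)
print(VERSION)
```

Calling `load_foursquare_credentials()` with no argument reads from the real environment and raises an error if the values are missing.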

I will use the function we defined earlier in the labs.

In [51]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
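
The request URL in the function above is assembled with `str.format`. As an alternative sketch, the same query string can be built with `urllib.parse.urlencode`, which also escapes special characters. The helper name `build_explore_url` and the sample credential values are hypothetical; the endpoint and parameter names are the ones used in the function above.

```python
from urllib.parse import urlencode, urlparse, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore request URL for one neighbourhood centre."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials; the coordinates are those of M6H from the table below.
url = build_explore_url("my-id", "my-secret", "20180605", 43.665087, -79.438705)
query = parse_qs(urlparse(url).query)
print(query["ll"], query["radius"], query["limit"])
```

Passing the same dictionary as `requests.get(EXPLORE_ENDPOINT, params=params)` would have a similar effect without building the string by hand.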

I chose the 'West Toronto' borough:

In [61]:
WT =  toronto_loc['Borough']=="West Toronto"
WT = toronto_loc[WT].reset_index()
WT=WT[WT.Latitude != '']
WT


Unnamed: 0,index,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,31,M6H,West Toronto,"Dovercourt Village, Dufferin",43.665087,-79.438705
1,37,M6J,West Toronto,"Little Portugal, Trinity",43.648525,-79.417757
2,43,M6K,West Toronto,"Brockton, Exhibition Place, Parkdale Village",43.63941,-79.424362
3,69,M6P,West Toronto,"High Park, The Junction South",43.659935,-79.463019
4,75,M6R,West Toronto,"Parkdale, Roncesvalles",43.64787,-79.449776
5,81,M6S,West Toronto,"Runnymede, Swansea",43.64962,-79.476141


In [62]:
WT_Venues = getNearbyVenues(names=WT['Neighbourhood'],
                                   latitudes=WT['Latitude'],
                                   longitudes=WT['Longitude']
                                  )

Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The Junction South
Parkdale, Roncesvalles
Runnymede, Swansea


In [63]:
print(WT_Venues.shape)
WT_Venues.head()

(233, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Dovercourt Village, Dufferin",43.665087,-79.438705,Rosie Robin A Touch Of Convenience,43.663182,-79.435427,Café
1,"Dovercourt Village, Dufferin",43.665087,-79.438705,The Greater Good Bar,43.669409,-79.439267,Bar
2,"Dovercourt Village, Dufferin",43.665087,-79.438705,Parallel,43.669516,-79.438728,Middle Eastern Restaurant
3,"Dovercourt Village, Dufferin",43.665087,-79.438705,Happy Bakery & Pastries,43.66705,-79.441791,Bakery
4,"Dovercourt Village, Dufferin",43.665087,-79.438705,FreshCo,43.667918,-79.440754,Grocery Store


In [65]:
print('There are {} unique categories.'.format(len(WT_Venues['Venue Category'].unique())))

There are 99 unique categories.


One-hot encoding, following the New York example:

In [71]:
# one hot encoding
TO_onehot = pd.get_dummies(WT_Venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
TO_onehot['Neighbourhood'] = WT_Venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [TO_onehot.columns[-1]] + list(TO_onehot.columns[:-1])
TO_onehot = TO_onehot[fixed_columns]

TO_onehot.head()

Unnamed: 0,Neighbourhood,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bakery,Bank,Bar,Beer Bar,...,Sushi Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Dovercourt Village, Dufferin",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Dovercourt Village, Dufferin",0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,"Dovercourt Village, Dufferin",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Dovercourt Village, Dufferin",0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Dovercourt Village, Dufferin",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [72]:
TO_onehot.shape

(233, 100)

In [74]:
WT_grouped = TO_onehot.groupby('Neighbourhood').mean().reset_index()
WT_grouped

Unnamed: 0,Neighbourhood,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bakery,Bank,Bar,Beer Bar,...,Sushi Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Brockton, Exhibition Place, Parkdale Village",0.0,0.014706,0.014706,0.0,0.0,0.044118,0.0,0.044118,0.014706,...,0.0,0.014706,0.014706,0.0,0.014706,0.0,0.029412,0.014706,0.0,0.0
1,"Dovercourt Village, Dufferin",0.0,0.0,0.0,0.0,0.0,0.066667,0.066667,0.066667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"High Park, The Junction South",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Little Portugal, Trinity",0.019231,0.019231,0.0,0.057692,0.0,0.038462,0.0,0.076923,0.0,...,0.0,0.0,0.0,0.0,0.019231,0.0,0.0,0.038462,0.057692,0.019231
4,"Parkdale, Roncesvalles",0.038462,0.0,0.019231,0.0,0.019231,0.038462,0.019231,0.019231,0.0,...,0.038462,0.0,0.0,0.038462,0.0,0.019231,0.0,0.019231,0.0,0.0
5,"Runnymede, Swansea",0.0,0.0,0.0,0.0,0.0,0.068182,0.045455,0.0,0.0,...,0.022727,0.0,0.022727,0.022727,0.0,0.0,0.022727,0.0,0.0,0.0


In [75]:
WT_grouped.shape

(6, 100)
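
To see why the mean of the one-hot columns gives a frequency per category, it may help to run the same two steps on a tiny made-up table. The neighbourhood names `A` and `B` and the venue rows below are illustrative only.

```python
import pandas as pd

# Four toy venues: three in neighbourhood A, one in neighbourhood B.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Café", "Café", "Park", "Bar"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# The mean of a 0/1 column is the share of venues in that category.
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

Each row of `freq` sums to 1, because every venue falls into exactly one category.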

Check the top 5 venue categories per neighbourhood:

In [84]:
num_top_venues = 5

for hood in WT_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = WT_grouped[WT_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Brockton, Exhibition Place, Parkdale Village----
                    venue  freq
0             Coffee Shop  0.10
1              Restaurant  0.07
2                    Café  0.07
3  Furniture / Home Store  0.06
4                  Bakery  0.04


----Dovercourt Village, Dufferin----
                    venue  freq
0  Furniture / Home Store  0.13
1                    Park  0.13
2    Gym / Fitness Center  0.07
3                     Gym  0.07
4                    Café  0.07


----High Park, The Junction South----
                 venue  freq
0    Convenience Store   0.5
1                 Park   0.5
2  American Restaurant   0.0
3        Movie Theater   0.0
4          Pizza Place   0.0


----Little Portugal, Trinity----
              venue  freq
0               Bar  0.08
1       Coffee Shop  0.08
2          Wine Bar  0.06
3  Asian Restaurant  0.06
4      Cocktail Bar  0.06


----Parkdale, Roncesvalles----
                         venue  freq
0                  Coffee Shop  0.08
1  Eastern E

In [85]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
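
As a side note, pandas also offers `Series.nlargest`, which may express the same ranking more compactly. The sketch below is an alternative under the assumption that the frequencies can be cast to float; the toy row is made up for the example.

```python
import pandas as pd

# One toy row shaped like a row of the grouped frequency table.
row = pd.Series({"Neighbourhood": "A", "Bar": 0.1, "Café": 0.5, "Park": 0.4})

# Drop the name, cast to float (the mixed row is stored as object dtype), rank.
freqs = row.iloc[1:].astype(float)
top = freqs.nlargest(2).index.tolist()
print(top)
```

Categories with equal frequency may come out in a different order than with `sort_values`, so the two approaches can disagree on ties.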

In [86]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no suffix listed for this position, so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = WT_grouped['Neighbourhood']

for ind in np.arange(WT_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(WT_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Café,Restaurant,Furniture / Home Store,Bar,Bakery,Vegetarian / Vegan Restaurant,Gym,Supermarket,Italian Restaurant
1,"Dovercourt Village, Dufferin",Park,Furniture / Home Store,Pharmacy,Grocery Store,Pool,Middle Eastern Restaurant,Café,Smoke Shop,Gym,Gym / Fitness Center
2,"High Park, The Junction South",Convenience Store,Park,Flower Shop,Dance Studio,Deli / Bodega,Dessert Shop,Diner,Eastern European Restaurant,Ethiopian Restaurant,Event Space
3,"Little Portugal, Trinity",Bar,Coffee Shop,Wine Bar,Restaurant,Asian Restaurant,Cocktail Bar,Vietnamese Restaurant,Bakery,Pizza Place,Korean Restaurant
4,"Parkdale, Roncesvalles",Coffee Shop,Eastern European Restaurant,Restaurant,American Restaurant,Café,Food & Drink Shop,Sushi Restaurant,Bookstore,Gift Shop,Thai Restaurant


Group the neighbourhoods into 4 clusters based on venue similarity:

In [87]:
# import k-means from clustering stage
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 4

WT_grouped_clustering = WT_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(WT_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 2, 1, 3, 0, 0], dtype=int32)
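
With 6 neighbourhoods and 4 clusters, three of the clusters in the output above hold a single neighbourhood each. The toy sketch below, with made-up two-column frequency vectors, shows the behaviour on data where the grouping is obvious. The label numbers themselves are arbitrary; only which rows share a label matters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency vectors: rows 0-1 are cafe-heavy, rows 2-3 are park-heavy.
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# n_init is set explicitly because its default differs between scikit-learn versions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(labels)
```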

In [88]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

WT_merged = WT

# join the cluster labels and top venues onto the West Toronto dataframe, which holds latitude/longitude for each neighbourhood
WT_merged = WT_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

WT_merged.head() # check the last columns!

Unnamed: 0,index,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,31,M6H,West Toronto,"Dovercourt Village, Dufferin",43.665087,-79.438705,2,Park,Furniture / Home Store,Pharmacy,Grocery Store,Pool,Middle Eastern Restaurant,Café,Smoke Shop,Gym,Gym / Fitness Center
1,37,M6J,West Toronto,"Little Portugal, Trinity",43.648525,-79.417757,3,Bar,Coffee Shop,Wine Bar,Restaurant,Asian Restaurant,Cocktail Bar,Vietnamese Restaurant,Bakery,Pizza Place,Korean Restaurant
2,43,M6K,West Toronto,"Brockton, Exhibition Place, Parkdale Village",43.63941,-79.424362,0,Coffee Shop,Café,Restaurant,Furniture / Home Store,Bar,Bakery,Vegetarian / Vegan Restaurant,Gym,Supermarket,Italian Restaurant
3,69,M6P,West Toronto,"High Park, The Junction South",43.659935,-79.463019,1,Convenience Store,Park,Flower Shop,Dance Studio,Deli / Bodega,Dessert Shop,Diner,Eastern European Restaurant,Ethiopian Restaurant,Event Space
4,75,M6R,West Toronto,"Parkdale, Roncesvalles",43.64787,-79.449776,0,Coffee Shop,Eastern European Restaurant,Restaurant,American Restaurant,Café,Food & Drink Shop,Sushi Restaurant,Bookstore,Gift Shop,Thai Restaurant


## Final Visualization:

In [90]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[43.7337, -79.5175], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(WT_merged['Latitude'], WT_merged['Longitude'], WT_merged['Neighbourhood'], WT_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
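
The cell above indexes the palette with `cluster-1`, so label 0 wraps around to the last colour. That still gives each cluster its own colour, but it can be confusing to read. A minimal alternative sketch, using only the standard library and indexing the palette directly by label, is shown below. The helper `cluster_palette` and the sample labels are hypothetical, and this is not the palette the notebook uses.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters distinct hex colours, evenly spaced around the hue wheel."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(4)

# Sample cluster labels; each marker takes the colour at its own label's index.
labels = [2, 3, 0, 1, 0]
marker_colours = [palette[label] for label in labels]
print(marker_colours)
```

The resulting hex strings can be passed to the `color` and `fill_color` arguments of `folium.CircleMarker` in the same way as the `rainbow` list.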