# Neighborhoods of Toronto

###  *Obtained by scraping the Wikipedia page, wrangling and cleaning the data, and then reading it into a pandas dataframe so that it is in a structured format*


In [2]:
import requests
import pandas as pd
import io
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas.io.json import json_normalize # transform a JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/DSX-Python35

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    branca-0.3.1               |             py_0          25 KB  conda-forge
    altair-2.2.2               |           py35_1         462 KB  conda-forge
    openssl-1.0.2r             |       h14c3975_0         3.1 MB  conda-forge
    ca-certificates-2019.6.16  |       hecc5488_0         145 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    certifi-2018.8.24          |        py35_1001         139 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         4.0 MB

The following NEW packages will

In [3]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
#print(soup.prettify())

**Obtain only the table from the html data**

In [5]:
My_table = soup.find('table',{'class':'wikitable sortable'})
#My_table

**The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood**

In [6]:
df = pd.read_html(str(My_table), header=0)
df = pd.DataFrame(df[0])

**Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.**


In [7]:
df.drop(df[df.Borough == 'Not assigned'].index, inplace = True)


*More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.*
    

In [8]:
df=df.groupby("Postcode").agg(lambda x:','.join(set(x)))
df.reset_index(inplace = True)
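The `set`-based join above removes repeated names but does not guarantee their order, and it runs the Borough column through the same join. A minimal sketch of an order-preserving alternative follows; the frame `toy` is hypothetical sample data, not the scraped table.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped table (not the real data).
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M1B", "M1B"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Rouge", "Malvern"],
})

# Group on both keys so Borough is carried through unchanged, then join the
# neighbourhood names in their original row order, dropping repeats.
combined = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(lambda names: ",".join(dict.fromkeys(names)))
       .reset_index()
)
print(combined)
```

Grouping on both columns also avoids joining borough names together if a postal code were ever listed under two boroughs.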

    
If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.


In [9]:
df.loc[df.Neighbourhood == 'Not assigned', 'Neighbourhood'] = df.loc[df.Neighbourhood == 'Not assigned', 'Borough']

Use the .shape attribute to print the number of rows and columns of the dataframe.

In [10]:
df.shape

(103, 3)

In [11]:
df.head(10)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Highland Creek,Port Union,Rouge Hill"
2,M1E,Scarborough,"West Hill,Guildwood,Morningside"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park,Ionview,East Birchmount Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffside,Cliffcrest,Scarborough Village West"
9,M1N,Scarborough,"Cliffside West,Birch Cliff"


Get Geospatial data

In [12]:
url = 'http://cocl.us/Geospatial_data'
s = requests.get(url).content
geospatial_data = pd.read_csv(io.StringIO(s.decode('utf-8')))
geospatial_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Rename columns so that they match

In [13]:
df.columns = ['Postalcode', 'Borough', 'Neighbourhood']
geospatial_data.columns = ['Postalcode', 'Latitude', 'Longitude']

Merge both dataframes

In [14]:
# match rows on the postal code itself rather than on row position
neighborhood = pd.merge(df, geospatial_data, on='Postalcode')
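A merge of this kind is only reliable when rows are matched on the postal code rather than on row position. The sketch below, on two small hypothetical frames (`left`, `right`), uses the `validate` and `indicator` options of `pd.merge` to check for repeated keys and to list postal codes that found no coordinates.

```python
import pandas as pd

# Hypothetical frames; "M9Z" deliberately has no coordinates.
left = pd.DataFrame({
    "Postalcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
right = pd.DataFrame({
    "Postalcode": ["M1C", "M1B"],  # different row order on purpose
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Match on the key column; validate raises if a postal code repeats on either side.
checked = pd.merge(left, right, on="Postalcode", how="left",
                   validate="one_to_one", indicator=True)

# Rows tagged "left_only" found no matching coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postalcode"].tolist()
print(unmatched)
```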

In [15]:
neighborhood.head(10)

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Port Union,Rouge Hill",43.784535,-79.160497
2,M1E,Scarborough,"West Hill,Guildwood,Morningside",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park,Ionview,East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside,Cliffcrest,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Cliffside West,Birch Cliff",43.692657,-79.264848


Number of unique Boroughs in Toronto

In [16]:
print(len(neighborhood['Borough'].unique()))
print(neighborhood['Borough'].unique())

11
['Scarborough' 'North York' 'East York' 'East Toronto' 'Central Toronto'
 'Downtown Toronto' 'York' 'West Toronto' "Queen's Park" 'Mississauga'
 'Etobicoke']


**Create a map of Toronto**

In [17]:
neighborhood.columns

Index(['Postalcode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'], dtype='object')

In [18]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [20]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhoods in zip(neighborhood['Latitude'], neighborhood['Longitude'], neighborhood['Borough'], neighborhood['Neighbourhood']):
    label = '{}, {}'.format(neighborhoods, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Let us examine, segment and cluster only the neighborhoods of 'North York'**

In [21]:
neighborhood.groupby("Borough").count()

Unnamed: 0_level_0,Postalcode,Neighbourhood,Latitude,Longitude
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Central Toronto,9,9,9,9
Downtown Toronto,18,18,18,18
East Toronto,5,5,5,5
East York,5,5,5,5
Etobicoke,12,12,12,12
Mississauga,1,1,1,1
North York,24,24,24,24
Queen's Park,1,1,1,1
Scarborough,17,17,17,17
West Toronto,6,6,6,6


Since North York has the largest number of neighborhoods, let us segment and cluster the neighborhoods of North York.

In [22]:
northyork_data = neighborhood[neighborhood['Borough'] == 'North York'].reset_index(drop=True)

In [23]:
northyork_data.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M2H,North York,Hillcrest Village,43.803762,-79.363452
1,M2J,North York,"Oriole,Henry Farm,Fairview",43.778517,-79.346556
2,M2K,North York,Bayview Village,43.786947,-79.385975
3,M2L,North York,"York Mills,Silver Hills",43.75749,-79.374714
4,M2M,North York,"Newtonbrook,Willowdale",43.789053,-79.408493


Let's plot the North York map

In [24]:
address = 'North York, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York are 43.7708175, -79.4132998.


In [25]:
# create map of North York using latitude and longitude values
map_northyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhoods in zip(northyork_data['Latitude'], northyork_data['Longitude'], northyork_data['Borough'], northyork_data['Neighbourhood']):
    label = '{}, {}'.format(neighborhoods, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_northyork)  
    
map_northyork

Foursquare credentials and version

In [26]:
CLIENT_ID = 'LXRTQ4XM1403FUBFBJC5FV5IR4AYGWK4KNMBDC3ITQNW0XNX' # your Foursquare ID
CLIENT_SECRET = 'BRD5IUBAEQ55J3WPIJHSSP3CURLSYUURJKMA3BVSLUES0X4B' # your Foursquare Secret
VERSION = '20190131' # Foursquare API version

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius


print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: LXRTQ4XM1403FUBFBJC5FV5IR4AYGWK4KNMBDC3ITQNW0XNX
CLIENT_SECRET:BRD5IUBAEQ55J3WPIJHSSP3CURLSYUURJKMA3BVSLUES0X4B


Let's explore all neighborhoods in North York

In [27]:

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
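The chained lookup `["response"]['groups'][0]['items']` in the function above raises a `KeyError` or `IndexError` when a response carries no groups, and `['categories'][0]` fails for a venue without a category. A minimal sketch of a more tolerant parsing step follows. The helper name `extract_venue_rows` and both sample payloads are hypothetical, shaped only after the keys the function above reads; no network call is made.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style response into row tuples, tolerating gaps."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # An empty or missing category list becomes a single empty dict.
        categories = venue.get("categories") or [{}]
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name"),
        ))
    return rows


# Hypothetical payloads for illustration only.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Villa Madina",
               "location": {"lat": 43.801685, "lng": -79.363938},
               "categories": [{"name": "Mediterranean Restaurant"}]}},
    {"venue": {"name": "Unnamed spot", "location": {}, "categories": []}},
]}]}}
empty = {"response": {}}

rows = extract_venue_rows(sample, "Hillcrest Village", 43.803762, -79.363452)
no_rows = extract_venue_rows(empty, "Newtonbrook,Willowdale", 43.789053, -79.408493)
print(rows)
print(no_rows)
```

Missing fields come back as `None`, so they surface as empty cells in the resulting dataframe instead of stopping the loop partway through the neighborhoods.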

In [28]:
northyork_data.columns


Index(['Postalcode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'], dtype='object')

In [29]:
northyork_venues = getNearbyVenues(names=northyork_data['Neighbourhood'],
                                   latitudes=northyork_data['Latitude'],
                                   longitudes=northyork_data['Longitude']
                                  )

Hillcrest Village
Oriole,Henry Farm,Fairview
Bayview Village
York Mills,Silver Hills
Newtonbrook,Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park,Don Mills South
Wilson Heights,Downsview North,Bathurst Manor
York University,Northwood Park
CFB Toronto,Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Lawrence Manor East,Bedford Park
Lawrence Manor,Lawrence Heights
Glencairn
Downsview,Upwood Park,North Park
Humber Summit
Humberlea,Emery


In [30]:
northyork_venues.shape

(252, 7)

In [31]:
northyork_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hillcrest Village,43.803762,-79.363452,Eagle's Nest Golf Club,43.805455,-79.364186,Golf Course
1,Hillcrest Village,43.803762,-79.363452,AY Jackson Pool,43.804515,-79.366138,Pool
2,Hillcrest Village,43.803762,-79.363452,Villa Madina,43.801685,-79.363938,Mediterranean Restaurant
3,Hillcrest Village,43.803762,-79.363452,Duncan Creek Park,43.805539,-79.360695,Dog Run
4,"Oriole,Henry Farm,Fairview",43.778517,-79.346556,The LEGO Store,43.778207,-79.343483,Toy / Game Store


Let's find out how many unique venue categories there are

In [32]:
print('There are {} unique categories.'.format(len(northyork_venues['Venue Category'].unique())))

There are 112 unique categories.


Let us cluster based on neighborhoods

In [33]:
# one hot encoding
northyork_onehot = pd.get_dummies(northyork_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
northyork_onehot['Neighborhood'] = northyork_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [northyork_onehot.columns[-1]] + list(northyork_onehot.columns[:-1])
northyork_onehot = northyork_onehot[fixed_columns]

northyork_grouped = northyork_onehot.groupby('Neighborhood').mean().reset_index()
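To see what the grouped mean represents, here is a small sketch on hypothetical venue data (`venues`): each one-hot column is 1 for venues of that category, so averaging per neighborhood gives the share of that neighborhood's venues in each category.

```python
import pandas as pd

# Hypothetical venues: neighborhood "A" has four, "B" has two.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Pool", "Bank", "Bank", "Bank"],
})

# One indicator column per category, stored as floats so the mean is a share.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the indicators = fraction of the neighborhood's venues per category.
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```

Every row of `freq` sums to 1, which is why the values in the output above are fractions such as 0.25 and 0.333333.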

In [34]:
northyork_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,Bar,...,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Video Store,Vietnamese Restaurant,Wings Joint,Women's Store
0,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"CFB Toronto,Downsview East",0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Downsview Central,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Downsview Northwest,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Top common venues

In [35]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [36]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = northyork_grouped['Neighborhood']

for ind in np.arange(northyork_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(northyork_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Women's Store,Electronics Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Department Store
1,"CFB Toronto,Downsview East",Park,Airport,Other Repair Shop,Electronics Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Department Store
2,Don Mills North,Gym / Fitness Center,Caribbean Restaurant,Café,Baseball Field,Japanese Restaurant,Women's Store,Electronics Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega
3,Downsview Central,Korean Restaurant,Home Service,Food Truck,Baseball Field,Women's Store,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Department Store
4,Downsview Northwest,Grocery Store,Gym / Fitness Center,Athletics & Sports,Discount Store,Liquor Store,Electronics Store,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega
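The header loop above takes the suffix from a three-item list and falls back to "th", which is correct for the ten columns used here but would label position 21 as "21th". A hypothetical helper, `ordinal`, that handles larger counts could look like this:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    # 10 to 20 (and 110 to 120, ...) always take "th": 11th, 12th, 13th.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```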


In [37]:
# set number of clusters
kclusters = 20

northyork_grouped_clustering = northyork_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(northyork_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:24]

array([10,  4, 14,  8, 12, 11, 16,  0,  7,  5,  2,  1, 18, 19,  0,  9, 13,
       18, 15, 18,  6,  3, 17], dtype=int32)
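With 20 clusters for the 23 rows labelled above, most clusters contain a single neighborhood. One common heuristic for choosing a smaller k, which the notebook itself does not use, is to compare the k-means inertia over a range of k and look for the point where it stops dropping sharply. The sketch below runs on synthetic points (`X`) as a stand-in for the frequency matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three tight groups of eight 2-D points each.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.randn(8, 2) for c in centers])

# Fit k-means for several k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

On this toy data the inertia falls steeply until k reaches 3 and changes little afterwards; real venue frequencies may not show such a clear bend.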

In [38]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

northyork_merged = northyork_data

northyork_merged = northyork_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

northyork_merged.head() # check the last c

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,North York,Hillcrest Village,43.803762,-79.363452,5.0,Golf Course,Dog Run,Pool,Mediterranean Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Department Store
1,M2J,North York,"Oriole,Henry Farm,Fairview",43.778517,-79.346556,0.0,Clothing Store,Fast Food Restaurant,Coffee Shop,Restaurant,Toy / Game Store,Metro Station,Tea Room,Bakery,Kids Store,Japanese Restaurant
2,M2K,North York,Bayview Village,43.786947,-79.385975,10.0,Chinese Restaurant,Café,Bank,Japanese Restaurant,Women's Store,Electronics Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Department Store
3,M2L,North York,"York Mills,Silver Hills",43.75749,-79.374714,3.0,Cafeteria,Women's Store,Comfort Food Restaurant,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant
4,M2M,North York,"Newtonbrook,Willowdale",43.789053,-79.408493,,,,,,,,,,,
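The last row above has no cluster label because that neighborhood has no row in the venue table, so the join leaves its columns empty. Filling such gaps with 0 would place the row in the same group as the real cluster 0. A minimal sketch of two alternatives follows, using hypothetical frames (`areas`, `labels`): a sentinel value, or dropping unclustered rows.

```python
import pandas as pd

# Hypothetical frames: "C" returned no venues, so it has no cluster row.
areas = pd.DataFrame({"Neighbourhood": ["A", "B", "C"]})
labels = pd.DataFrame({"Neighborhood": ["A", "B"], "Cluster Labels": [5, 0]})

merged = areas.join(labels.set_index("Neighborhood"), on="Neighbourhood")

# Option 1: mark unclustered rows with -1 instead of a real cluster id.
flagged = merged.copy()
flagged["Cluster Labels"] = flagged["Cluster Labels"].fillna(-1).astype(int)

# Option 2: keep only the rows that were actually clustered.
clustered = merged.dropna(subset=["Cluster Labels"]).astype({"Cluster Labels": int})

print(flagged)
print(clustered)
```

With the sentinel approach, the plotting loop would need a separate color for -1, for example a neutral gray.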


In [39]:
# neighborhoods with no venues have no label; they are assigned to cluster 0 here so the column can be cast to int
northyork_merged['Cluster Labels'] = northyork_merged['Cluster Labels'].fillna(0).astype(int)

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(northyork_merged['Latitude'], northyork_merged['Longitude'], northyork_merged['Neighbourhood'], northyork_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters