## Introduction

In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto based on postal code and borough information. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its own way, so you need to learn to be agile and refine your ability to pick up new libraries and tools quickly, depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

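As a rough illustration of the scrape-then-structure idea described above, the sketch below parses a small hand-written HTML table using only Python's standard library. The sample markup, the `TableParser` class, and the resulting `records` list are assumptions made for illustration; the real Wikipedia page is laid out differently and is parsed with BeautifulSoup in this notebook.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real Wikipedia table is laid out differently.
SAMPLE_HTML = """
<table>
  <tr><th>PostalCode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows
        self._row = None        # cells of the row being read, or None
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


parser = TableParser()
parser.feed(SAMPLE_HTML)

# The first row holds the column names; the rest are data rows.
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
```

Each entry of `records` is a dictionary keyed by column name, which is the same shape of data that a dataframe constructor expects.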
### Import statements

In [235]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis; pd.json_normalize flattens JSON into a dataframe

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle HTTP requests

import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

# import BeautifulSoup to scrape the Wikipedia data
from bs4 import BeautifulSoup
print('Libraries imported.')

Libraries imported.


### Scraping and Wrangling Data

In [236]:
# URL of the Wikipedia page that holds the postal code data
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# Request the page and check that the request succeeded (status code 200)

response = requests.get(url)
print(response.status_code)

200


In [237]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(response.text, 'html.parser')
torontotable = soup.find('table', {'rules':'all'})

# find all table rows
table_rows = torontotable.find_all('tr')

FSA = []
for table in table_rows:
    for postcode,data in zip(table.find_all('b'), table.find_all('span')):
        rows_in_td = [postcode.text]

        for rows in data.find_all('a'):
            rows_in_td.append(rows.text)

        FSA.append(rows_in_td)

# keep only rows with at least three fields (postal code, borough, neighborhood),
# truncating longer rows to the first three
FSA = [row[:3] for row in FSA if len(row) >= 3]

# create a data frame from the list FSA

df = pd.DataFrame(FSA, columns = ["PostalCode", "Borough", "Neighborhood"])
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park |
| 3 | M6A | North York | Lawrence Manor |
| 4 | M9A | Etobicoke | Islington Avenue |
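The length filter applied to `FSA` in the cell above can be isolated as a small helper. This is a minimal sketch: `normalize_rows`, its `width` parameter, and the sample rows (including the made-up `M1A` entry and the extra fourth field) are illustrative assumptions, not part of the notebook.

```python
def normalize_rows(rows, width=3):
    """Drop rows with fewer than `width` fields and truncate longer rows."""
    return [row[:width] for row in rows if len(row) >= width]


sample = [
    ["M1A"],                                                     # too short: dropped
    ["M3A", "North York", "Parkwoods"],                          # kept unchanged
    ["M5A", "Downtown Toronto", "Regent Park", "Harbourfront"],  # truncated to 3
]
cleaned = normalize_rows(sample)
```

Building a new list avoids removing items from a list while iterating over it.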


In [238]:
# get the latitude and longitude of each Toronto postal code
lat_long_data = pd.read_csv('https://cocl.us/Geospatial_data')

# join the coordinates to the Toronto data by postal code

toronto_df = df.join(lat_long_data.set_index('Postal Code'), on='PostalCode')
toronto_df.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
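`DataFrame.join` performs a left join by default, so every row of `df` is kept and coordinates are attached where the postal code matches. The dependency-free sketch below shows the same idea with plain dictionaries; `left_join`, the `M0Z` row, and the use of `None` for missing values (pandas would use `NaN`) are assumptions for illustration.

```python
def left_join(rows, lookup, key):
    """Copy each row and add the fields stored under row[key], if any."""
    joined = []
    for row in rows:
        # Rows without a match keep their data and get empty coordinates.
        extra = lookup.get(row[key], {"Latitude": None, "Longitude": None})
        joined.append({**row, **extra})
    return joined


neighborhoods = [
    {"PostalCode": "M3A", "Neighborhood": "Parkwoods"},
    {"PostalCode": "M0Z", "Neighborhood": "Hypothetical"},  # no coordinates known
]
coordinates = {"M3A": {"Latitude": 43.753259, "Longitude": -79.329656}}

result = left_join(neighborhoods, coordinates, key="PostalCode")
```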


In [239]:
# get the coordinates of Toronto
address = 'Toronto, Ca'

geolocator = Nominatim(user_agent='ca_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Toronto is {} {}.'.format(latitude, longitude))

The geographical coordinate of Toronto is 43.6534817 -79.3839347.


In [240]:
# create map of Toronto using latitude and longitude values
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], 
                                           toronto_df['Longitude'], 
                                           toronto_df['Borough'], 
                                           toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
    [lat, lng],
    radius=5,
    popup=label,
    color='blue',
    fill=True,
    fill_color='#3186cc',
    fill_opacity=0.7,
    parse_html=False).add_to(toronto_map)  

toronto_map

In [241]:
# Define Foursquare credentials and version.
# The environment variable names below are assumptions; set them to your own credentials.
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'
LIMIT = 100


In [242]:
# Explore the first neighborhood in our data
toronto_df.loc[0, 'Neighborhood']

'Parkwoods'

In [243]:
# get the latitude and longitude of the neighborhood
neighborhood_latitude = toronto_df.loc[0, 'Latitude']
neighborhood_longitude = toronto_df.loc[0, 'Longitude']
# get the name of the neighborhood
neighborhood_name = toronto_df.loc[0, 'Neighborhood']

print('Latitude and longitude values of {} are {}, {}.'.format(
    neighborhood_name, 
    neighborhood_latitude,
    neighborhood_longitude
    )
)



Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


In [244]:
# Get the first 100 venues in Parkwoods within a 500-meter radius
radius = 500
url = ('https://api.foursquare.com/v2/venues/explore'
       '?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}').format(
    CLIENT_ID,
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
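An alternative to formatting the query string by hand is to let the standard library encode the parameters. The sketch below is one possible way to do that; `build_explore_url` is a hypothetical helper and the credentials are placeholders. The endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials, not real ones.
url = build_explore_url("MY_ID", "MY_SECRET", "20180605",
                        43.7532586, -79.3296565, 500, 100)

# Decode the query string again to inspect what would be sent.
query = parse_qs(urlsplit(url).query)
```

Because the parameters live in a dictionary, stray whitespace or a misplaced `&` cannot end up in the URL.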

In [245]:
# send the GET request and examine the results
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '6086625dd274d673708c208d'},
 'response': {'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.757758604500005,
    'lng': -79.32343823984928},
   'sw': {'lat': 43.7487585955, 'lng': -79.33587476015072}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
       'name': 'Brookbanks Park',
       'location': {'address': 'Toronto',
        'lat': 43.751976046055574,
        'lng': -79.33214044722958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.751976046055574,
          'lng': -79.33214044722958}],
        'distance': 245,
        'cc': 'CA'
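The `'distance': 245` field in the response above can be sanity-checked against the `radius` of 500 meters. The sketch below computes the great-circle distance with the haversine formula; `haversine_m` is a hypothetical helper, the mean Earth radius of 6,371 km is an assumption, and it is also an assumption that Foursquare measures from the query point.

```python
import math


def haversine_m(lat1, lng1, lat2, lng2, radius_m=6_371_000):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


# Parkwoods coordinates (from toronto_df) and Brookbanks Park (from the response)
d = haversine_m(43.7532586, -79.3296565, 43.751976046055574, -79.33214044722958)
within_radius = d <= 500
```

Under these assumptions the result should land close to the reported distance and well inside the search radius.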

In [246]:
# function that extracts the category of the venue
def get_category_type(row):
    """Return the name of the venue's first category, or None if it has none."""
    categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
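To see what `get_category_type` returns, it can be called on plain dictionaries that stand in for dataframe rows. The function is repeated here so the example runs on its own; the two sample rows are made up for illustration.

```python
def get_category_type(row):
    """Return the name of the venue's first category, or None if it has none."""
    categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Hypothetical rows: one venue with two categories, one with none.
with_category = {'venue.categories': [{'name': 'Park'}, {'name': 'Trail'}]}
without_category = {'venue.categories': []}

first = get_category_type(with_category)
missing = get_category_type(without_category)
```

Only the first category is used, so a venue listed under several categories is represented by a single name.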

In [247]:
venues = results['response']['groups'][0]['items']
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# extract the category name for each row
nearby_venues['categories_'] = nearby_venues.apply(get_category_type, axis=1)

# clean column names by keeping only the last part of each dotted name
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

# drop the raw category list and display the result
nearby_venues = nearby_venues.drop(['categories'], axis=1)
nearby_venues

|   | name | lat | lng | categories_ |
|---|---|---|---|---|
| 0 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 1 | 649 Variety | 43.754513 | -79.331942 | Convenience Store |
| 2 | Brookbanks Pool | 43.751389 | -79.332184 | Pool |
| 3 | Variety Store | 43.751974 | -79.333114 | Food & Drink Shop |
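The dotted column names such as `venue.location.lat` come from flattening the nested JSON with `pd.json_normalize`. The sketch below mimics that flattening for a single record with plain Python; `flatten` is a hypothetical helper, not the pandas implementation, and the record is a trimmed version of the first item in the response.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one dict with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value  # lists and scalars are kept as they are
    return flat


item = {
    "venue": {
        "name": "Brookbanks Park",
        "location": {"lat": 43.751976046055574, "lng": -79.33214044722958},
        "categories": [{"name": "Park"}],
    }
}
flat = flatten(item)
```

The `categories` list stays nested, which is why a separate step is needed to pull out the category name.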


In [248]:
# print the number of venues returned
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))


4 venues were returned by Foursquare.


In [249]:
# function to apply the same request to every neighborhood in Toronto
def getNearbyVenues(names, latitudes, longitudes, radius=500):

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = ('https://api.foursquare.com/v2/venues/explore'
               '?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}').format(
            CLIENT_ID,
            CLIENT_SECRET, 
            VERSION, 
            lat,
            lng, 
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
        name, 
        lat, 
        lng, 
        v['venue']['name'], 
        v['venue']['location']['lat'], 
        v['venue']['location']['lng'], 
        v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame(
    [item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                         'Neighborhood Latitude',
                         'Neighborhood Longitude',
                         'Venue',
                         'Venue Latitude', 
                         'Venue Longitude', 
                         'Venue Category']

    return nearby_venues

In [250]:
toronto_venues = getNearbyVenues(names=toronto_df['Neighborhood'],
                                 latitudes=toronto_df['Latitude'],
                                 longitudes=toronto_df['Longitude'])

Parkwoods
Victoria Village
Regent Park
Lawrence Manor
Islington Avenue
Malvern
Don Mills
Parkview Hill
Garden District
West Deane Park
Rouge Hill
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Markland Wood
Guildwood
Caledonia-Fairbanks
Woburn
Leaside
Bay Street
Cedarbrae
Hillcrest Village
Bathurst Manor
Thorncliffe Park
Richmond
Dovercourt Village
Scarborough Village
Henry Farm
Northwood Park
The Danforth
Harbourfront
Trinity
Kennedy Park
Bayview Village
Downsview
Riverdale
Toronto Dominion Centre
Parkdale Village
Golden Mile
York Mills
Downsview
The Beaches
Commerce Court
Maple Leaf Park
Humber Summit
Cliffside
Newtonbrook
Downsview
Bedford Park
Mount Dennis
Humberlea
Birch Cliff
Willowdale
Downsview
Runnymede
Weston
Dorset Park
York Mills
The Junction
Wexford
Willowdale
North Midtown
Roncesvalles
Kingsview Village
Agincourt
University of Toronto
Swansea
Tam O'Shanter
Summerhill
Kensington Market
Milliken
Rathnelly
CN Tower
New Toronto
South Steeles
Steeles
Rosedale
Ald

In [251]:
# check the size of the dataframe
print(toronto_venues.shape)
toronto_venues.head()

(1743, 7)


|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Parkwoods | 43.753259 | -79.329656 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 1 | Parkwoods | 43.753259 | -79.329656 | 649 Variety | 43.754513 | -79.331942 | Convenience Store |
| 2 | Parkwoods | 43.753259 | -79.329656 | Brookbanks Pool | 43.751389 | -79.332184 | Pool |
| 3 | Parkwoods | 43.753259 | -79.329656 | Variety Store | 43.751974 | -79.333114 | Food & Drink Shop |
| 4 | Victoria Village | 43.725882 | -79.315572 | Victoria Village Arena | 43.723481 | -79.315635 | Hockey Arena |


In [252]:
# find how many venues were returned for each neighborhood
toronto_venues.groupby('Neighborhood').count()

| Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Agincourt | 4 | 4 | 4 | 4 | 4 | 4 |
| Alderwood | 8 | 8 | 8 | 8 | 8 | 8 |
| Bathurst Manor | 23 | 23 | 23 | 23 | 23 | 23 |
| Bay Street | 61 | 61 | 61 | 61 | 61 | 61 |
| Bayview Village | 4 | 4 | 4 | 4 | 4 | 4 |
| ... | ... | ... | ... | ... | ... | ... |
| Wexford | 5 | 5 | 5 | 5 | 5 | 5 |
| Willowdale | 39 | 39 | 39 | 39 | 39 | 39 |
| Woburn | 3 | 3 | 3 | 3 | 3 | 3 |
| Woodbine Heights | 7 | 7 | 7 | 7 | 7 | 7 |


In [253]:
# find the number of unique categories among the returned venues
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 251 unique categories.


In [254]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", 
                                prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

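|   | Yoga Studio | Accessories Store | Adult Boutique | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | ... | Turkish Restaurant | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |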


In [255]:
# size of data frame
toronto_onehot.shape

(1743, 251)

In [256]:
# group rows by neighborhood. Take average frequency of categories

toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

|   | Neighborhood | Yoga Studio | Accessories Store | Adult Boutique | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | ... | Turkish Restaurant | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agincourt | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 1 | Alderwood | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 2 | Bathurst Manor | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 3 | Bay Street | 0.016393 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.016393 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.016393 | 0.0 | 0.0 | 0.0 |
| 4 | Bayview Village | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 72 | Wexford | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 73 | Willowdale | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.025641 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 74 | Woburn | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 75 | Woodbine Heights | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |


In [257]:
# get new size
toronto_grouped.shape

(77, 251)
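Taking the mean of one-hot columns per neighborhood yields, for each category, the share of that neighborhood's venues that fall into it. The sketch below computes the same quantity with the standard library to make the arithmetic explicit; `category_frequencies` is a hypothetical helper, and the input uses only the five venue rows shown earlier, so the Victoria Village share is not the value from the full data. Unlike the dataframe, categories that do not occur are simply omitted rather than stored as zero.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """Map each neighborhood to {category: share of its venues in that category}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for hood, counter in counts.items()
    }


# (neighborhood, venue category) pairs taken from the first rows of toronto_venues
venues = [
    ("Parkwoods", "Park"),
    ("Parkwoods", "Convenience Store"),
    ("Parkwoods", "Pool"),
    ("Parkwoods", "Food & Drink Shop"),
    ("Victoria Village", "Hockey Arena"),
]
freqs = category_frequencies(venues)
```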

In [258]:
num_top_venues = 5

# print each neighborhood's top 5 most common venues
for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('')

----Agincourt----
                       venue  freq
0                     Lounge  0.25
1  Latin American Restaurant  0.25
2             Breakfast Spot  0.25
3               Skating Rink  0.25
4              Luggage Store  0.00

----Alderwood----
            venue  freq
0     Pizza Place  0.25
1        Pharmacy  0.12
2    Skating Rink  0.12
3  Sandwich Place  0.12
4             Pub  0.12

----Bathurst Manor----
                       venue  freq
0                Coffee Shop  0.09
1                       Bank  0.09
2  Middle Eastern Restaurant  0.04
3                Supermarket  0.04
4        Fried Chicken Joint  0.04

----Bay Street----
                venue  freq
0         Coffee Shop  0.18
1      Sandwich Place  0.07
2  Italian Restaurant  0.05
3                Café  0.05
4     Bubble Tea Shop  0.03

----Bayview Village----
                 venue  freq
0                 Café  0.25
1  Japanese Restaurant  0.25
2                 Bank  0.25
3   Chinese Restaurant  0.25
4    Mobile Phone

In [259]:
# function to sort venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]
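The ranking step in `return_most_common_venues` can be illustrated without pandas. The sketch below sorts a dictionary of frequencies and keeps the first `n` names; `top_categories` is a hypothetical helper and the values are the rounded Alderwood frequencies printed earlier. Ties are kept in input order here, which may differ from the order pandas produces.

```python
def top_categories(freqs, n):
    """Return the n category names with the highest frequency, highest first."""
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


alderwood = {"Pharmacy": 0.12, "Pizza Place": 0.25, "Pub": 0.12, "Skating Rink": 0.12}
top2 = top_categories(alderwood, 2)
```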


In [260]:
# dataframe that displays the top 10 venues for each neighborhood
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe (outside the loop, once all column names exist)
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(
        toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
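The column labels ("1st Most Common Venue", "2nd Most Common Venue", and so on) can also be generated with a small ordinal helper instead of a `try`/`except`. This is a sketch of one possible approach; `ordinal` is a hypothetical function that also handles numbers beyond 10, which the cell above does not need.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
```

All column names are built before any dataframe is created, so the number of columns always matches the number of top venues.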

In [None]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(
    toronto_grouped_clustering)

# check cluster labels generated for the first 10 rows
kmeans.labels_[0:10]
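For intuition about what `KMeans` does with the frequency vectors, the sketch below implements the basic assign-and-update loop (Lloyd's algorithm) on four made-up 2-D points. It is an illustration only: the `kmeans` function, the points, and the fixed starting centroids are assumptions, and scikit-learn's implementation differs in initialization and other details.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    labels = []
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical data: two obvious groups near (0, 0) and (1, 1).
points = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
labels, centers = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
```

In the notebook each point is a neighborhood's vector of venue-category frequencies, so neighborhoods with a similar mix of venues end up with the same label.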

In [None]:
# new dataframe with cluster and top 10 venues for each neighborhood

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_df

# merge toronto_df with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# drop rows with missing values (for example, neighborhoods with no venues returned)
toronto_merged = toronto_merged.dropna()

toronto_merged = toronto_merged.astype({"Cluster Labels": int})
toronto_merged.head() # check the last columns
