# First phase notebook: Segmenting and Clustering Neighborhoods in Toronto
TOC to be completed later

In [392]:
# importing libraries
! pip install lxml html5lib beautifulsoup4
import pandas as pd
import numpy as np
import requests
from pandas.io.json import json_normalize



You should consider upgrading via the 'c:\users\king aron\anaconda3\python.exe -m pip install --upgrade pip' command.


## 1st step, importing the dataset
in this step the dataset is read using pandas library. Then its 5 first row printed. The desired table is stored in the first table of url.

In [6]:
# importing dataset
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


## 2nd step, cleaning and forming the dataset
According to the provided instruction, unique postal codes are analysed. Becasue the number of unique codes are the same of the current recodes, there is no need of merging or combining rows. In the next stage, Borough without assigned values are deleted. Then neigbourhoods without assigned value are investigated. Becasue there are no rows with such a specification, no cell is replaced with its borough. Finally, the shape of the dataset is printed and the last 5 rows are shown.

In [7]:
# cleaning and forming the dataset
print('The dataset includes {} records with {} unique postal codes \n'.format(len(df) , len(df['Postal Code'].unique())))
# igonring cells that Borough is not assigned
df = df[df['Borough'] != 'Not assigned']
df.reset_index(inplace = True, drop = True)
print('Aftering deleting rows without assigned boroughs, the number of records reduced to {} \n'.format(len(df)))
# assigning Borough to Neighbourhood where Neighbourhood is 'Not assigned'
n_na_neighbour = df['Neighbourhood'][df['Neighbourhood'] == 'Not assigned'].count()
print('After correcting NA boroughs, {} neighbourhoods found without assigned value \n'.format(n_na_neighbour))
print('the final shape of the dataset is {} \n'.format(df.shape))
df.tail()

The dataset includes 180 records with 180 unique postal codes 

Aftering deleting rows without assigned boroughs, the number of records reduced to 103 

After correcting NA Boroughs, 0 neighbourhoods found without assigned value 

the final shape of the dataset is (103, 3) 



Unnamed: 0,Postal Code,Borough,Neighbourhood
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
102,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


## 3rd step, transforming the database
In the next phase of the project, each neighbourhood's coordinates should be found. So, having their name stored in a single cell is not desirable. The ideal form of dataset is having neighbourhood name in a cell, preferably set as index.

In [65]:
# creating a dataset setting each neighbourhood in one row
dfn = pd.DataFrame(columns = ['Postal Code', 'Borough', 'Neighbourhood'])
for nn in range(0, len(df) - 1):
    borough = df['Borough'].iloc[nn]
    post_code = df['Postal Code'].iloc[nn]
    neighbourhoods = df['Neighbourhood'].iloc[nn].split(', ')
    for neighbourhood in neighbourhoods: 
        dfn_add = pd.DataFrame({'Borough': [borough], 'Postal Code': [post_code], 'Neighbourhood' : [neighbourhood]})
        dfn = dfn.append(dfn_add, ignore_index=True)
print('the dataset includes {} neighbourhoods \n'.format(len(dfn)))
dfn.tail()

the dataset includes 212 neighbourhoods 



Unnamed: 0,Postal Code,Borough,Neighbourhood
207,M8Y,Etobicoke,Humber Bay
208,M8Y,Etobicoke,Mimico NE
209,M8Y,Etobicoke,The Queensway East
210,M8Y,Etobicoke,Royal York South East
211,M8Y,Etobicoke,Kingsway Park South East


## 4th step, finding coordinates
According to the provided instructions of the assignment, geocoder is used in a while loop to find the corresponding long/lat of each rows in the newly transformed dataset. Unfortunately, it has not ended to any plausible result. So, I used instead geopy which made it possible. Two columns have been added to the new dataset.
there are several differences which made the code into work:
1. using geopy, Nominatim
2. passing GeocoderTimedOut for avoiding errors of timing out
3. setting a search limit for a neighbourhood
4. using sleep of 1 sec for avoiding server runtime limit block
5. passing a random symbolic password
6. random ordering of address 
<br>

Finally, geocoder fails to locate some neibourhoods. These records should be handled manually.

In [173]:
!pip install geopy
import geopy
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut



You should consider upgrading via the 'c:\users\king aron\anaconda3\python.exe -m pip install --upgrade pip' command.


In [261]:
def do_geocode(address):
    geopy = Nominatim(user_agent="aron.shirazi@gmail.com")
    try:
        sleep(1)
        return geopy.geocode(address)
    except GeocoderTimedOut:
        return do_geocode(address)

dfn['latitude'] = 'NA'
dfn['longitude'] = 'NA'
max_try = 10
for nn in range(0, len(dfn)):
    neighbourhood = dfn['Neighbourhood'].iloc[nn]
    location = None
    count = 0
    while (location == None) & (count < max_try):
        password = ''.join(random.choice(['#', '$', '%', '@', '*', '-', '&', '~', '!']) for i in range(8))
        address_list = [neighbourhood, 'Toronto', 'Ontario', password]
        order = ''.join(random.sample(['0', '1', '2', '3'], 4))
        n0 = int(order[0]); n1 = int(order[1]); n2 = int(order[2]); n3 = int(order[3])
        address = '{}, {}, {}, {}'.format(address_list[n0], address_list[n1], address_list[n2], address_list[n3])
        location = do_geocode(address)
        count += 1
    if location is not None:
        print('{}, coordinates found for {}'.format(nn, neighbourhood))
        dfn['latitude'].iloc[nn] = location.latitude
        dfn['longitude'].iloc[nn] = location.longitude
    else:
        print('{}, coordinates not found for {}'.format(nn, neighbourhood))

0, coordinates found for Parkwoods
1, coordinates found for Victoria Village
2, coordinates found for Regent Park
3, coordinates found for Harbourfront
4, coordinates found for Lawrence Manor
5, coordinates found for Lawrence Heights
6, coordinates found for Queen's Park
7, coordinates not found for Ontario Provincial Government
8, coordinates found for Islington Avenue
9, coordinates found for Humber Valley Village
10, coordinates found for Malvern
11, coordinates found for Rouge
12, coordinates found for Don Mills
13, coordinates found for Parkview Hill
14, coordinates found for Woodbine Gardens
15, coordinates found for Garden District
16, coordinates found for Ryerson
17, coordinates found for Glencairn
18, coordinates found for West Deane Park
19, coordinates found for Princess Gardens
20, coordinates found for Martin Grove
21, coordinates found for Islington
22, coordinates found for Cloverdale
23, coordinates found for Rouge Hill
24, coordinates found for Port Union
25, coordina

In [349]:
# finding unlocated neighbourhoods to set the m manually
dfn['latitude'][~dfn['latitude'].apply(np.isreal)] = '0'
dfn['longitude'][~dfn['longitude'].apply(np.isreal)] = '0'
dfn['latitude'] = dfn['latitude'].astype('float', errors='ignore')
dfn['longitude'] = dfn['longitude'].astype('float', errors='ignore')
dfn.dtypes

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Postal Code       object
Borough           object
Neighbourhood     object
latitude         float64
longitude        float64
dtype: object

In [355]:
dfo = pd.read_csv('Geospatial_Coordinates.csv')
for nn in dfn[dfn['longitude'] == 0].index:
    dfn['latitude'].iloc[nn] = float(dfo['Latitude'][dfo['Postal Code'] == dfn['Postal Code'].iloc[nn]])
    dfn['longitude'].iloc[nn] = float(dfo['Longitude'][dfo['Postal Code'] == dfn['Postal Code'].iloc[nn]])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


In [367]:
dfn.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,latitude,longitude
0,M3A,North York,Parkwoods,43.7588,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,Regent Park,43.660706,-79.360457
3,M5A,Downtown Toronto,Harbourfront,43.64008,-79.38015
4,M6A,North York,Lawrence Manor,43.722079,-79.437507
5,M6A,North York,Lawrence Heights,43.722778,-79.450933
6,M7A,Downtown Toronto,Queen's Park,43.659659,-79.39034
7,M7A,Downtown Toronto,Ontario Provincial Government,43.662301,-79.389494
8,M9A,Etobicoke,Islington Avenue,43.606597,-79.506456
9,M9A,Etobicoke,Humber Valley Village,43.666472,-79.524314


In [368]:
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

In [381]:
# finding the center of map for illustration purpuse
df_loc = dfn[dfn['Neighbourhood'] != 'South Niagara'] # South Niagra is far away and makes our analysis inefficient so it is omitted
center_lat = df_loc['latitude'].mean()
center_lon = df_loc['longitude'].mean()
# to set boundaries of folium
lat_min = df_loc['latitude'].min()
lat_max = df_loc['latitude'].max()
lon_min = df_loc['longitude'].min()
lon_max = df_loc['longitude'].max()

In [384]:
map_toronto = folium.Map(location=[center_lat, center_lon], width=750, height=500)
map_toronto.fit_bounds([[lat_min, lon_min], [lat_max, lon_max]])
# add markers to map
for lat, lng, label in zip(df_loc['latitude'], df_loc['longitude'], df_loc['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
map_toronto

In [467]:
# Defining Foursquare Credentials and Version
# @hidden_cell
CLIENT_ID = 'XRZP0Y5N1UHOVN30BRXKVHXNHP2DKVOPKWAUHZUG32VQTLPA' # your Foursquare ID
CLIENT_SECRET = 'YDJPYTGZTRMKPD21E5HQQ2NDSKQM1DTRLM3RR0NPICCZN3XZ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [451]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

def find_venue(lat, lon, limit = 100, radius = 500):
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    lat, 
    lon, 
    radius, 
    limit)
    results = requests.get(url).json()
    venues = results['response']['groups'][0]['items']
    nearby_venues = None
    if len(venues) > 0:
        nearby_venues = pd.json_normalize(venues) # flatten JSON
        # filter columns
        filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
        nearby_venues =nearby_venues.loc[:, filtered_columns]
        # filter the category for each row
        nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
        # clean columns
        nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
    return nearby_venues

In [457]:
df_venues = pd.DataFrame(columns = ['name', 'categories', 'lat', 'lng'])
nn = 0
for name, lat, lng in zip(df_loc['Neighbourhood'], df_loc['latitude'], df_loc['longitude']):
    df_tr = find_venue(lat, lon)
    if df_tr is None: 
        len_found = 0 
    else: 
        len_found = len(df_tr)
        df_tr['neighbourhood'] = name
    print('{}, venues of {} explored at lat: {} and long: {}, with {} venues'.format(nn, name, lat, lng, len_found))
    df_venues = pd.concat([df_venues, df_tr])
    nn += 1
df_venues.reset_index(inplace = True, drop = True)
print('venues of Toronto are explored, the dataset shape is {} \n'.format(df_venues.shape))
df_venues

0, venues of Parkwoods explored at lat: 43.7587999 and long: -79.3201966, with 18 venues
1, venues of Victoria Village explored at lat: 43.732658 and long: -79.3111892, with 4 venues
2, venues of Regent Park explored at lat: 43.6607056 and long: -79.3604569, with 0 venues
3, venues of Harbourfront explored at lat: 43.6400801 and long: -79.3801495, with 0 venues
4, venues of Lawrence Manor explored at lat: 43.7220788 and long: -79.4375067, with 2 venues
5, venues of Lawrence Heights explored at lat: 43.7227784 and long: -79.4509332, with 3 venues
6, venues of Queen's Park explored at lat: 43.659659 and long: -79.3903399, with 0 venues
7, venues of Ontario Provincial Government explored at lat: 43.6623015 and long: -79.3894938, with 0 venues
8, venues of Islington Avenue explored at lat: 43.6065969 and long: -79.5064562, with 0 venues
9, venues of Humber Valley Village explored at lat: 43.6664717 and long: -79.5243136, with 0 venues
10, venues of Malvern explored at lat: 43.8091955 and l

Unnamed: 0,name,categories,lat,lng,neighbourhood
0,Shoppers Drug Mart,Pharmacy,43.759825,-79.225268,Parkwoods
1,GoodLife Fitness Scarborough Cedarbrae Mall,Gym,43.759389,-79.226409,Parkwoods
2,Subway,Sandwich Place,43.759307,-79.224057,Parkwoods
3,RBC Royal Bank,Bank,43.760149,-79.224908,Parkwoods
4,Pizza Pizza,Pizza Place,43.759321,-79.224929,Parkwoods
...,...,...,...,...,...
484,Real Fruit Bubble Tea 真果茶坊,Bubble Tea Shop,43.806709,-79.222759,Upper Rouge
485,Francois' No Frills,Grocery Store,43.808416,-79.223520,Upper Rouge
486,Telfer park,Park,43.807310,-79.226403,Upper Rouge
487,Starbucks,Coffee Shop,43.770037,-79.221156,Underground city


In [459]:
df_venues.groupby('neighbourhood').count()

Unnamed: 0_level_0,name,categories,lat,lng
neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Agincourt,4,4,4,4
Agincourt North,15,15,15,15
Albion Gardens,10,10,10,10
Bathurst Manor,4,4,4,4
Bayview Village,5,5,5,5
...,...,...,...,...
Wilson Heights,13,13,13,13
Woburn,20,20,20,20
Woodbine Gardens,2,2,2,2
York Mills,13,13,13,13


In [461]:
print('There are {} uniques categories \n'.format(len(df_venues['categories'].unique())))
print('There are {} uniques venues \n'.format(len(df_venues['name'].unique())))

There are 60 uniques categories 

There are 89 uniques venues 

