# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

Start by creating a new Notebook for this assignment.
Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.

# 1. Extracting Postal Codes


## 1.1 Import libraries and extract data:

In [117]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [118]:
# download url data from internet
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
text = BeautifulSoup(source, 'lxml')

## 1.2 Convert data into a DataFrame

We extract the table contents from the parsed HTML (BeautifulSoup was given the 'lxml' parser) using two tags:
    * <tr>: each <tr>..</tr> pair holds one row of the table.
    * <td>: within a row, each <td>..</td> pair holds the data of one cell.
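
As a minimal sketch of the row and cell traversal described above, the following parses a small inline table instead of the live Wikipedia page. The HTML snippet and the `rows` list are made up for illustration, and the built-in `html.parser` backend is assumed here so that the sketch does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not the real page content.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # collect the text of every <td> cell in this row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row only has <th> cells, so it is skipped
        rows.append(cells)

print(rows)
```

Skipping rows without three `<td>` cells is one way to avoid recording the header row at all.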

In [119]:
column_names = ['Postalcode','Borough','Neighborhood']
rows = []  # collect one dict per table row, then build the DataFrame once

# loop through to find postcode, borough, neighborhood
content = text.find('div', class_='mw-parser-output')
table = content.table.tbody
postcode = 0
borough = 0
neighborhood = 0

for tr in table.find_all('tr'):
    i = 0
    for td in tr.find_all('td'):
        if i == 0:
            postcode = td.text
            i = i + 1
        elif i == 1:
            borough = td.text
            i = i + 1
        elif i == 2:
            neighborhood = td.text.strip('\n').replace(']','')
    # the header row has no <td> cells, so it is recorded as 0, 0, 0 and removed later
    rows.append({'Postalcode': postcode,'Borough': borough,'Neighborhood': neighborhood})

# DataFrame.append was removed in pandas 2.0, so the frame is built from the list
data = pd.DataFrame(rows, columns=column_names)


print(data.shape)
data.head()

(288, 3)


Unnamed: 0,Postalcode,Borough,Neighborhood
0,0,0,0
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


### Cleaning the Data

We now have 288 rows of postal codes. However, some of the boroughs are 'Not assigned', and the header row of the table produced one row of zeros. We will remove both.

In [120]:
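# delete all rows where 'Borough' is 'Not assigned' or 0
indexNames = data[ (data['Borough'] == 'Not assigned') | (data['Borough'] == 0) ].index
data.drop(indexNames, inplace = True)

print(data.shape)

# where the neighborhood is 'Not assigned', use the borough name instead
# (.loc assigns in place, which chained indexing does not guarantee)
not_assigned = data['Neighborhood'] == 'Not assigned'
data.loc[not_assigned, 'Neighborhood'] = data.loc[not_assigned, 'Borough']

# combine the neighborhoods that share a postal code and borough into one row
data = data.groupby(['Postalcode','Borough'])['Neighborhood'].apply(', '.join).reset_index()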

(210, 3)


In [121]:
print(data.shape)
data.head()

(103, 3)


Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


### Removing duplicates

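Finally, we regroup by 'Postalcode' and 'Borough' to make sure that no duplicate rows remain, and reset the index. Because the data were already grouped in the previous step, the result is unchanged.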

In [122]:
def neighborhood_list(grouped):    
    return ', '.join(sorted(grouped['Neighborhood'].tolist()))
                    
grp = data.groupby(['Postalcode', 'Borough'])
df = grp.apply(neighborhood_list).reset_index(name='Neighborhood')

In [123]:
print(df.shape)
df.head()

(103, 3)


Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


The dataset with the postal codes originally had 288 rows. Removing the rows with a borough of 'Not assigned' and the row of zeros left 210 rows. Grouping the neighborhoods that share a 'Postalcode' and 'Borough' then gave 103 rows, one per unique postal code in the dataset.
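
The grouping step can be illustrated on a toy table. This is only a sketch: the three sample rows are made up for illustration, and `agg` with `', '.join` is used as one possible way to combine the names.

```python
import pandas as pd

# Toy rows: two neighborhoods share the postal code M1B (illustrative sample only).
sample = pd.DataFrame({
    "Postalcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# One row per (Postalcode, Borough), with the neighborhood names joined by commas.
grouped = (
    sample.groupby(["Postalcode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(grouped)
```

Three input rows become two output rows, which is the same kind of reduction that takes the full table from 210 rows to 103.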

In [124]:
df.to_csv('Toronto_Q1.csv',index=False)

# 2. Adding Location data

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The drawback of this package is that a lookup does not always succeed on the first try. A call for the coordinates of a given postal code can return None, and the same call made again can return the coordinates. So, in order to make sure that you get the coordinates for all of the neighborhoods, you can run a while loop for each postal code.
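
A minimal sketch of such a retry loop is shown below. The `lookup` argument is a hypothetical stand-in for whatever geocoding call is used (for example, a function that wraps a Geocoder provider and returns its latitude and longitude), and `max_attempts` is an added assumption so that the loop cannot run forever. The `flaky_lookup` function only simulates a service that fails twice before answering.

```python
def get_coordinates(postal_code, lookup, max_attempts=10):
    """Call `lookup` until it returns coordinates or the attempts run out."""
    coords = None
    attempts = 0
    while coords is None and attempts < max_attempts:
        coords = lookup('{}, Toronto, Ontario'.format(postal_code))
        attempts += 1
    return coords


# Simulated geocoder: returns None on the first two calls, then a coordinate pair.
calls = {"count": 0}

def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.806686, -79.194353]


print(get_coordinates("M1B", flaky_lookup))
```

If every attempt fails, the function returns None, so the caller can decide how to handle a postal code that could not be resolved.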

In [125]:
# read the coordinates from a CSV file instead of calling the geocoder for each postal code
!wget -q -O Toronto_long_lat_data.csv  http://cocl.us/Geospatial_data
df_position = pd.read_csv('Toronto_long_lat_data.csv')
df_position.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We will then merge the two tables on the postal code ('Postalcode' in the first table, 'Postal Code' in the second) and drop the duplicate key column 'Postal Code'.

In [126]:
df_org = df 
df = pd.merge(df_org, df_position,
                 left_on='Postalcode', right_on='Postal Code')

df = df.drop(columns=['Postal Code'])

print(df.shape)
df.head()


(103, 5)


Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
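
The merge above is an inner join, so a postal code without a match in the coordinate file would be dropped silently. The shape (103, 5) shows that nothing was lost here. As a sketch of how this could be checked explicitly, the example below uses a left join with pandas' `indicator` option. The two small tables are made up, and 'M9Z' is a fictitious code chosen to have no match.

```python
import pandas as pd

# Illustrative sample tables; 'M9Z' is a made-up code without coordinates.
codes = pd.DataFrame({"Postalcode": ["M1B", "M1C", "M9Z"],
                      "Borough": ["Scarborough", "Scarborough", "Example"]})
coords = pd.DataFrame({"Postal Code": ["M1B", "M1C"],
                       "Latitude": [43.806686, 43.784535],
                       "Longitude": [-79.194353, -79.160497]})

# A left join keeps every postal code; '_merge' records where each row came from.
merged = pd.merge(codes, coords, how="left",
                  left_on="Postalcode", right_on="Postal Code",
                  indicator=True)

# Postal codes that found no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postalcode"].tolist()
print(unmatched)

# Drop the duplicate key and the helper column.
merged = merged.drop(columns=["Postal Code", "_merge"])
```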


In [127]:
df.to_csv('Toronto_Q2.csv',index=False)

# 3. Explore and Cluster the neighborhoods in Toronto

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

* to add enough Markdown cells to explain what you decided to do and to report any observations you make.
* to generate maps to visualize your neighborhoods and how they cluster together.

In [128]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [133]:
address = 'Toronto, Ontario Canada'

geolocator = Nominatim(user_agent="Toronto")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))


toronto_neighborhoods = pd.read_csv('Toronto_Q2.csv')

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [130]:
import folium

toronto_location = [location.latitude, location.longitude]

map_toronto = folium.Map(location=toronto_location, zoom_start=12)
map_toronto

In [137]:
# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_neighborhoods['Latitude'], toronto_neighborhoods['Longitude'], toronto_neighborhoods['Borough'], toronto_neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#87cefa',
        fill_opacity=0.5,
        parse_html=False).add_to(map_toronto)

In [138]:
(map_toronto)

In [143]:
# @hidden_cell
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [145]:

scarborough_data = toronto_neighborhoods[toronto_neighborhoods['Borough'] == 'Scarborough'].reset_index(drop=True)
scarborough_data.head(7)

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029


In [156]:

LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
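
As an alternative sketch, the same query string can be assembled from a dictionary with the standard library's `urlencode`, which keeps each parameter on its own line and escapes special characters. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration. Note that `urlencode` writes the comma in `ll` as `%2C`, the percent-encoded form of the same character.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng, version,
                      radius=500, limit=100):
    """Build the venues/explore URL from named parameters (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude of the search center
        "v": version,
        "radius": radius,                # search radius in meters
        "limit": limit,                  # maximum number of venues to return
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials; replace them with your own.
example_url = build_explore_url("ID", "SECRET", 43.653963, -79.387207, "20180605")
print(example_url)
```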

In [157]:
results = requests.get(url).json()

In [159]:
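def get_category_type(row):
    # the category list is stored under 'categories' or, after flattening, under 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # return the name of the first category, or None if the venue has none
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']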
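
To see how the two key names and the empty case are handled, here is a compact variant applied to made-up rows. The name `first_category_name` and the sample dictionaries are illustrative assumptions. The variant uses `.get`, which both plain dictionaries and pandas rows provide, instead of `try`/`except`.

```python
def first_category_name(row):
    """Return the first category name, or None when the list is empty or missing."""
    categories = row.get('categories', row.get('venue.categories', []))
    return categories[0]['name'] if categories else None

# Made-up rows covering both key names and an empty category list.
sample_rows = [
    {'venue.categories': [{'name': 'Sushi Restaurant'}]},
    {'categories': [{'name': 'Tea Room'}, {'name': 'Cafe'}]},
    {'categories': []},
]

names = [first_category_name(row) for row in sample_rows]
print(names)
```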

In [161]:
import json # library to handle JSON files
# pd.json_normalize transforms nested JSON into a pandas dataframe
# (it replaces the deprecated pandas.io.json.json_normalize)

venues = results['response']['groups'][0]['items']
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(10)

Unnamed: 0,name,categories,lat,lng
0,Downtown Toronto,Neighborhood,43.653232,-79.385296
1,Japango,Sushi Restaurant,43.655268,-79.385165
2,Rolltation,Japanese Restaurant,43.654918,-79.387424
3,Sansotei Ramen 三草亭,Ramen Restaurant,43.655157,-79.386501
4,Poke Guys,Poke Place,43.654895,-79.385052
5,Tsujiri,Tea Room,43.655374,-79.385354
6,Fugo Desserts,Ice Cream Shop,43.654923,-79.387382
7,Manpuku まんぷく,Japanese Restaurant,43.653612,-79.390613
8,Karine's,Breakfast Spot,43.653699,-79.390743
9,Yueh Tung Chinese Restaurant,Chinese Restaurant,43.655281,-79.385337
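
The flattening and column clean-up can be tried without calling the API. The sketch below uses two made-up items shaped like the entries of the 'items' list above. They are not real API output, and the venue names are invented.

```python
import pandas as pd

# Made-up items with the same nesting as the 'items' list (illustrative only).
items = [
    {"venue": {"name": "Cafe A",
               "categories": [{"name": "Coffee Shop"}],
               "location": {"lat": 43.65, "lng": -79.38}}},
    {"venue": {"name": "Park B",
               "categories": [],
               "location": {"lat": 43.66, "lng": -79.39}}},
]

# Nested keys become dotted column names such as 'venue.location.lat'.
flat = pd.json_normalize(items)
flat = flat[["venue.name", "venue.categories",
             "venue.location.lat", "venue.location.lng"]].copy()

# Reduce each category list to its first name, or None when the list is empty.
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# Keep only the last part of each dotted column name.
flat.columns = [col.split(".")[-1] for col in flat.columns]

print(flat)
```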
