
# **Segmenting and Clustering Toronto Neighborhoods Data**
---



# _Part 1_



In [2]:
import requests
from bs4 import BeautifulSoup as bs
import numpy as np

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

# !conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas import json_normalize # transform a JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# !conda install -c conda-forge folium --yes
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Scraping the Toronto postal codes table from the Wikipedia page with the BeautifulSoup library and converting it into a pandas DataFrame

In [3]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
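webpage = bs(r.content, 'html.parser')  # name the parser explicitly so the result does not depend on which parsers are installed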

table = webpage.select('table.wikitable')[0]
# print(table)
columns = table.find_all('th')
column_names = [str(c.string).strip() for c in columns]
# print(column_names)

l = []
table_rows = table.find('tbody').find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    row = [str(cell.string).strip() for cell in td]  # avoid reusing the loop variable `tr`
    l.append(row)
# print(l[0:10])

df = pd.DataFrame(l, columns=column_names)

df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [4]:
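df = df.drop(index=[0])  # drop the empty first row produced by the header <tr>, which has no <td> cells
df.reset_index(drop=True)  # returns a new DataFrame; the result is not assigned, so df keeps its original index (see output)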
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
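
The empty first row appears because the header `<tr>` holds only `<th>` cells, so `find_all('td')` finds nothing in it. The following is a minimal sketch of one way to drop such a row and get a 0-based index, assuming a small hand-made frame in place of the scraped one (`raw_df` and `demo_df` are illustrative names, not part of the notebook):

```python
import pandas as pd

column_names = ['Postal Code', 'Borough', 'Neighbourhood']

# Hand-made stand-in for the scraped table: the first row is entirely empty.
raw_df = pd.DataFrame(
    [
        [None, None, None],
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M3A', 'North York', 'Parkwoods'],
    ],
    columns=column_names,
)

# Drop rows in which every column is missing, then renumber the index from 0.
# reset_index returns a new DataFrame, so the result has to be assigned.
demo_df = raw_df.dropna(how='all').reset_index(drop=True)

print(demo_df)
```

Assigning the result of `reset_index` is the step that makes the renumbering stick.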


Removing all rows whose Borough is 'Not assigned' from the dataframe

In [5]:
df = df[df['Borough'] != 'Not assigned']

df[df['Neighbourhood'] == 'Not assigned'].count()

df.head()
df.shape

(103, 3)
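
The filtering step can be illustrated on a tiny frame. This is only a sketch with made-up rows (the `demo` data and the fallback of copying the borough name into an unassigned neighbourhood are assumptions for illustration, not something the notebook does):

```python
import pandas as pd

# Made-up rows that mimic the scraped table.
demo = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only rows with an assigned borough (boolean-mask filtering).
assigned = demo[demo['Borough'] != 'Not assigned'].copy()

# Hypothetical fallback: where the neighbourhood is unassigned, reuse the borough name.
missing = assigned['Neighbourhood'] == 'Not assigned'
assigned.loc[missing, 'Neighbourhood'] = assigned.loc[missing, 'Borough']

print(assigned)
```

Taking a `.copy()` after the mask keeps the later `.loc` assignment from acting on a view of the original frame.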

# _Part 2_


Geocoding the postal codes with the geocoder package did not return coordinates reliably, so we skip this code and use the CSV file instead.

In [6]:
# '''
# import geocoder
# latitude=[]
# longitude=[]
# for code in df_new['Postal Code']:
#     g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
#     print(code, g.latlng)
#     while (g.latlng is None):
#         g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
#         print(code, g.latlng)
#     latlng = g.latlng
#     latitude.append(latlng[0])
#     longitude.append(latlng[1])
    
    
# coord_data = [latitude, longitude] 
# coord_labels = ['Latitude', 'Longitude']
# coord_df = pd.DataFrame(coord_data).T
# coord_df.columns = coord_labels
# coord_df.head()


# df_cnd = pd.concat([df_new, coord_df], axis=1)
# df_cnd

# '''
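
The commented-out loop above retries forever when a lookup keeps returning `None`. A bounded retry is one possible alternative. The sketch below assumes a generic `lookup` callable (hypothetical, standing in for whatever geocoding call is used) and tests it with a fake function instead of a real service:

```python
import time

def geocode_with_retry(lookup, query, max_tries=3, delay_s=0.0):
    """Call lookup(query) up to max_tries times and return the first non-None result, else None."""
    for attempt in range(max_tries):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_tries - 1:
            time.sleep(delay_s)  # pause between attempts
    return None

# Fake lookup for demonstration: fails once, then returns made-up coordinates.
calls = {'n': 0}

def fake_lookup(query):
    calls['n'] += 1
    return [43.65, -79.38] if calls['n'] >= 2 else None

coords = geocode_with_retry(fake_lookup, 'M5A, Toronto, Ontario')
print(coords, 'after', calls['n'], 'calls')
```

Returning `None` after the last attempt lets the caller decide how to handle a postal code that could not be resolved.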

Extracting the geographic coordinates from the CSV file into a dataframe

In [7]:
# reading csv file into dataframe

url='http://cocl.us/Geospatial_data'

df_coord = pd.read_csv(url)
df_coord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Sorting the data by postal code and merging the two dataframes on 'Postal Code'

In [8]:
df = df.sort_values('Postal Code')
hoods = pd.merge(df, df_coord, how='right', on='Postal Code')
hoods.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [9]:
hoods.shape

(103, 5)
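
The matching row counts suggest every postal code found a partner, but `pd.merge` can also check this directly. A minimal sketch on made-up frames, using the `validate` and `indicator` parameters of `pd.merge` (the code `M9Z` and its coordinates are invented to force a mismatch):

```python
import pandas as pd

left = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})
right = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is an invented code with no match on the left
    'Latitude': [43.806686, 43.784535, 0.0],
    'Longitude': [-79.194353, -79.160497, 0.0],
})

merged = pd.merge(
    left, right,
    how='right', on='Postal Code',
    validate='one_to_one',  # raise if a postal code is duplicated on either side
    indicator=True,         # add a '_merge' column describing where each row came from
)

# Rows that did not match on both sides.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['Postal Code', '_merge']])
```

With `how='right'`, an unmatched code shows up as a row whose `Borough` is missing, which a plain shape check would not reveal.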

# _Part 3_

Checking the number of Boroughs and neighborhoods

In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(hoods['Borough'].unique()),
        hoods.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


Use geopy library to get the latitude and longitude values of Toronto, Canada.


In [11]:
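address = 'Toronto, CN'  # note: Canada's ISO country code is 'CA'; 'Toronto, Ontario, Canada' would be a less ambiguous query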

geolocator = Nominatim(user_agent="CN_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6425637, -79.38708718320467.


Creating a map of Toronto with neighborhoods superimposed on top using Folium

In [12]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to the map
for lat, lng, borough, neighbourhood in zip(hoods['Latitude'], hoods['Longitude'], hoods['Borough'], hoods['Neighbourhood']):
    label = f'{neighbourhood}, {borough}'
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto

Now we'll filter the data to keep only the neighbourhoods whose borough name contains 'Toronto'.

In [26]:
toronto_data = hoods[hoods['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data.loc[0:15]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


In [27]:
toronto_data.shape

(39, 5)
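
The borough filter relies on substring matching. A small sketch with a hand-made Series (the values, including the missing entry, are illustrative) shows which names `str.contains` keeps:

```python
import pandas as pd

boroughs = pd.Series(['East Toronto', 'Downtown Toronto', 'North York', 'Scarborough', None])

# na=False treats missing values as non-matches instead of propagating NaN into the mask.
mask = boroughs.str.contains('Toronto', na=False)

print(boroughs[mask].tolist())
```

Passing `na=False` matters only when the column can hold missing values, as it could after a merge that leaves some rows unmatched.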

So, there are 39 postal-code areas (rows) in the boroughs whose names contain 'Toronto'.
Now I'll plot these on the Toronto map.

In [29]:
map_toronto1 = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to the map
for lat, lng, borough, neighbourhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighbourhood']):
    label = f'{neighbourhood}, {borough}'
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto1)
    
map_toronto1

## Exploring Neighbourhoods

In this section I'll be using the Foursquare API to explore the neighbourhoods and segment them.

In [30]:
CLIENT_ID = 'KCOVE3SEDEMDU1505SOPFOSZWRS21MTPETWKXB5ZEF3PUFQA' # Foursquare ID
CLIENT_SECRET = '5FNX4DN4RG10KP4LHHQS3NCCKLVS2HFUX53FSDEWQPU33J55' # Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: KCOVE3SEDEMDU1505SOPFOSZWRS21MTPETWKXB5ZEF3PUFQA
CLIENT_SECRET:5FNX4DN4RG10KP4LHHQS3NCCKLVS2HFUX53FSDEWQPU33J55


Let's start by exploring a single neighbourhood.

In [31]:
toronto_data.loc[0, 'Neighbourhood']

'The Beaches'

In [34]:
# Getting geocoordinates of this neighbourhood
hood_lat = toronto_data.loc[0, 'Latitude']
hood_lng = toronto_data.loc[0, 'Longitude']
hood_name = toronto_data.loc[0, 'Neighbourhood']

print(f'The Neighbourhood, {hood_name}, is located on Latitude: {hood_lat} and Longitude: {hood_lng}')

The Neighbourhood, The Beaches, is located on Latitude: 43.67635739999999 and Longitude: -79.2930312


Now, let's get the top 100 venues that are in 'The Beaches' within a radius of 500 meters.

First, let's create the GET request URL.

In [35]:
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    hood_lat, 
    hood_lng, 
    radius, 
    LIMIT)

url

'https://api.foursquare.com/v2/venues/explore?&client_id=KCOVE3SEDEMDU1505SOPFOSZWRS21MTPETWKXB5ZEF3PUFQA&client_secret=5FNX4DN4RG10KP4LHHQS3NCCKLVS2HFUX53FSDEWQPU33J55&v=20180605&ll=43.67635739999999,-79.2930312&radius=500&limit=100'
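
The same query string can also be assembled from a dictionary rather than a long format string. The sketch below uses `urllib.parse.urlencode` from the standard library; the helper name `build_explore_url` and the placeholder credentials are assumptions for illustration only:

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL from named parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lng}',
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll value unescaped, matching the URL shown above.
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

# Placeholder credentials, not real ones.
demo_url = build_explore_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', '20180605',
                             43.6763574, -79.2930312, 500, 100)
print(demo_url)
```

Keeping the parameters in a dictionary makes it easier to see what is sent, and it lets the credentials come from somewhere other than the notebook source, such as environment variables.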

In [36]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5fd39463a06d855f1b49e2ad'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4bd461bc77b29c74a07d9282-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/hikingtrail_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d159941735',
         'name': 'Trail',
         'pluralName': 'Trails',
         'primary': True,
         'shortName': 'Trail'}],
       'id': '4bd461bc77b29c74a07d9282',
       'location': {'address': 'Glen Manor',
        'cc': 'CA',
        'city': 'Toronto',
        'country': 'Canada',
        'crossStreet': 'Queen St.',
        'distance': 89,
        'formattedAddress': ['Glen Manor (Queen St.)', 'Toronto ON', 'Canada'],
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.67682

I'll use the following function to extract the category name of each venue.

In [37]:
# function to extract the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # the flattened frame uses the dotted column name instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now let's clean the json and structure it into a pandas dataframe.

In [38]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869
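
To see what the flattening step does, here is a minimal sketch on a hand-made list shaped like the `items` in the response (the second venue, with an empty category list, is invented to show the `None` case):

```python
import pandas as pd
from pandas import json_normalize

# Hand-made items that mimic the structure of the API response.
items = [
    {'venue': {'name': 'Glen Manor Ravine',
               'categories': [{'name': 'Trail'}],
               'location': {'lat': 43.676821, 'lng': -79.293942}}},
    {'venue': {'name': 'Uncategorised place',  # invented venue with no category
               'categories': [],
               'location': {'lat': 43.68, 'lng': -79.29}}},
]

# Nested dictionaries become dotted column names; lists are kept as cell values.
flat = json_normalize(items)
print(sorted(flat.columns))

# Reduce each category list to the name of its first entry, or None if the list is empty.
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None
)

# Keep only the last part of each dotted name, as in the cell above.
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat)
```
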


In [40]:
print(f'{nearby_venues.shape[0]} venues were returned by Foursquare.')

4 venues were returned by Foursquare.


## Exploring all neighbourhoods in Toronto

#### Now we'll create a function to repeat the above process for all neighbourhoods in Toronto

In [41]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
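
The row-building step inside `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue with no categories. The sketch below shows one possible guarded version of that step as a separate helper; `venue_rows` is a hypothetical name, and the sample items are made up so that it runs without calling the API:

```python
def venue_rows(name, lat, lng, items):
    """Turn the response items for one neighbourhood into flat tuples (hypothetical helper)."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue has no category instead of raising IndexError.
        category = categories[0]['name'] if categories else None
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows

# Made-up items for demonstration.
sample_items = [
    {'venue': {'name': 'Pantheon', 'categories': [{'name': 'Greek Restaurant'}],
               'location': {'lat': 43.677621, 'lng': -79.351434}}},
    {'venue': {'name': 'No category venue', 'categories': [],
               'location': {'lat': 43.68, 'lng': -79.35}}},
]

rows = venue_rows('The Danforth West, Riverdale', 43.679557, -79.352188, sample_items)
for row in rows:
    print(row)
```

Separating the parsing from the HTTP request also makes the parsing easy to check on sample data.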

In [44]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighbourhood'],
                                 latitudes=toronto_data['Latitude'],
                                 longitudes=toronto_data['Longitude']
                                 )

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West,  Lawrence Park
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North & West, Forest Hill Road Park
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High

In [45]:
toronto_venues.shape

(1624, 7)

In [46]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


In [48]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,55,55,55,55,55,55
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16,16,16,16,16,16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,68,68,68,68,68,68
Christie,16,16,16,16,16,16
Church and Wellesley,75,75,75,75,75,75
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,33,33,33,33,33,33
Davisville North,9,9,9,9,9,9
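
Because `count()` repeats the same number in every column, a single count per neighbourhood may be easier to read. A minimal sketch with made-up rows, using `groupby(...).size()`:

```python
import pandas as pd

# Made-up venue rows; only the grouping column matters here.
demo_venues = pd.DataFrame({
    'Neighborhood': ['Berczy Park', 'Berczy Park', 'Christie', 'Davisville', 'Davisville', 'Davisville'],
    'Venue': ['venue A', 'venue B', 'venue C', 'venue D', 'venue E', 'venue F'],
})

# One number per neighbourhood, sorted from most to fewest venues.
venue_counts = demo_venues.groupby('Neighborhood').size().sort_values(ascending=False)
print(venue_counts)
```

Note that `size()` counts rows, while `count()` counts non-missing values per column, so the two differ when a column has gaps.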


In [53]:
print('There are {} unique categories'.format(len(toronto_venues['Venue Category'].unique())))

There are 235 unique categories
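
The number of unique categories can also be read off with `nunique()`, and `value_counts()` shows which categories dominate. A small sketch on a made-up Series (the values are illustrative, not the notebook's data):

```python
import pandas as pd

categories = pd.Series(
    ['Trail', 'Pub', 'Pub', 'Greek Restaurant'],
    name='Venue Category',
)

n_unique = categories.nunique()                 # number of distinct categories
top_categories = categories.value_counts().head(2)  # most frequent categories first

print(n_unique)
print(top_categories)
```
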


## Analyzing each neighbourhood