# Applied Data Science Capstone (Coursera)

*This notebook is for the assignment specified in Week 3.*

The aim of this assignment is to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. For the Toronto neighborhood data, a Wikipedia [page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) exists that has all the information we need to explore and cluster the neighborhoods in Toronto. 

---

## Part 1: Obtain the data from the Wikipedia page and create a working dataframe.

The first step is to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset. The link to the Wikipedia page is https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.

The libraries shown in the code cell below are required for this project. Please ensure that they are installed before running the code. The website for [PyPI, the Python Package Index](https://pypi.org/) provides instructions on how to install these libraries.

Import the required libraries for the project:

In [2]:
from bs4 import BeautifulSoup
import requests
import json
import numpy as np
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
%matplotlib inline 
from sklearn.cluster import KMeans 
from geopy.geocoders import Nominatim 

print('All libraries have been imported.')

All libraries have been imported.


The **requests library** is used to collect data from the assigned webpage, as specified by the ```URL``` variable in the cell below. The ```requests.get()``` function was used to obtain the data from the webpage and the result assigned to the ```response``` variable.

In [3]:
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(URL)

In the cell below, the ```status_code``` attribute of the ```response``` variable was called to check if the webpage was downloaded successfully. The returned code of ```200``` shows that the webpage downloaded successfully. 

In [4]:
response.status_code

200
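As a hedged alternative to inspecting the status code by hand, the request can be wrapped in a small helper that raises an exception on any HTTP error. This is only a sketch: `fetch_html` and its `timeout` and `session` parameters are hypothetical names that are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the page body as text, raising an exception on HTTP errors.

    `session` is an optional stand-in for the requests module (for example a
    requests.Session); it is an assumption added to make the helper testable.
    """
    http = session or requests
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(URL)` would then either return the HTML or stop the notebook with an explicit error, rather than continuing with an incomplete page.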

Next, using Python’s built-in ```html.parser```, the ```response.text``` document was parsed to obtain a nested data structure. This is assigned to the variable ```soup```.

In [5]:
soup = BeautifulSoup(response.text, 'html.parser')

To check that the correct webpage was scraped, the ```title``` attribute of the ```soup``` variable was accessed.

In [6]:
print(soup.title)

<title>List of postal codes of Canada: M - Wikipedia</title>


The data needed to create the dataframe is within a table with 3 columns, 'Postal Code', 'Borough' and 'Neighbourhood'. On the Wikipedia [page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), HTML elements for each item can be viewed by inspecting the source. This can be done by pressing ```CTRL``` + ```SHIFT``` + ```I```, which opens the ```Developer Tools``` pane. Viewing the contents of the webpage and the ```Developer Tools``` pane side by side makes it easier to read the HTML. Within the pane, the elements can be expanded and collapsed as desired by clicking on the little gray triangular button. When the mouse cursor hovers over the list of HTML elements in the ```Developer Tools``` pane, the matching contents of the webpage are highlighted, which makes it easier to identify the block of HTML elements corresponding to the table. The element identifying the table is: ```<table class="wikitable sortable jquery-tablesorter">```.

The ```find``` method is then used to obtain the table using the attributes specified by the ```class``` and the result assigned to the ```table``` variable.

In [7]:
table = soup.find("table", attrs={"class": "wikitable"})

The ```<tr>``` tag defines a table row. The ```find_all()``` method is used to identify and extract all rows using the string, "tr", from the table body (defined by the attribute, ```tbody```). The result is assigned to the ```rows``` variable and the ```len``` function is used to count the number of rows in the table.

In [8]:
rows = table.tbody.find_all("tr")

In [9]:
print('The number of rows found in the table is', len(rows))

The number of rows found in the table is 181


The following code shows the contents of the first item in the ```rows``` variable.

In [10]:
print(rows[0])

<tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>


The HTML tags need to be removed before creating the dataframe. A new list called ```table_data``` is initialized, and a ```for``` loop appends the text of each row (```row.text``` drops the HTML tags) after replacing the newline characters with tabs ('\t') and stripping the surrounding whitespace.

In [11]:
table_data = []
for row in rows:
    table_data.append(row.text.replace('\n', '\t').strip())

A new dataframe called ```df``` was then made using **pandas**, as shown in the code below. The first 5 rows of the dataframe can be displayed using the ```.head()``` method.

In [12]:
df = pd.DataFrame(table_data, columns=['col1'])
df.head()

Unnamed: 0,col1
0,Postal Code\t\tBorough\t\tNeighbourhood
1,M1A\t\tNot assigned\t\tNot assigned
2,M2A\t\tNot assigned\t\tNot assigned
3,M3A\t\tNorth York\t\tParkwoods
4,M4A\t\tNorth York\t\tVictoria Village
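As a hedged alternative to joining each row into one tab-separated string, the cell text can be collected cell by cell, which gives the three columns directly. The sketch below runs on a tiny hand-written HTML snippet rather than the live page; `sample_html`, `sample_rows` and `sample_df` are illustrative names that do not appear in the original notebook.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia table; the real page has many more rows.
sample_html = """
<table class="wikitable">
<tbody>
<tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_rows = sample_soup.find("table", attrs={"class": "wikitable"}).find_all("tr")

# The first row holds the <th> header cells; the remaining rows hold <td> cells.
header_cells = [cell.get_text(strip=True) for cell in sample_rows[0].find_all("th")]
body_cells = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in sample_rows[1:]
]

sample_df = pd.DataFrame(body_cells, columns=header_cells)
print(sample_df)
```

Because each cell is read separately, no delimiter has to be chosen and the header row never enters the data.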


The following cells contain the code to get the dataframe, ```df```, into the right format before it is used for any data wrangling or cleaning steps. A new dataframe is created with each substantial formatting step, to avoid having to re-make the original dataframe if a mistake is made. Doing this also helps to keep track of the changes made to the original dataframe.

The contents of each row need to be split on the double-tab delimiter, '\t\t'. A new dataframe, ```df1```, is created after the split.

In [13]:
df1 = df.col1.str.split('\t\t', expand=True)
df1

Unnamed: 0,0,1,2
0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
...,...,...,...
176,M5Z,Not assigned,Not assigned
177,M6Z,Not assigned,Not assigned
178,M7Z,Not assigned,Not assigned
179,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


The first row contains the header for each column. The following code extracts that row, assigning it to a pandas Series called ```header```.

In [14]:
header = df1.iloc[0]
header

0      Postal Code
1          Borough
2    Neighbourhood
Name: 0, dtype: object

A new dataframe, ```df2```, was created by extracting the rows from the second row onwards from ```df1```.

In [15]:
df2 = df1[1:]
df2

Unnamed: 0,0,1,2
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
176,M5Z,Not assigned,Not assigned
177,M6Z,Not assigned,Not assigned
178,M7Z,Not assigned,Not assigned
179,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


The ```header``` Series is then used to name the columns in ```df2```.

In [16]:
df2.columns = header
df2

Unnamed: 0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
176,M5Z,Not assigned,Not assigned
177,M6Z,Not assigned,Not assigned
178,M7Z,Not assigned,Not assigned
179,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


The index of ```df2``` was then reset, as shown below.

In [17]:
df2.reset_index(inplace=True)

In [18]:
df2

Unnamed: 0,index,Postal Code,Borough,Neighbourhood
0,1,M1A,Not assigned,Not assigned
1,2,M2A,Not assigned,Not assigned
2,3,M3A,North York,Parkwoods
3,4,M4A,North York,Victoria Village
4,5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...,...
175,176,M5Z,Not assigned,Not assigned
176,177,M6Z,Not assigned,Not assigned
177,178,M7Z,Not assigned,Not assigned
178,179,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


After the index has been reset, a new column containing the previous index values was created ('index'). This column was then deleted using the ```.drop()``` method and the remaining columns passed to a new dataframe, ```df3```.

In [19]:
df3 = df2.drop('index', axis=1)

In [20]:
df3

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
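The header promotion, index reset and column drop above can also be folded into fewer steps. The following is a minimal sketch on a toy frame shaped like ```df1```; `raw` and `tidy` are hypothetical names, and `reset_index(drop=True)` is used so that no 'index' column is created in the first place.

```python
import pandas as pd

# Toy frame shaped like df1: the header sits in row 0.
raw = pd.DataFrame([
    ["Postal Code", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])

tidy = raw.iloc[1:].copy()            # data rows only, as an independent copy
tidy.columns = raw.iloc[0].tolist()   # plain list, so the columns carry no leftover name
tidy = tidy.reset_index(drop=True)    # renumber from 0 without adding an 'index' column

print(tidy)
```

Passing a plain list as the column labels also avoids the `name=0` that shows up on the columns when a Series is assigned.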


Now the dataframe is ready to be wrangled or cleaned as per the instructions set out in the assignment. Before starting, a new dataframe, ```df4``` was cloned from the correctly formatted dataframe, ```df3```.

In [21]:
df4 = df3.copy()
df4.columns

Index(['Postal Code', 'Borough', 'Neighbourhood'], dtype='object', name=0)

The following code shows the data types found in each column of the dataframe.

In [22]:
df4.dtypes

0
Postal Code      object
Borough          object
Neighbourhood    object
dtype: object

The first step was to process only the cells that have an assigned borough and ignore cells with a borough that is 'Not assigned'. The following code tests whether the column 'Borough' contains the string 'Not assigned' and returns a Boolean result for each row.

In [23]:
df4['Borough'] == 'Not assigned'

0       True
1       True
2      False
3      False
4      False
       ...  
175     True
176     True
177     True
178    False
179     True
Name: Borough, Length: 180, dtype: bool

Rows with the string 'Not assigned' in the column 'Borough' were dropped from the dataframe. The option ```inplace = True``` means that the dataframe is modified in place rather than a new one being returned. The resulting dataframe is then checked to view the changes made.

In [24]:
df4.drop(df4[df4['Borough'] == 'Not assigned'].index, inplace = True) 


In [25]:
df4

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Another way to check if there are any cells with the string 'Not assigned' in the column 'Borough' is to use the ```.any()``` method. The resulting Boolean value, ```False```, shows that there is no cell with that string in the column.

In [26]:
(df4['Borough'] == 'Not assigned').any()

False

If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood will be the same as the borough. As above, the ```.any()``` method was used on the column 'Neighbourhood' to check for the string. The resulting Boolean value, ```False```, shows that there was none.

In [27]:
(df4['Neighbourhood'] == 'Not assigned').any()

False
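Although no such rows were found here, the fallback rule itself can be sketched on a toy frame. The rows below are hypothetical (the postal code 'X0X' and 'Example Borough' are made up for illustration), and `toy`, `assigned` and `missing` are names that do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical rows; the second one has a borough but no neighbourhood.
toy = pd.DataFrame({
    "Postal Code": ["M3A", "X0X", "M1A"],
    "Borough": ["North York", "Example Borough", "Not assigned"],
    "Neighbourhood": ["Parkwoods", "Not assigned", "Not assigned"],
})

# Keep only rows with an assigned borough; .copy() gives an independent frame,
# so the assignment below does not trigger a SettingWithCopyWarning.
assigned = toy[toy["Borough"] != "Not assigned"].copy()

# Where the neighbourhood is missing, fall back to the borough name.
missing = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]

print(assigned)
```

Filtering with a Boolean mask and then copying is one way to avoid modifying a slice of another dataframe.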

The next step was to check if a postal code is shared by more than one neighbourhood. Taking another look at ```df4``` using the code below didn't help much, as the entire dataframe couldn't be seen. However, the summary line below the dataframe shows that it has 103 rows and 3 columns.

In [28]:
df4

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


The ```.groupby()``` method is used in the following code to group the data by values specified in the 'Postal Code'. The ```.agg()``` method is used to aggregate the data sharing the same string in the column 'Postal Code', joining the values in the column 'Neighbourhood' using a comma. The result was assigned to a new dataframe, ```df5```.

In [29]:
df5 = df4.groupby('Postal Code').agg(lambda x: ','.join(x))
df5

Unnamed: 0_level_0,Borough,Neighbourhood
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,Scarborough,"Malvern, Rouge"
M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
M1E,Scarborough,"Guildwood, Morningside, West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
...,...,...
M9N,York,Weston
M9P,Etobicoke,Westmount
M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
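To illustrate what the grouping does when a postal code really does span several rows, here is a hedged sketch on a hypothetical layout in which 'M1B' appears twice. It uses a per-column aggregation (an assumption, not the notebook's own call) so that the borough is kept once rather than joined with itself.

```python
import pandas as pd

# Hypothetical layout in which one postal code spans two rows.
toy = pd.DataFrame({
    "Postal Code": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Malvern", "Rouge", "Woburn"],
})

combined = (
    toy.groupby("Postal Code", as_index=False)
       .agg({"Borough": "first", "Neighbourhood": ", ".join})
)

print(combined)
```

With ```as_index=False``` the postal code stays a regular column, so no separate index reset is needed in this variant.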


The new dataframe uses 'Postal Code' as its index rather than a default integer index, so ```.reset_index(inplace=True)``` was used to reset the index in place. A quick check of the dataframe shows that it now has the proper indexing and 3 columns - 'Postal Code', 'Borough' and 'Neighbourhood'.

In [30]:
df5.reset_index(inplace=True)

In [31]:
df5

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


The ```.nunique()``` method counts the distinct values in the column 'Postal Code'. The result, 103, matches the number of rows, which confirms that each postal code appears only once.

In [32]:
df5['Postal Code'].nunique()

103

In the last cell of this part, the ```.shape``` attribute is used to print the number of rows of the cleaned dataframe.

In [33]:
print('The data frame has', df5.shape[0], 'rows')

The data frame has 103 rows


--- 

## Part 2: Get the latitude and the longitude coordinates of each neighborhood.

In order to utilize the Foursquare location data, the latitude and the longitude coordinates of each neighborhood must be obtained and appended to the cleaned dataframe as two separate columns called 'Latitude' and 'Longitude'.

A comma-separated values (CSV) file containing the latitude and longitude for each postal code has been provided (Geospatial_Coordinates.csv). The following code reads the file into a new dataframe called ```latlng```.

In [34]:
latlng = pd.read_csv('Geospatial_Coordinates.csv')

In [35]:
latlng.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


As a dataframe merge is needed to add the coordinates for each postal code in ```df5```, the data types of the ```latlng``` dataframe were checked. The values in the column 'Postal Code' in both dataframes are of the same type - ```object```.

In [36]:
latlng.dtypes

Postal Code     object
Latitude       float64
Longitude      float64
dtype: object

The following code uses the ```pd.merge()``` function to join both dataframes on the values in the column 'Postal Code', which is common to both.

In [37]:
df6 = pd.merge(df5, latlng, on=['Postal Code'])
df6.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


This is to check that the resulting dataframe has the same number of rows as previously but with two new columns added.

In [38]:
print('The current data frame has', df6.shape[0], 'rows and', df6.shape[1], 'columns')

The current data frame has 103 rows and 5 columns
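The row-count check can be complemented by asking pandas to validate the merge itself. The sketch below is an illustration on toy data, not the notebook's own code: it uses the `how`, `validate` and `indicator` parameters of `pd.merge`, and the postal code 'X0X' is made up to show what an unmatched row looks like.

```python
import pandas as pd

places = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "X0X"],  # "X0X" is a made-up code with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every row of `places`; validate raises if a key is duplicated
# on either side; indicator adds a "_merge" column recording where each row matched.
checked = pd.merge(places, coords, on="Postal Code", how="left",
                   validate="one_to_one", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postal code received coordinates.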


--- 

## Part 3: Explore and cluster the neighborhoods in Toronto. 

The final part involves working only with boroughs that contain the word 'Toronto' and then replicating the analysis done with the New York City data in the labs.

The dataframe created in Part 2 is assigned to a new variable called ```neighborhoods```.

In [39]:
neighborhoods = df6.copy()
neighborhoods.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


The following code uses the **geopy library** to obtain the coordinates of Toronto in Ontario, Canada.

In [40]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


The following code creates a map of Toronto using the **folium library**:

In [41]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

There are 4 boroughs with the word 'Toronto' in their names - Downtown Toronto, Central Toronto, East Toronto and West Toronto. 

In [42]:
city_data = neighborhoods[neighborhoods['Borough'].isin(['Downtown Toronto', 'Central Toronto', 'East Toronto', 'West Toronto'])].reset_index(drop=True)
city_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [43]:
print('The dataframe for all boroughs with the word `Toronto` has', city_data.shape[0], 'rows')

The dataframe for all boroughs with the word `Toronto` has 39 rows
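Instead of listing the four borough names by hand, the same selection can be sketched with a substring test. This is an alternative under the assumption that every borough of interest contains the literal word 'Toronto'; `toy` and `toronto_only` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["East Toronto", "Scarborough", "Downtown Toronto", "Etobicoke"],
    "Neighbourhood": ["The Beaches", "Woburn", "Regent Park, Harbourfront", "Westmount"],
})

# regex=False treats "Toronto" as a plain substring rather than a pattern.
mask = toy["Borough"].str.contains("Toronto", regex=False)
toronto_only = toy[mask].reset_index(drop=True)

print(toronto_only)
```

The explicit list used above is stricter, while the substring test adapts if the borough names in the source table change.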


In [44]:
city = 'Toronto, ON'
boroughs = ['Downtown Toronto', 'Central Toronto', 'East Toronto', 'West Toronto']

for borough in boroughs:
    address = borough + ', ' + city
    geolocator = Nominatim(user_agent="toronto_explorer")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print('The geographical coordinate of ', borough, 'are {}, {}.'.format(latitude, longitude))

The geographical coordinate of  Downtown Toronto are 43.6563221, -79.3809161.
The geographical coordinate of  Central Toronto are 43.65238435, -79.38356765.
The geographical coordinate of  East Toronto are 43.626243, -79.396962.
The geographical coordinate of  West Toronto are 43.65238435, -79.38356765.


In [45]:
map_city = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(city_data['Latitude'], city_data['Longitude'], city_data['Borough'], city_data['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_city)  
    
map_city

The Foursquare API was used to explore these neighborhoods and segment them.

In [69]:
CLIENT_ID = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' 
CLIENT_SECRET = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' 
ACCESS_TOKEN = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' 
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
CLIENT_SECRET:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


In [47]:
city_data.loc[0, 'Neighbourhood']

'The Beaches'

In [48]:
neighborhood_latitude = city_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = city_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = city_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


The following code gets the top 100 venues that are in The Beaches within a radius of 500 meters.

In [70]:
LIMIT = 100 
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&client_secret=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&v=20180605&ll=43.67635739999999,-79.2930312&radius=500&limit=100'
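The same URL can also be assembled from a dictionary of parameters, which leaves the escaping to the standard library. The following is a sketch only: `build_explore_url` is a hypothetical helper, the credentials are placeholders, and the endpoint and parameter names are copied from the URL above.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL used above, letting urlencode escape each value."""
    query = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    })
    return "https://api.foursquare.com/v2/venues/explore?" + query


# Placeholder credentials; real values would come from CLIENT_ID and CLIENT_SECRET.
example_url = build_explore_url("ID", "SECRET", "20180605",
                                43.6763574, -79.2930312, 500, 100)
print(example_url)
```

Note that ```urlencode``` escapes the comma in the ```ll``` value as ```%2C```, so the printed URL is not character-for-character identical to the one above.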

A GET request was sent to obtain the desired information, which was then passed to a variable called ```results```. 

In [50]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ffa3e430bf2b44878110a47'},
 'response': {'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 5,
  'suggestedBounds': {'ne': {'lat': 43.680857404499996,
    'lng': -79.28682091449052},
   'sw': {'lat': 43.67185739549999, 'lng': -79.29924148550948}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.67682094413784,
          'lng': -79.29394208780985}],
        'distanc

All the information that is needed can be found in the _items_ key. The **get_category_type** function is used to extract the category of the venues found.

In [51]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

The following code obtains the required information about the venues and structures that information into a _pandas_ dataframe.

In [52]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869
4,Seaspray Restaurant,Asian Restaurant,43.678888,-79.298167


The following shows the number of venues returned by the Foursquare API.

In [53]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

5 venues were returned by Foursquare.


The following function repeats the process used on The Beaches for all the neighborhoods in the four Toronto boroughs.

In [54]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
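The function above assumes that every response contains a group and that every venue has at least one category. A more defensive extraction step could look like the sketch below; `extract_venue_rows` is a hypothetical helper, and the sample payload is hand-written to mimic the structure shown earlier (its second venue is made up).

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore response into tuples, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written payload mimicking the structure above; the second venue is made up.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Glen Manor Ravine",
               "location": {"lat": 43.67682094413784, "lng": -79.29394208780985},
               "categories": [{"name": "Trail"}]}},
    {"venue": {"name": "Uncategorised Place",
               "location": {"lat": 43.0, "lng": -79.0},
               "categories": []}},
]}]}}

venue_rows = extract_venue_rows(sample_payload, "The Beaches", 43.676357, -79.293031)
print(venue_rows)
```

An empty or malformed response then yields an empty list for that neighbourhood instead of stopping the whole loop.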

Running the ```getNearbyVenues``` function prints the name of each neighbourhood found in the boroughs of Downtown Toronto, Central Toronto, East Toronto and West Toronto as it is processed.

In [55]:
# collect nearby venues for every neighbourhood in the four Toronto boroughs
city_venues = getNearbyVenues(names=city_data['Neighbourhood'], latitudes=city_data['Latitude'], longitudes=city_data['Longitude'])

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West,  Lawrence Park
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North & West, Forest Hill Road Park
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High

The following code cell shows that the new ```city_venues``` dataframe has 1610 rows and 7 columns, while ```.head()``` shows the first 5 rows.

In [56]:
print(city_venues.shape)
city_venues.head()

(1610, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,The Beaches,43.676357,-79.293031,Seaspray Restaurant,43.678888,-79.298167,Asian Restaurant


The number of venues returned for each neighbourhood is as shown below:

In [57]:
city_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,59,59,59,59,59,59
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16,16,16,16,16,16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",17,17,17,17,17,17
Central Bay Street,61,61,61,61,61,61
Christie,16,16,16,16,16,16
Church and Wellesley,78,78,78,78,78,78
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,33,33,33,33,33,33
Davisville North,7,7,7,7,7,7


The following cell shows the code to obtain the number of unique categories found.

In [58]:
print('There are {} unique categories.'.format(len(city_venues['Venue Category'].unique())))

There are 235 unique categories.


In order to segment and cluster the neighborhoods on a map of Toronto showing the four boroughs, the categorical values in the ```city_venues``` dataframe need to be converted to numerical values using one-hot encoding. A new dataframe called ```city_onehot``` is created.

In [59]:
city_onehot = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")

city_onehot['Neighbourhood'] = city_venues['Neighbourhood'] 

fixed_columns = [city_onehot.columns[-1]] + list(city_onehot.columns[:-1])
city_onehot = city_onehot[fixed_columns]

city_onehot.head()

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [60]:
print('The new city_onehot dataframe has', city_onehot.shape[0], 'rows and', city_onehot.shape[1], 'columns')

The new city_onehot dataframe has 1610 rows and 236 columns


Following one-hot encoding of the dataframe, the rows are grouped by neighbourhood. The mean of each one-hot column, computed with ```.mean()```, gives the relative frequency of each venue category in each neighbourhood.

In [61]:
city_grouped = city_onehot.groupby('Neighbourhood').mean().reset_index()
city_grouped

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.016949,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.058824,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.016393,0.0,0.0,0.016393,0.0,0.016393
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.012821,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.012821,0.0,0.0,0.025641
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [62]:
print('The new city_grouped dataframe has', city_grouped.shape[0], 'rows and', city_grouped.shape[1], 'columns')

The new city_grouped dataframe has 39 rows and 236 columns
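
The effect of averaging 0/1 indicators can be checked by hand on a small example. The sketch below uses an invented `toy_onehot` dataframe (an assumption for illustration only) with four venues in neighbourhood A and two in neighbourhood B.

```python
import pandas as pd

# Hypothetical one-hot rows: one row per venue
toy_onehot = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Bakery':        [1, 1, 0, 0, 0, 1],
    'Park':          [0, 0, 1, 0, 0, 0],
    'Pub':           [0, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the share of rows equal to 1
toy_grouped = toy_onehot.groupby('Neighbourhood').mean().reset_index()

print(toy_grouped)
```

Neighbourhood A has two bakeries out of four venues, so its `Bakery` frequency is 0.5. Because every venue belongs to exactly one category, the frequencies in each row sum to 1.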


The following code finds the 10 most common venue categories in each neighbourhood, ordered from the most frequent to the least frequent:

In [63]:
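def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighbourhood name) and keep only the category frequencies
    row_categories = row.iloc[1:]
    # sort the frequencies from highest to lowest
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    # return the names of the most frequent categories
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

# ordinal suffixes for the first three positions; later positions use 'th'
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))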

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = city_grouped['Neighbourhood']

for ind in np.arange(city_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(city_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Cheese Shop,Bakery,Beer Bar,Farmers Market,Restaurant,Seafood Restaurant,Park,Clothing Store
1,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Coffee Shop,Furniture / Home Store,Burrito Place,Restaurant,Stadium,Italian Restaurant,Intersection,Bar
2,"Business reply mail Processing Centre, South C...",Light Rail Station,Yoga Studio,Garden,Comic Shop,Pizza Place,Restaurant,Burrito Place,Brewery,Skate Park,Farmers Market
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Harbor / Marina,Sculpture Garden,Airport Food Court,Airport Gate,Boat or Ferry,Boutique,Coffee Shop
4,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Japanese Restaurant,Salad Place,Thai Restaurant,Bubble Tea Shop,Burger Joint,Yoga Studio


In [64]:
print('The neighborhoods_venues_sorted dataframe has', neighborhoods_venues_sorted.shape[0], 'rows and', neighborhoods_venues_sorted.shape[1], 'columns')

The neighborhoods_venues_sorted dataframe has 39 rows and 11 columns
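
The ranking can also be done for all rows at once instead of row by row. The following is a sketch of one possible alternative, not the method used in this notebook; the helper `top_categories`, the `Rank N` column names, and the toy frequencies are all hypothetical. Ties are resolved by column order here, which may differ from the ordering that `sort_values` produces for tied frequencies.

```python
import numpy as np
import pandas as pd

# Hypothetical frequency table in the same layout as city_grouped
toy_grouped = pd.DataFrame({
    'Neighbourhood': ['A', 'B'],
    'Bakery': [0.5, 0.1],
    'Park':   [0.2, 0.6],
    'Pub':    [0.3, 0.3],
})

def top_categories(grouped, n):
    """Return the n most frequent category names for every neighbourhood."""
    freqs = grouped.set_index('Neighbourhood')
    # Sorting the negated values gives descending order; a stable sort
    # keeps the original column order when frequencies are tied
    order = np.argsort(-freqs.to_numpy(), axis=1, kind='stable')[:, :n]
    # Look up the category name for every selected column position
    names = freqs.columns.to_numpy()[order]
    cols = ['Rank {}'.format(i + 1) for i in range(n)]
    return pd.DataFrame(names, index=freqs.index, columns=cols).reset_index()

toy_top = top_categories(toy_grouped, 2)
print(toy_top)
```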


The *k*-means algorithm is used to group the neighbourhoods into 5 clusters, based on their venue category frequencies.

In [65]:
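kclusters = 5

# drop the label column so that only the numeric category frequencies remain
city_grouped_clustering = city_grouped.drop(columns='Neighbourhood')

# fit k-means; random_state fixes the centroid initialisation for reproducibility
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(city_grouped_clustering)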

The following code displays the cluster labels assigned to the first 10 rows of the dataframe.

In [66]:
kmeans.labels_[0:10] 

array([2, 2, 2, 2, 2, 0, 2, 2, 0, 2])
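
As a rough illustration of what the labels mean, the sketch below fits `KMeans` to four invented frequency rows that form two obvious groups. The array `toy_freqs` and the choice of two clusters are assumptions for the example, and `n_init=10` is set explicitly only to avoid depending on the library default. The label numbers themselves are arbitrary; only which rows share a label is meaningful.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical category frequencies: rows 0-1 are similar, rows 2-3 are similar
toy_freqs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_freqs)
toy_labels = toy_kmeans.labels_

print(toy_labels)               # one label per row
print(np.bincount(toy_labels))  # number of rows in each cluster
```

Counting the labels with `np.bincount` is a quick way to see how evenly the rows are spread across clusters.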

A new dataframe that includes the cluster value and the top 10 venues for each neighborhood is generated using the code cell below.

In [67]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

city_merged = city_data
city_merged = city_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

city_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Health Food Store,Asian Restaurant,Pub,Trail,Neighborhood,Yoga Studio,Dog Run,Diner,Discount Store,Distribution Center
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Bookstore,Furniture / Home Store,Bubble Tea Shop,Indian Restaurant,Spa,Japanese Restaurant
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,0,Fast Food Restaurant,Park,Brewery,Sandwich Place,Board Shop,Burrito Place,Restaurant,Italian Restaurant,Fish & Chips Shop,Steakhouse
3,M4M,East Toronto,Studio District,43.659526,-79.340923,2,Coffee Shop,Bakery,Gastropub,American Restaurant,Brewery,Café,Yoga Studio,Diner,Italian Restaurant,Bookstore
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,3,Dim Sum Restaurant,Park,Swim School,Bus Line,Yoga Studio,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Donut Shop
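
One detail of this step is worth checking: `DataFrame.join` performs a left join by default, so any neighbourhood that has no match in the venues table keeps its row but receives `NaN` in the joined columns. The sketch below shows this on invented data; `toy_city`, `toy_sorted`, and neighbourhood `C` are hypothetical.

```python
import pandas as pd

# Hypothetical location table with three neighbourhoods
toy_city = pd.DataFrame({
    'Postal Code': ['M1A', 'M1B', 'M1C'],
    'Neighbourhood': ['A', 'B', 'C'],
})

# Hypothetical cluster table that covers only two of them
toy_sorted = pd.DataFrame({
    'Cluster Labels': [0, 1],
    'Neighbourhood': ['A', 'B'],
    '1st Most Common Venue': ['Bakery', 'Park'],
})

# Left join on the neighbourhood name, as in the cell above
toy_merged = toy_city.join(toy_sorted.set_index('Neighbourhood'), on='Neighbourhood')

# Rows without a match have NaN cluster labels
unmatched = toy_merged[toy_merged['Cluster Labels'].isna()]
print(unmatched)
```

When unmatched rows exist, the `Cluster Labels` column is stored as floats, so such rows would need to be dropped or handled before the labels are used as integer indices.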


The clusters can then be visualized on the map as below:

In [68]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(city_merged['Latitude'], city_merged['Longitude'], city_merged['Neighbourhood'], city_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],  # labels run from 0 to kclusters-1
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
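
The colour assignment can be inspected without drawing the map. The sketch below is one way to build a label-to-colour lookup with the same matplotlib calls; the names `palette` and `cluster_colour` are hypothetical, and the value of `kclusters` is repeated here only to keep the example self-contained.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# Sample the rainbow colormap at kclusters evenly spaced points
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# k-means labels run from 0 to kclusters-1, so each label maps to one colour
cluster_colour = {label: palette[label] for label in range(kclusters)}

for label, hex_code in cluster_colour.items():
    print(label, hex_code)
```

Printing the lookup makes it easy to write a legend for the map, since folium does not add one automatically for circle markers.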