### Assignment week 3 - ***part 1***, IBM Data Science Specialization Capstone Project Course

#### Objectives:
###### > Scrape the Wikipedia page listing Toronto's Postcodes, Boroughs and Neighbourhoods
###### > Put the scraped data into a dataframe
###### > Clean the dataframe

In [1]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

In [2]:
#send a GET request to download the Wikipedia page
res = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [3]:
#check that the page was downloaded successfully (status code 200 means OK; any other code indicates a problem)
res.status_code

200

In [4]:
#print the first 1000 characters of the html
print(res.text[0:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XgTffQpAIDAAAINV6F4AAADY","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":932531537,"wgRevisionId":932531537,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario

In [5]:
#create an object to parse the text from the webpage
soup = BeautifulSoup(res.text, 'lxml')

In [6]:
#extract the table rows (<tr>) and check the length to see if the amount is reasonable considering the table on the original webpage
results = soup.find_all('tr')      
len(results)

293

In [7]:
#take a look at the results to observe the html-structure of the rows
results[0:5]

[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighborhood
 </th></tr>, <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M3A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td></tr>, <tr>
 <td>M4A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
 </td></tr>]

In [8]:
#access the table by searching the soup object for <table> tags with class 'wikitable sortable'
Wiki_table = soup.find_all('table',class_='wikitable sortable')

In [9]:
#check that exactly 1 table on the webpage matched the search
len(Wiki_table)

1

In [10]:
#check the element type
type(Wiki_table)

bs4.element.ResultSet

In [11]:
#extract the Tag element [0] because it's the only element in the Wiki_table
Wiki_table = Wiki_table[0]
type(Wiki_table)

bs4.element.Tag

In [12]:
#grab the column names from the Wiki_table
Toronto_columns = []

for row in Wiki_table.find_all('tr'):
    for col in row.find_all('th'): 
        Toronto_columns.append(col.text.strip())

print('the columns in the table are:', Toronto_columns)

the columns in the table are: ['Postcode', 'Borough', 'Neighborhood']


In [13]:
#put the scraped data into a list of rows (one [postcode, borough, neighbourhood] list per table row)
Toronto_array = []

for row in Wiki_table.find_all('tr'):
    row_postcode     = ''
    row_borough      = ''
    row_neighborhood = ''
    col_count        = 0
    
    for col in row.find_all('td'):
        col_count    += 1
        if col_count == 1: row_postcode = col.text.strip()
        if col_count == 2: row_borough  = col.text.strip()
        if col_count == 3: 
            row_neighborhood   = col.text.strip()

            Toronto_array.append ([row_postcode, row_borough,row_neighborhood])

In [14]:
#check the array
Toronto_array[0:10]

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned'],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', 'Downtown Toronto', "Queen's Park"]]
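
The row loop above can also be written with one list comprehension per row. The following is a minimal sketch on a hand-written HTML fragment; `sample_html` and its contents are illustrative assumptions, not the live Wikipedia markup, and the built-in `html.parser` is used here so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# hand-written stand-in for the Wikipedia table (assumption, not the live page)
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighborhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table', class_='wikitable sortable')

sample_rows = []
for tr in sample_table.find_all('tr'):
    # the header row contains only <th> cells, so it yields an empty list and is skipped
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if len(cells) == 3:
        sample_rows.append(cells)

print(sample_rows)
```

Checking `len(cells) == 3` plays the same role as the `col_count == 3` test in the loop above: only complete rows are kept.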

In [15]:
#put the data in the array into a pandas dataframe and take a look
Toronto_raw_df = pd.DataFrame(data=Toronto_array, columns = Toronto_columns)
Toronto_raw_df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [16]:
Toronto_raw_df.tail()

Unnamed: 0,Postcode,Borough,Neighborhood
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor
286,M9Z,Not assigned,Not assigned


In [17]:
Toronto_raw_df.shape

(287, 3)

In [18]:
#identify the Boroughs with a 'Not assigned' value and drop these rows
NA_boroughs = Toronto_raw_df.index[Toronto_raw_df['Borough'] == 'Not assigned']

#NA_boroughs already holds index labels, so it can be passed to drop() directly
Toronto_raw_df.drop(NA_boroughs, inplace=True)
Toronto_raw_df.reset_index(drop=True, inplace=True)

Toronto_raw_df.head(6)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Queen's Park,Not assigned
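
The same rows can be removed without collecting index labels first, by filtering with a boolean mask. A minimal sketch on a three-row stand-in (`sample_df` and `assigned_df` are illustrative names; the values are taken from the rows shown above):

```python
import pandas as pd

# small stand-in for the scraped data
sample_df = pd.DataFrame(
    [['M1A', 'Not assigned', 'Not assigned'],
     ['M3A', 'North York', 'Parkwoods'],
     ['M7A', "Queen's Park", 'Not assigned']],
    columns=['Postcode', 'Borough', 'Neighborhood'])

# keep only the rows whose Borough is assigned, then renumber the index
assigned_df = sample_df[sample_df['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned_df)
```

This returns a new dataframe instead of modifying the original in place, which makes it easier to re-run the cell.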


In [19]:
#replace the 'Not assigned' values in column Neighbourhood with Borough name from Borough column
NA_neighbourhoods = Toronto_raw_df.index[Toronto_raw_df['Neighborhood'] == 'Not assigned']

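#assign with .loc; chained indexing (df[col][idx] = ...) can trigger a SettingWithCopyWarning
for idx in NA_neighbourhoods:
    Toronto_raw_df.loc[idx, 'Neighborhood'] = Toronto_raw_df.loc[idx, 'Borough']
    
#take a copy so that later changes do not alter Toronto_raw_df
Toronto_cleaned_df = Toronto_raw_df.copy()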

In [20]:
Toronto_cleaned_df.head(6)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Queen's Park,Queen's Park
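
The replacement can also be done for all affected rows at once with a boolean mask. A minimal sketch, again on a small stand-in dataframe (`sample_df` and `na_mask` are illustrative names):

```python
import pandas as pd

sample_df = pd.DataFrame(
    [['M3A', 'North York', 'Parkwoods'],
     ['M7A', "Queen's Park", 'Not assigned']],
    columns=['Postcode', 'Borough', 'Neighborhood'])

# boolean mask marking the rows without a neighbourhood name
na_mask = sample_df['Neighborhood'] == 'Not assigned'

# copy the Borough name into those rows in a single .loc assignment
sample_df.loc[na_mask, 'Neighborhood'] = sample_df.loc[na_mask, 'Borough']
print(sample_df)
```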


In [21]:
#sort by Postcode to show that different Neighbourhoods can share the same Postcode
Toronto_sorted_df = Toronto_cleaned_df.sort_values(by='Postcode', ascending = True)
Toronto_sorted_df.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
22,M1C,Scarborough,Port Union
21,M1C,Scarborough,Rouge Hill
20,M1C,Scarborough,Highland Creek
32,M1E,Scarborough,Guildwood
33,M1E,Scarborough,Morningside
34,M1E,Scarborough,West Hill
38,M1G,Scarborough,Woburn
42,M1H,Scarborough,Cedarbrae


In [22]:
print('There are {} different neighbourhoods'.format(Toronto_sorted_df['Neighborhood'].unique().shape[0]))
print('and {} unique Postcodes'.format(Toronto_sorted_df['Postcode'].unique().shape[0]))

There are 207 different neighbourhoods
and 103 unique Postcodes


In [23]:
#aggregate the Neighbourhoods that share the same Postcode; the shape of the dataframe is supposed to be (103, 3) afterwards
Toronto_grouped_df = Toronto_sorted_df.groupby(['Postcode', 'Borough']).agg(', '.join)

Toronto_df = Toronto_grouped_df.reset_index()
Toronto_df.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Port Union, Rouge Hill, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Golden Mile, Oakridge, Clairlea"
8,M1M,Scarborough,"Cliffcrest, Scarborough Village West, Cliffside"
9,M1N,Scarborough,"Cliffside West, Birch Cliff"


In [24]:
Toronto_df.shape

(103, 3)
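
To make the aggregation step easier to follow, here is a minimal sketch on three rows. It selects the `Neighborhood` column explicitly before joining, so the join can only ever be applied to that column (`sample_df` and `grouped_df` are illustrative names):

```python
import pandas as pd

sample_df = pd.DataFrame(
    [['M1B', 'Scarborough', 'Rouge'],
     ['M1B', 'Scarborough', 'Malvern'],
     ['M1G', 'Scarborough', 'Woburn']],
    columns=['Postcode', 'Borough', 'Neighborhood'])

# select the Neighborhood column, then join the names within each (Postcode, Borough) group
grouped_df = (sample_df
              .groupby(['Postcode', 'Borough'])['Neighborhood']
              .agg(', '.join)
              .reset_index())
print(grouped_df)
```

The number of rows afterwards equals the number of distinct (Postcode, Borough) pairs, which is the check made with `.shape` above.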

### **End of part 1**

**--------------------------------------------------------------------------------------------------------------------------------------**

### Assignment week 3 - ***part 2***, IBM Data Science Specialization Capstone Project Course

#### Objectives:
###### > Get the geographical coordinates of each Toronto neighbourhood
###### > Add latitude and longitude data in columns to the existing dataframe

In [26]:
#download the geographical coordinates and put them in a dataframe
ll_url = 'https://cocl.us/Geospatial_data'

ll_df = pd.read_csv(ll_url)
ll_df.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [27]:
#sort the dataframe by Postal Code
ll_sorted_df = ll_df.sort_values(by='Postal Code', ascending = True)
ll_sorted_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [28]:
#add the coordinates to the dataframe using a join with the Postcode/Postal Code as key
Toronto_ll = Toronto_df.join(ll_df.set_index('Postal Code'), on='Postcode')
Toronto_ll.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Port Union, Rouge Hill, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Oakridge, Clairlea",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Scarborough Village West, Cliffside",43.716316,-79.239476
9,M1N,Scarborough,"Cliffside West, Birch Cliff",43.692657,-79.264848


In [29]:
Toronto_ll.shape

(103, 5)
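
The shape confirms that no rows were lost, but it does not show whether every postcode actually received coordinates. One possible check is a left merge with `indicator=True`. A minimal sketch; the third row (`M9Z` with borough `Hypothetical`) is made up to provoke a missing match:

```python
import pandas as pd

sample_toronto = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                               'Borough': ['Scarborough', 'Scarborough', 'Hypothetical']})
sample_ll = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                          'Latitude': [43.806686, 43.784535],
                          'Longitude': [-79.194353, -79.160497]})

# a left merge keeps every postcode; indicator=True adds a '_merge' column with the match status
merged = sample_toronto.merge(sample_ll, how='left',
                              left_on='Postcode', right_on='Postal Code',
                              indicator=True)

# postcodes without coordinates are flagged 'left_only'
missing = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```

An empty `missing` list on the real data would mean that every postcode was matched.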

### **End of part 2**

**--------------------------------------------------------------------------------------------------------------------------------------**

### Assignment week 3 - ***part 3***, IBM Data Science Specialization Capstone Project Course

#### Objectives:
###### > Explore and cluster the neighborhoods of Toronto
###### > Analyse the information
###### > Explain what you do, report any observations you make and generate maps to visualize the neighbourhoods and their clusters

In [30]:
#install Folium for visualization
!conda install -c conda-forge folium=0.5.0 --yes

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    branca-0.3.1               |             py_0          25 KB  conda-forge
    altair-4.0.0               |             py_0         606 KB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.1 MB

The following NEW packages will be 

In [31]:
import folium
import json
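#note: this import path is deprecated since pandas 1.0, where the function is available as pd.json_normalize
from pandas.io.json import json_normalize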
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

In [32]:
#define an instance of geocoder and get the coordinates of Toronto
address = 'Toronto, ON'
geolocator = Nominatim(user_agent = 'toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are: lat = {}, long = {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are: lat = 43.653963, long = -79.387207.


In [33]:
#create a map of Toronto and add a marker for each postcode area
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start = 11, tiles = 'Stamen Terrain')

for lat, lng, borough, neighbourhood in zip(Toronto_ll['Latitude'], Toronto_ll['Longitude'],Toronto_ll['Borough'], Toronto_ll['Neighborhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng], radius = 5, popup = label,
        color = 'green', fill = True, fill_color = 'yellow',
        fill_opacity = 0.6, parse_html = False).add_to(map_Toronto)

map_Toronto

In [34]:
#define the Foursquare credentials and API version
client_id = 'XT3Q5AEWFGANGCMJJXJTGPFKVIL1RRCUARX4XN0550QK3BLK'
client_secret = 'BB0L40BDLBJMJROBYGSLPRULZ31QICHJXAYTXX4LWJSJPDSE'
version = '20180605'

In [35]:
#list the boroughs in the dataframe
B = Toronto_ll['Borough'].unique()
B

array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       "Queen's Park", 'Mississauga', 'Etobicoke'], dtype=object)

In [36]:
#select the rows that belong to Central Toronto
CT_data=Toronto_ll[Toronto_ll['Borough']=='Central Toronto'].reset_index(drop = True)
CT_data

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
1,M4P,Central Toronto,Davisville North,43.712751,-79.390197
2,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
3,M4S,Central Toronto,Davisville,43.704324,-79.38879
4,M4T,Central Toronto,"Summerhill East, Moore Park",43.689574,-79.38316
5,M4V,Central Toronto,"South Hill, Summerhill West, Rathnelly, Forest...",43.686412,-79.400049
6,M5N,Central Toronto,Roselawn,43.711695,-79.416936
7,M5P,Central Toronto,"Forest Hill North, Forest Hill West",43.696948,-79.411307
8,M5R,Central Toronto,"Yorkville, The Annex, North Midtown",43.67271,-79.405678


In [39]:
#get the coordinates of Central Toronto
address = 'Central Toronto, Toronto'
geolocator = Nominatim(user_agent = 'toronto_explorer')
location = geolocator.geocode(address)
lat = location.latitude
long = location.longitude
print('The coordinates of Central Toronto are: {}, {}.'.format(lat, long))

The coordinates of Central Toronto are: 43.653963, -79.387207.


In [40]:
#visualize a map of Central Toronto's neighbourhoods
#centre the map on the coordinates geocoded above (lat, long)
map_Central_Toronto = folium.Map(height=700, width=500, location = [lat, long], zoom_start = 11.5, tiles = 'Stamen Terrain')

for lat, lng, label in zip(CT_data['Latitude'], CT_data['Longitude'], CT_data['Neighborhood']):
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker([lat, lng], radius = 5,
                       popup = label, color = 'blue',
                       fill = True, fill_color = 'yellow',
                       fill_opacity = 0.7, parse_html = False).add_to(map_Central_Toronto)

map_Central_Toronto

In [41]:
#explore the first neighbourhood in Central Toronto
print('The 1st neighbourhood in Central Toronto is:', CT_data.loc[0, 'Neighborhood'])

The 1st neighbourhood in Central Toronto is: Lawrence Park


In [42]:
#get the coordinates of Lawrence Park
neighbourhood_latitude = CT_data.loc[0, 'Latitude']
neighbourhood_longitude = CT_data.loc[0, 'Longitude']
neighbourhood_name = CT_data.loc[0, 'Neighborhood']
print('The coordinates of {} are: lat {} and long {}'.format(neighbourhood_name, neighbourhood_latitude, neighbourhood_longitude))

The coordinates of Lawrence Park are: lat 43.7280205 and long -79.3887901


In [43]:
#get the venues that are in Lawrence Park within a radius of 750m
radius = 750
limit = 100

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(client_id, client_secret, neighbourhood_latitude, neighbourhood_longitude, version, radius, limit)
url

'https://api.foursquare.com/v2/venues/explore?client_id=XT3Q5AEWFGANGCMJJXJTGPFKVIL1RRCUARX4XN0550QK3BLK&client_secret=BB0L40BDLBJMJROBYGSLPRULZ31QICHJXAYTXX4LWJSJPDSE&ll=43.7280205,-79.3887901&v=20180605&radius=750&limit=100'

In [44]:
#send the GET request and parse the JSON response
results = requests.get(url).json()

In [45]:
#extract the categories of the venues
def get_category_type(row):
    try:
        categories_list=row['categories']
    except KeyError:
        categories_list=row['venue.categories']
    
    if len(categories_list)==0:
        return None
    else:
        return categories_list[0]['name']

In [46]:
#get the information we need from the JSON response and change it into a dataframe
venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues)

#columns of interest (the selection is not applied here, so the output below shows all columns)
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,count,items,referralId,categories,id,address,cc,city,country,crossStreet,distance,formattedAddress,labeledLatLngs,lat,lng,postalCode,state,name,count.1,groups
0,0,"[{'summary': 'This spot is popular', 'type': '...",e-0-50e6da19e4b0d8a78a0e9794-0,Park,50e6da19e4b0d8a78a0e9794,3055 Yonge Street,CA,Toronto,Canada,Lawrence Avenue East,465,"[3055 Yonge Street (Lawrence Avenue East), Tor...","[{'label': 'display', 'lat': 43.72696303913755...",43.726963,-79.394382,,ON,Lawrence Park Ravine,0,[]
1,0,"[{'summary': 'This spot is popular', 'type': '...",e-0-5968d757a6031c5daae7f8c5-1,Photography Studio,5968d757a6031c5daae7f8c5,"1655 Dupont st., 204A",CA,Toronto,Canada,,268,"[1655 Dupont st., 204A, Toronto ON M6P 3S9, Ca...","[{'label': 'display', 'lat': 43.73042924043832...",43.730429,-79.388767,M6P 3S9,ON,The Photo School – Toronto,0,[]
2,0,"[{'summary': 'This spot is popular', 'type': '...",e-0-50a14c44e4b0bbf777f5152c-2,Coffee Shop,50a14c44e4b0bbf777f5152c,2275 Bayview Ave,CA,Toronto,Canada,,746,"[2275 Bayview Ave, Toronto ON M4N 3M6, Canada]","[{'label': 'display', 'lat': 43.72732441416303...",43.727324,-79.379563,M4N 3M6,ON,Tim Hortons,0,[]
3,0,"[{'summary': 'This spot is popular', 'type': '...",e-0-5082ef77e4b0a7491cf7b022-3,Swim School,5082ef77e4b0a7491cf7b022,,CA,,Canada,,480,[Canada],"[{'label': 'display', 'lat': 43.72853205765438...",43.728532,-79.38286,,,Zodiac Swim School,0,[]
4,0,"[{'summary': 'This spot is popular', 'type': '...",e-0-50ed9da8e4b081eabee12672-4,Bus Line,50ed9da8e4b081eabee12672,,CA,Toronto,Canada,,481,"[Toronto ON, Canada]","[{'label': 'display', 'lat': 43.72802605799448...",43.728026,-79.382805,,ON,TTC Bus #162 - Lawrence-Donway,0,[]


In [47]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

7 venues were returned by Foursquare.
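
The flattening step can be tried without calling the API. The sketch below uses a mock of the `items` list; its structure is inferred from the keys used above and the second venue is made up to show the empty-category case. It assumes pandas 1.0 or later, where `pd.json_normalize` is available, and it applies the column selection so that only the four columns of interest remain.

```python
import pandas as pd

# mock of the 'items' list in the response (illustrative values, not real API output)
mock_items = [
    {'venue': {'name': 'Lawrence Park Ravine',
               'categories': [{'name': 'Park'}],
               'location': {'lat': 43.726963, 'lng': -79.394382}}},
    {'venue': {'name': 'Venue without category',
               'categories': [],
               'location': {'lat': 43.73, 'lng': -79.39}}},
]

# nested dictionaries become dotted column names such as 'venue.location.lat'
flat = pd.json_normalize(mock_items)

# keep only the columns of interest
wanted = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
flat = flat.loc[:, wanted]

# replace each category list by the name of its first entry (missing if the list is empty)
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)

# strip the 'venue.' and 'venue.location.' prefixes from the column names
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat)
```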


In [48]:
#create a function that repeats the same process for all the neighbourhoods in Central Toronto
def getNearbyVenues(names, latitudes, longitudes, radius=750, limit=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            client_id, 
            client_secret, 
            version, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    Central_Toronto_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    Central_Toronto_venues.columns = ['Neighbourhood', 
                  'Neighbourhood lat', 
                  'Neighbourhood lng', 
                  'Venue', 
                  'Venue lat', 
                  'Venue lng', 
                  'Venue Category']
    
    return(Central_Toronto_venues)
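
Two details of the function above could be made more robust; the sketch below shows one possible way, under stated assumptions. `build_explore_url` builds the query string with `urllib.parse.urlencode` and reads the credentials from environment variables (the names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical). `first_category` returns `None` for a venue with an empty category list, where `v['venue']['categories'][0]` would raise an `IndexError`. Both helper names are illustrative and not part of the original notebook.

```python
import os
from urllib.parse import urlencode

def build_explore_url(lat, lng, radius=750, limit=100, version='20180605'):
    """Return a venues/explore URL; credentials are read from the environment."""
    params = {
        # hypothetical environment variable names; empty string if they are not set
        'client_id': os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        'client_secret': os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-encodes the values (the comma in 'll' becomes %2C)
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

def first_category(venue):
    """Return the first category name of a venue dict, or None if it has none."""
    categories = venue.get('categories') or []
    return categories[0]['name'] if categories else None

example_url = build_explore_url(43.7280205, -79.3887901)
print(example_url)
print(first_category({'name': 'Sherwood Park', 'categories': [{'name': 'Park'}]}))
print(first_category({'name': 'No category', 'categories': []}))
```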

In [49]:
#create a new dataframe and run the above function on each neighbourhood
Central_Toronto_venues = getNearbyVenues(names=CT_data['Neighborhood'],
                                   latitudes=CT_data['Latitude'],
                                   longitudes=CT_data['Longitude']
                                  )

Lawrence Park
Davisville North
North Toronto West
Davisville
Summerhill East, Moore Park
South Hill, Summerhill West, Rathnelly, Forest Hill SE, Deer Park
Roselawn
Forest Hill North, Forest Hill West
Yorkville, The Annex, North Midtown


In [50]:
#check the dataframe
print('Shape Central Toronto venues dataframe is:', Central_Toronto_venues.shape)
Central_Toronto_venues.head(10)

Shape Central Toronto venues dataframe is: (317, 7)


Unnamed: 0,Neighbourhood,Neighbourhood lat,Neighbourhood lng,Venue,Venue lat,Venue lng,Venue Category
0,Lawrence Park,43.72802,-79.38879,Lawrence Park Ravine,43.726963,-79.394382,Park
1,Lawrence Park,43.72802,-79.38879,The Photo School – Toronto,43.730429,-79.388767,Photography Studio
2,Lawrence Park,43.72802,-79.38879,Tim Hortons,43.727324,-79.379563,Coffee Shop
3,Lawrence Park,43.72802,-79.38879,Zodiac Swim School,43.728532,-79.38286,Swim School
4,Lawrence Park,43.72802,-79.38879,TTC Bus #162 - Lawrence-Donway,43.728026,-79.382805,Bus Line
5,Lawrence Park,43.72802,-79.38879,Ingledew Invites,43.724715,-79.395275,Business Service
6,Lawrence Park,43.72802,-79.38879,ZayZay Shop,43.727631,-79.396799,Business Service
7,Davisville North,43.712751,-79.390197,Sherwood Park,43.716551,-79.387776,Park
8,Davisville North,43.712751,-79.390197,Summerhill Market North,43.715499,-79.392881,Food & Drink Shop
9,Davisville North,43.712751,-79.390197,Istanbul Cafe & Espresso Bar,43.707891,-79.393049,Café


In [51]:
#check how many venues were returned for each neighbourhood
Central_Toronto_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood lat,Neighbourhood lng,Venue,Venue lat,Venue lng,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Davisville,68,68,68,68,68,68
Davisville North,31,31,31,31,31,31
"Forest Hill North, Forest Hill West",7,7,7,7,7,7
Lawrence Park,7,7,7,7,7,7
North Toronto West,43,43,43,43,43,43
Roselawn,5,5,5,5,5,5
"South Hill, Summerhill West, Rathnelly, Forest Hill SE, Deer Park",60,60,60,60,60,60
"Summerhill East, Moore Park",14,14,14,14,14,14
"Yorkville, The Annex, North Midtown",82,82,82,82,82,82


In [52]:
#show how many unique categories there are in the returned venues
print('There are {} unique categories.'.format(len(Central_Toronto_venues['Venue Category'].unique())))

There are 112 unique categories.


In [53]:
#define dataframe for one-hot encoding
CT_onehot = pd.get_dummies(Central_Toronto_venues[['Venue Category']],prefix="", prefix_sep="")

In [54]:
#add neighbourhood column back to dataframe
CT_onehot['Neighbourhood'] = Central_Toronto_venues['Neighbourhood']

In [55]:
#move neighbourhood column to first column
fixed_columns = [CT_onehot.columns[-1]]+list(CT_onehot.columns[:-1])
CT_onehot = CT_onehot[fixed_columns]

print(CT_onehot.shape)

(317, 113)


In [56]:
#take a look at the one-hot encoded dataframe
CT_onehot.head(10)

Unnamed: 0,Neighbourhood,American Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Breakfast Spot,Brewery,...,Tennis Court,Thai Restaurant,Theater,Toy / Game Store,Track,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint,Yoga Studio
0,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Davisville North,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Davisville North,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Davisville North,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [57]:
#group the rows by neighbourhood, taking the mean frequency of occurrence of each category
CT_grouped = CT_onehot.groupby('Neighbourhood').mean().reset_index()

In [58]:
#confirm the new size
CT_grouped.shape

(9, 113)
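
Why the mean of the one-hot columns gives a frequency is easiest to see on a tiny example. A minimal sketch with made-up venues in two neighbourhoods `A` and `B` (`dtype=int` is set so that the columns hold 0/1 regardless of the pandas version):

```python
import pandas as pd

sample_venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Coffee Shop', 'Gym', 'Park', 'Gym'],
})

# one 0/1 column per category
onehot = pd.get_dummies(sample_venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', sample_venues['Neighbourhood'])

# the mean of a 0/1 column within a group is the share of venues in that category
freq = onehot.groupby('Neighbourhood').mean().reset_index()
print(freq)
```

Neighbourhood `A` has four venues, two of which are parks, so its `Park` frequency is 0.5. The frequencies in each row add up to 1.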

In [59]:
#print each neighbourhood along with the top 5 most common venues
nr_top_venues = 5

for nbh in CT_grouped['Neighbourhood']:
    print('*'+nbh+'*')
    temp = CT_grouped[CT_grouped['Neighbourhood']==nbh].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq':2})

    print(temp.sort_values('freq', ascending = False).reset_index(drop=True).head(nr_top_venues))
    print('\n')
                                                                                  

*Davisville*
                venue  freq
0  Italian Restaurant  0.09
1         Coffee Shop  0.07
2         Pizza Place  0.04
3      Sandwich Place  0.04
4                 Gym  0.04


*Davisville North*
         venue  freq
0  Coffee Shop  0.10
1         Café  0.06
2          Gym  0.06
3         Park  0.06
4  Pizza Place  0.06


*Forest Hill North, Forest Hill West*
              venue  freq
0  Asian Restaurant  0.14
1             Trail  0.14
2              Park  0.14
3  Sushi Restaurant  0.14
4          Bus Line  0.14


*Lawrence Park*
                venue  freq
0    Business Service  0.29
1  Photography Studio  0.14
2         Coffee Shop  0.14
3                Park  0.14
4            Bus Line  0.14


*North Toronto West*
                 venue  freq
0          Coffee Shop  0.09
1  Sporting Goods Shop  0.09
2                 Café  0.07
3         Skating Rink  0.05
4   Italian Restaurant  0.05


*Roselawn*
                 venue  freq
0           Playground   0.4
1         Home Service

In [60]:
#put the above in a dataframe

##write a function to sort the venues in descending order
def return_most_common_venues(row, nr_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    return row_categories_sorted.index.values[0:nr_top_venues]
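
A quick check of the function on a single made-up row (`sample_row` is illustrative). The first entry is the neighbourhood name, which `row.iloc[1:]` skips:

```python
import pandas as pd

def return_most_common_venues(row, nr_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:nr_top_venues]

# made-up frequencies for one neighbourhood
sample_row = pd.Series({'Neighbourhood': 'Sample', 'Park': 0.50, 'Gym': 0.10, 'Coffee Shop': 0.40})

top_two = return_most_common_venues(sample_row, 2)
print(list(top_two))
```

Categories with equal frequencies may come out in any order, which is likely why tied venues can appear in a different order in the dataframe than in the printed top-5 lists.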

In [61]:
##create a new dataframe and display the top 10 venues for each neighbourhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = CT_grouped['Neighbourhood']

for ind in np.arange(CT_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(CT_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Davisville,Italian Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Café,Dessert Shop,Gym,Fast Food Restaurant,Restaurant,Indian Restaurant
1,Davisville North,Coffee Shop,Café,Park,Pizza Place,Gym,Sushi Restaurant,Dessert Shop,Pharmacy,Food & Drink Shop,Sandwich Place
2,"Forest Hill North, Forest Hill West",Sushi Restaurant,Dry Cleaner,Bus Line,Jewelry Store,Park,Trail,Asian Restaurant,Falafel Restaurant,Farmers Market,Gaming Cafe
3,Lawrence Park,Business Service,Photography Studio,Park,Coffee Shop,Bus Line,Swim School,Gaming Cafe,Dry Cleaner,Falafel Restaurant,Farmers Market
4,North Toronto West,Coffee Shop,Sporting Goods Shop,Café,Restaurant,Bakery,Skating Rink,Clothing Store,Diner,Italian Restaurant,Wings Joint
5,Roselawn,Playground,Home Service,Garden,Business Service,Gym Pool,Gym / Fitness Center,Dog Run,History Museum,Dry Cleaner,Falafel Restaurant
6,"South Hill, Summerhill West, Rathnelly, Forest...",Coffee Shop,Sushi Restaurant,Italian Restaurant,Pharmacy,Restaurant,Sandwich Place,Skating Rink,Gym,Pub,Pizza Place
7,"Summerhill East, Moore Park",Park,Grocery Store,Japanese Restaurant,Sandwich Place,Candy Store,Gym / Fitness Center,Gym,Café,Playground,Thai Restaurant
8,"Yorkville, The Annex, North Midtown",Coffee Shop,Pizza Place,Park,Sandwich Place,Vegetarian / Vegan Restaurant,Pub,Gym,Grocery Store,History Museum,Diner
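
The column labels above rely on an exception to fall back to the `th` suffix, which works for 1 to 10. A small helper can produce the suffix directly and also handles numbers such as 11 or 21. A sketch only; `ordinal` and `example_columns` are illustrative names that do not appear in the notebook:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# build the same column labels as above for the top 10 venues
example_columns = ['Neighbourhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(example_columns[:4])
```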


In [62]:
#clustering the neighbourhoods using K-means
kclusters = 5
CT_grouped_clustering = CT_grouped.drop('Neighbourhood', axis=1)

kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(CT_grouped_clustering)
kmeans.labels_[0:10]

array([3, 3, 4, 2, 3, 0, 3, 1, 3], dtype=int32)
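
To illustrate what the clustering step does with the frequency vectors, here is a minimal NumPy sketch of the basic k-means iteration: assign each row to its nearest centre, then move each centre to the mean of its rows. It is an illustration on made-up data, not scikit-learn's implementation, which among other things chooses the starting centres more carefully. The function name `lloyd_kmeans` and the toy matrix `X` are assumptions.

```python
import numpy as np

def lloyd_kmeans(X, init_centres, n_iter=10):
    """Basic k-means iteration: assign points to the nearest centre, then update the centres."""
    centres = np.asarray(init_centres, dtype=float).copy()
    for _ in range(n_iter):
        # squared Euclidean distance of every point to every centre, shape (n_points, k)
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centres)):
            members = X[labels == k]
            if len(members) > 0:  # leave the centre of an empty cluster unchanged
                centres[k] = members.mean(axis=0)
    return labels, centres

# toy frequency vectors: two "park-heavy" rows and two "coffee-heavy" rows
X = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])

# start from the first and third row as initial centres
labels, centres = lloyd_kmeans(X, init_centres=X[[0, 2]])
print(labels)
print(centres)
```

Rows with similar venue profiles end up with the same label, which is how the labels in the array above should be read. With nine neighbourhoods and five clusters, several clusters necessarily contain only a single neighbourhood.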

In [63]:
#create a new dataframe that includes the cluster as well as the top 10 venues for each neighbourhood
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

CT_merged = CT_data

In [64]:
#join the cluster labels and top venues onto CT_data, which holds the latitude/longitude of each neighbourhood
CT_merged = CT_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on = 'Neighborhood')
CT_merged

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2,Business Service,Photography Studio,Park,Coffee Shop,Bus Line,Swim School,Gaming Cafe,Dry Cleaner,Falafel Restaurant,Farmers Market
1,M4P,Central Toronto,Davisville North,43.712751,-79.390197,3,Coffee Shop,Café,Park,Pizza Place,Gym,Sushi Restaurant,Dessert Shop,Pharmacy,Food & Drink Shop,Sandwich Place
2,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,3,Coffee Shop,Sporting Goods Shop,Café,Restaurant,Bakery,Skating Rink,Clothing Store,Diner,Italian Restaurant,Wings Joint
3,M4S,Central Toronto,Davisville,43.704324,-79.38879,3,Italian Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Café,Dessert Shop,Gym,Fast Food Restaurant,Restaurant,Indian Restaurant
4,M4T,Central Toronto,"Summerhill East, Moore Park",43.689574,-79.38316,1,Park,Grocery Store,Japanese Restaurant,Sandwich Place,Candy Store,Gym / Fitness Center,Gym,Café,Playground,Thai Restaurant
5,M4V,Central Toronto,"South Hill, Summerhill West, Rathnelly, Forest...",43.686412,-79.400049,3,Coffee Shop,Sushi Restaurant,Italian Restaurant,Pharmacy,Restaurant,Sandwich Place,Skating Rink,Gym,Pub,Pizza Place
6,M5N,Central Toronto,Roselawn,43.711695,-79.416936,0,Playground,Home Service,Garden,Business Service,Gym Pool,Gym / Fitness Center,Dog Run,History Museum,Dry Cleaner,Falafel Restaurant
7,M5P,Central Toronto,"Forest Hill North, Forest Hill West",43.696948,-79.411307,4,Sushi Restaurant,Dry Cleaner,Bus Line,Jewelry Store,Park,Trail,Asian Restaurant,Falafel Restaurant,Farmers Market,Gaming Cafe
8,M5R,Central Toronto,"Yorkville, The Annex, North Midtown",43.67271,-79.405678,3,Coffee Shop,Pizza Place,Park,Sandwich Place,Vegetarian / Vegan Restaurant,Pub,Gym,Grocery Store,History Museum,Diner


In [65]:
#Visualize the resulting clusters
map_CT_clusters =folium.Map(location = [latitude, longitude], zoom_start = 12, width = 700, tiles ='Stamen Toner')

x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range (kclusters)]
colors_array = cm.gist_rainbow(np.linspace(0, 1, len(ys)))
gist_rainbow = [colors.rgb2hex(i)for i in colors_array]

#add one marker per neighbourhood, coloured by its cluster label
for lat, lon, poi, cluster in zip(CT_merged['Latitude'], CT_merged['Longitude'], CT_merged['Neighborhood'], CT_merged['Cluster Labels']):
    label = folium.Popup(str(poi)+' Cluster '+str(cluster), parse_html=True)
    folium.CircleMarker([lat, lon], radius=6, popup=label, color=gist_rainbow[cluster], fill=True, fill_color=gist_rainbow[cluster], fill_opacity=0.6).add_to(map_CT_clusters)
    
map_CT_clusters

In [66]:
#examine the clusters one by one, starting with cluster 0
CT_merged.loc[CT_merged['Cluster Labels']==0, CT_merged.columns[[1]+list(range(5, CT_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Central Toronto,0,Playground,Home Service,Garden,Business Service,Gym Pool,Gym / Fitness Center,Dog Run,History Museum,Dry Cleaner,Falafel Restaurant


In [67]:
CT_merged.loc[CT_merged['Cluster Labels']==1, CT_merged.columns[[1]+list(range(5, CT_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,1,Park,Grocery Store,Japanese Restaurant,Sandwich Place,Candy Store,Gym / Fitness Center,Gym,Café,Playground,Thai Restaurant


In [68]:
CT_merged.loc[CT_merged['Cluster Labels']==2, CT_merged.columns[[1]+list(range(5, CT_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central Toronto,2,Business Service,Photography Studio,Park,Coffee Shop,Bus Line,Swim School,Gaming Cafe,Dry Cleaner,Falafel Restaurant,Farmers Market


In [69]:
CT_merged.loc[CT_merged['Cluster Labels']==3, CT_merged.columns[[1]+list(range(5, CT_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Central Toronto,3,Coffee Shop,Café,Park,Pizza Place,Gym,Sushi Restaurant,Dessert Shop,Pharmacy,Food & Drink Shop,Sandwich Place
2,Central Toronto,3,Coffee Shop,Sporting Goods Shop,Café,Restaurant,Bakery,Skating Rink,Clothing Store,Diner,Italian Restaurant,Wings Joint
3,Central Toronto,3,Italian Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Café,Dessert Shop,Gym,Fast Food Restaurant,Restaurant,Indian Restaurant
5,Central Toronto,3,Coffee Shop,Sushi Restaurant,Italian Restaurant,Pharmacy,Restaurant,Sandwich Place,Skating Rink,Gym,Pub,Pizza Place
8,Central Toronto,3,Coffee Shop,Pizza Place,Park,Sandwich Place,Vegetarian / Vegan Restaurant,Pub,Gym,Grocery Store,History Museum,Diner


In [70]:
CT_merged.loc[CT_merged['Cluster Labels']==4, CT_merged.columns[[1]+list(range(5, CT_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,Central Toronto,4,Sushi Restaurant,Dry Cleaner,Bus Line,Jewelry Store,Park,Trail,Asian Restaurant,Falafel Restaurant,Farmers Market,Gaming Cafe


### **End of part 3**

## Thanks for reviewing!