# IBM Applied Data Science Capstone Project

## Segmenting and Clustering Neighborhoods in Toronto
### <font color='lightblue'> Peer Graded Assignment (Week 3) </font>

Explore, segment, and cluster the neighborhoods in the city of Toronto based on postal code and borough information.

<div class="alert alert-block alert-info">
    <br>
    <b><font color='red'> Part 1 -  Webscraping: </font>  </b> Get Toronto postal code data from Wikipedia portal
    <br>
    <br>
</div>

Toronto postal codes start with "M", so the data comes from this page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [1]:
# Import necessary packages
import pandas as pd
import requests
from bs4 import BeautifulSoup

#### 1.1 Download the html data of the Wikipedia page

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_data = requests.get(url).text

#### 1.2 Create soup object with the extracted html data

In [3]:
soup = BeautifulSoup(html_data, 'html.parser')

#### 1.3 Get index of the postal code table

In [4]:
# Find the index of the postal code table:
# loop through the tables and stop at the first one whose text contains the "M1A" postal code
tables = soup.find_all('table')
for i, t in enumerate(tables):
    if "M1A" in str(t):
        table_index = i
        break
        
# Index of the postal codes table
print('Index of postal codes table: ', table_index)

Index of postal codes table:  0
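The lookup above leaves `table_index` undefined when no table contains "M1A", which only shows up later as a `NameError`. Here is a minimal sketch of the same first-match search with an explicit failure. The helper name `find_table_index` is hypothetical, and plain strings stand in for the parsed tables so the example runs without downloading anything.

```python
def find_table_index(tables, marker="M1A"):
    """Return the index of the first table whose text contains `marker`."""
    for i, table in enumerate(tables):
        if marker in str(table):
            return i
    raise ValueError(f"no table contains {marker!r}")


# Plain strings stand in for the BeautifulSoup table objects (illustrative only).
sample_tables = [
    "<table><td>M1A Not assigned</td></table>",
    "<table><td>Canadian postal codes</td></table>",
]
print(find_table_index(sample_tables))
```

In the notebook the call would take the list returned by `soup.find_all('table')` instead of `sample_tables`.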


#### 1.4 Extract postal codes from html table

In [5]:
table_contents=[]
for row in tables[table_index].find_all('td'):  # each "row" here is one table cell
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)
df_postal_codes = pd.DataFrame(table_contents)
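The chained `split`/`strip`/`replace` calls above are easier to follow on a single piece of text. The sketch below assumes each cell reads like `Borough(Neighborhood / Neighborhood)`, which is what the parsing code implies. The function name `split_cell` is hypothetical, and it does not cover the irregular cells that section 1.5 cleans up by hand.

```python
def split_cell(text):
    """Split 'Borough(Neighborhood / Neighborhood)' into (borough, neighborhoods)."""
    borough, _, rest = text.partition("(")
    # Drop the closing parenthesis and turn ' / ' separators into ', '.
    neighborhoods = rest.rstrip(")").replace(" / ", ", ").strip()
    return borough.strip(), neighborhoods


print(split_cell("North York(Parkwoods)"))
print(split_cell("North York(Lawrence Manor / Lawrence Heights)"))
```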

#### 1.5 Clean and verify data

In [6]:
df_postal_codes['Borough']=df_postal_codes['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

In [7]:
# Validate data 
df_postal_codes.query('Borough == "" | Neighborhood == "" | Borough == "Not assigned" | Neighborhood == "Not assigned"')

Unnamed: 0,PostalCode,Borough,Neighborhood


In [8]:
df_test_codes = pd.DataFrame({'PostalCode':['M5G', 'M2H', 'M4B', 'M1J', 'M4G', 'M4M', 'M1R', 'M9V', 'M9L', 'M5V', 'M1B', 'M5A']})
df_test_codes.merge(df_postal_codes, on='PostalCode', how='left')

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Parkview Hill, Woodbine Gardens"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Wexford, Maryvale"
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har..."


#### <font color='red'> ..... Part 1 output </font>

In [9]:
df_postal_codes.shape

(103, 3)

#### <font color='red'> ***** End of Part 1 *****</font>

<div class="alert alert-block alert-info">
    <br>
    <b><font color='red'> Part 2 - Latitude and Longitude of the postal codes: </font>  </b> Get coordinates from csv and add to our postal codes data frame
    <br>
    <br>
</div>

#### 2.1 Download the coordinates CSV file from the Coursera link

In [10]:
!wget -q -O 'Geospatial_Coordinates.csv' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv
print('Geospatial Coordinates downloaded!')

Geospatial Coordinates downloaded!


#### 2.2 Read coordinates to a data frame

In [11]:
geo_coordinates = pd.read_csv('Geospatial_Coordinates.csv')

#### 2.3 Rename the "Postal Code" column to "PostalCode" so it matches the column name in our postal codes data frame

In [12]:
geo_coordinates.rename(columns={'Postal Code' : "PostalCode"}, inplace=True)
geo_coordinates.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### 2.4 Join postal codes and coordinates data frames into a new data frame

In [13]:
# Join both data frames
df_postal_codes_with_ll = df_postal_codes.merge(geo_coordinates, on='PostalCode', how='left')
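A left join keeps every postal code even when the coordinates file has no row for it, so missing coordinates can slip through unnoticed. Here is a small sketch of how the `validate` and `indicator` options of `merge` can surface that. The two frames are toy data, and `M0X` is a made-up code added only to force a miss.

```python
import pandas as pd

codes = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M0X"],  # M0X is invented for the example
    "Borough": ["North York", "North York", "Made-up"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises MergeError if a postal code is duplicated on either side;
# indicator adds a _merge column telling where each row was found.
merged = codes.merge(coords, on="PostalCode", how="left",
                     validate="one_to_one", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postal code received coordinates.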

#### <font color='red'> ..... Part 2 output </font>

In [14]:
# Test the data and verify that it matches the sample in the assignment
df_test_codes = pd.DataFrame({'PostalCode':['M5G', 'M2H', 'M4B', 'M1J', 'M4G', 'M4M', 'M1R', 'M9V', 'M9L', 'M5V', 'M1B', 'M5A']})
df_test_codes.merge(df_postal_codes_with_ll, on='PostalCode', how='left')

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,"Wexford, Maryvale",43.750071,-79.295849
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442


#### <font color='red'> ***** End of Part 2 *****</font>

<div class="alert alert-block alert-info">
    <br>
    <b><font color='red'> Part 3 - Explore and cluster the neighborhoods in Toronto. </font>  </b> 
    <br>
    <br>
</div>

In [15]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library

print('Mapping Libraries imported.')

Mapping Libraries imported.


#### 3.1 Get longitude and latitude of Toronto 

In [16]:
# Get the latitude and longitude of Toronto
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="to_explorer4")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of', address, 'are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, ON, Canada are 43.6534817, -79.3839347.
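geopy's `geocode` returns `None` when it finds no match, in which case `location.latitude` fails with an `AttributeError`. The sketch below wraps the lookup so the failure is explicit. `coordinates_for`, `fake_geocode` and the `Location` tuple are hypothetical stand-ins so the example runs offline; in the notebook the first argument would be `geolocator.geocode`.

```python
from collections import namedtuple

# Stand-in for the object geopy returns; only the two attributes used here.
Location = namedtuple("Location", ["latitude", "longitude"])


def coordinates_for(geocode, address):
    """Call a geocode function and fail clearly when it finds nothing."""
    location = geocode(address)
    if location is None:
        raise LookupError(f"could not geocode {address!r}")
    return location.latitude, location.longitude


def fake_geocode(address):
    """Offline replacement for geolocator.geocode (illustrative only)."""
    known = {"Toronto, ON, Canada": Location(43.6534817, -79.3839347)}
    return known.get(address)


print(coordinates_for(fake_geocode, "Toronto, ON, Canada"))
```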


#### 3.2 Create Toronto map and add all the neighborhoods based on coordinates

In [17]:

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_postal_codes_with_ll['Latitude'], df_postal_codes_with_ll['Longitude'], df_postal_codes_with_ll['Borough'], df_postal_codes_with_ll['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### 3.3 Let's further explore a Borough

#### Select the first Borough in the data

In [18]:
# Let's explore the first Borough in the Toronto postal codes data
df_postal_codes_with_ll.loc[0]

PostalCode             M3A
Borough         North York
Neighborhood     Parkwoods
Latitude         43.753259
Longitude       -79.329656
Name: 0, dtype: object

#### 3.4 Get coordinates for North York Borough

In [19]:
# Get the latitude and longitude of the Borough
address = 'North York, Toronto, ON, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of', address, 'are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York, Toronto, ON, Canada are 43.7543263, -79.44911696639593.


#### 3.5 Filter the data so that only North York coordinates are selected

In [20]:
# get north york data
north_york_data = df_postal_codes_with_ll[df_postal_codes_with_ll['Borough'] == 'North York'].reset_index(drop=True)
north_york_data.head(30)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
3,M3B,North York,Don Mills North,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073
5,M3C,North York,Don Mills South,43.7259,-79.340923
6,M2H,North York,Hillcrest Village,43.803762,-79.363452
7,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.754328,-79.442259
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
9,M3J,North York,"Northwood Park, York University",43.76798,-79.487262


#### 3.6 Draw the Toronto map again using North York coordinates and layer the North York neighborhoods in the map

In [21]:

# create map of North York neighborhoods using latitude and longitude values
map_nort_york = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(north_york_data['Latitude'], north_york_data['Longitude'], north_york_data['Borough'], north_york_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_nort_york)  
    
map_nort_york

#### 3.7 Further exploration can be done using the Foursquare API

In [22]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
ACCESS_TOKEN = '' # your FourSquare Access Token
VERSION = '20180604'
LIMIT = 30

#### 3.7.1 Let's explore the first neighborhood


Get the neighborhood's name.


In [23]:
north_york_data.loc[0, 'Neighborhood']

'Parkwoods'

Get the neighborhood's latitude and longitude values.


In [24]:
neighborhood_latitude = north_york_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = north_york_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = north_york_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


#### Now, let's get the top 100 venues that are in Parkwoods within a radius of 500 meters.


First, let's create the GET request URL and store it in **url**.


In [50]:
# Build the request URL for the Foursquare explore endpoint
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
#url # display URL
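Formatting the query string by hand works, but it gets hard to read and does no escaping. As an alternative sketch, the same URL can be assembled from a dictionary with `urllib.parse.urlencode` from the standard library. The function name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in ll readable instead of encoding it as %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


example_url = build_explore_url("ID", "SECRET", "20180604",
                                43.7532586, -79.3296565, 500, 100)
print(example_url)
```

Another option is to pass the same dictionary to `requests.get` through its `params` argument and let it build the query string.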




Send the GET request and examine the results.


In [26]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '60a823af0dd7e82825c2af90'},
  'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 43.757758604500005,
    'lng': -79.32343823984928},
   'sw': {'lat': 43.7487585955, 'lng': -79.33587476015072}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
       'name': 'Brookbanks Park',
       'location': {'address': 'Toronto',
        'lat': 43.751976046055574,
        'lng': -79.33214044722958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.751976046055574,
          'lng': -79.33214044722958}],
        'distance': 245,
        'cc': 'CA',
        'c

From the Foursquare lab in the previous module, we know that all the information is in the _items_ key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.


In [27]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # the flattened dataframe uses the 'venue.categories' key instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a _pandas_ dataframe.


In [28]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,KFC,Fast Food Restaurant,43.754387,-79.333021
2,Variety Store,Food & Drink Shop,43.751974,-79.333114
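To see where the dotted column names such as `venue.location.lat` come from, here is a small offline sketch of what `pd.json_normalize` does with nested dictionaries. The `sample_items` list is a hand-trimmed imitation of the `items` list, keeping only the fields used above; it is not the full API response.

```python
import pandas as pd

# Trimmed imitation of results['response']['groups'][0]['items'] (illustrative only).
sample_items = [
    {"venue": {"name": "Brookbanks Park",
               "categories": [{"name": "Park"}],
               "location": {"lat": 43.751976, "lng": -79.33214}}},
    {"venue": {"name": "KFC",
               "categories": [{"name": "Fast Food Restaurant"}],
               "location": {"lat": 43.754387, "lng": -79.333021}}},
]

# Nested dict keys are joined with "." while lists are left as they are.
flat = pd.json_normalize(sample_items)
print(list(flat.columns))

# Keeping only the last part of each dotted name gives the short column names.
short_names = [col.split(".")[-1] for col in flat.columns]
print(short_names)
```

Because lists are not flattened, `venue.categories` still holds a list per row, which is why `get_category_type` is needed to pull out the first category name.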


And how many venues were returned by Foursquare?


In [29]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.


<a id='item2'></a>


### 3.8 Explore Neighborhoods in North York


#### 3.8.1 Let's create a function to repeat the same process to all the neighborhoods in North York


In [30]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
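One thing to watch in `getNearbyVenues`: `v['venue']['categories'][0]` raises an `IndexError` for a venue whose category list is empty, a case that `get_category_type` earlier handles by returning `None`. Here is a minimal sketch of the row-building step with the same guard. The helper name `venue_rows` and the sample data are hypothetical, and the second venue is invented to show the empty case.

```python
def venue_rows(name, lat, lng, items):
    """Build one tuple per venue, using None when a venue has no category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Illustrative items; "Uncategorized place" is made up for the example.
sample_items = [
    {"venue": {"name": "Brookbanks Park", "categories": [{"name": "Park"}],
               "location": {"lat": 43.751976, "lng": -79.33214}}},
    {"venue": {"name": "Uncategorized place", "categories": [],
               "location": {"lat": 43.75, "lng": -79.33}}},
]
rows = venue_rows("Parkwoods", 43.753259, -79.329656, sample_items)
print(rows)
```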

#### 3.8.2 Now write the code to run the above function on each neighborhood and create a new dataframe called _north_york_venues_.


In [31]:
north_york_venues = getNearbyVenues(names=north_york_data['Neighborhood'],
                                   latitudes=north_york_data['Latitude'],
                                   longitudes=north_york_data['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Manor, Lawrence Heights
Don Mills North
Glencairn
Don Mills South
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Fairview, Henry Farm, Oriole
Northwood Park, York University
Bayview Village
Downsview East
York Mills, Silver Hills
Downsview West
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Willowdale, Newtonbrook
Downsview Central
Bedford Park, Lawrence Manor East
Humberlea, Emery
Willowdale South
Downsview Northwest
York Mills West
Willowdale West


#### Let's check the size of the resulting dataframe


In [32]:
print(north_york_venues.shape)
north_york_venues.head()

(246, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


Let's check how many venues were returned for each neighborhood


In [33]:
north_york_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",25,25,25,25,25,25
Don Mills North,5,5,5,5,5,5
Don Mills South,23,23,23,23,23,23
Downsview Central,2,2,2,2,2,2
Downsview East,2,2,2,2,2,2
Downsview Northwest,5,5,5,5,5,5
Downsview West,5,5,5,5,5,5
"Fairview, Henry Farm, Oriole",65,65,65,65,65,65


#### Let's find out how many unique categories can be curated from all the returned venues


In [34]:
print('There are {} unique categories.'.format(len(north_york_venues['Venue Category'].unique())))

There are 99 unique categories.


<a id='item3'></a>


#### 3.8.3 Analyze Each Neighborhood


In [35]:
# one hot encoding
north_york_onehot = pd.get_dummies(north_york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
north_york_onehot['Neighborhood'] = north_york_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [north_york_onehot.columns[-1]] + list(north_york_onehot.columns[:-1])
north_york_onehot = north_york_onehot[fixed_columns]

north_york_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Sporting Goods Shop,Supermarket,Supplement Shop,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.


In [36]:
north_york_onehot.shape

(246, 100)

#### Next, let's group rows by neighborhood and take the mean of each one-hot column, which gives the frequency of occurrence of each category


In [37]:
north_york_grouped = north_york_onehot.groupby('Neighborhood').mean().reset_index()
north_york_grouped

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Sporting Goods Shop,Supermarket,Supplement Shop,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store
0,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.095238,...,0.0,0.047619,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.04,0.04,0.0,0.04,0.0,0.0,0.0
3,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Don Mills South,0.0,0.0,0.0,0.043478,0.0,0.043478,0.0,0.0,0.0,...,0.043478,0.043478,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0
5,Downsview Central,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Downsview East,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Downsview Northwest,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Downsview West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Fairview, Henry Farm, Oriole",0.015385,0.0,0.015385,0.0,0.0,0.015385,0.0,0.030769,0.030769,...,0.015385,0.0,0.015385,0.0,0.0,0.015385,0.015385,0.015385,0.0,0.030769
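Why does the mean give a frequency? Each venue is a row with a single 1 in its category column, so the column mean within a neighborhood is the share of that neighborhood's venues in that category. Here is a toy sketch with made-up neighborhoods `A` and `B` (not data from the notebook):

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Venue Category": ["Park", "Park", "Café", "Bank", "Park"],
})

# dtype=int forces 0/1 columns regardless of the pandas version's default.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Neighborhood `A` has two parks out of four venues, so its `Park` value is 0.5, and each row of `freq` sums to 1.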


#### Let's confirm the new size


In [38]:
north_york_grouped.shape

(22, 100)

#### Let's print each neighborhood along with the top 5 most common venues


In [39]:
num_top_venues = 5

for hood in north_york_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = north_york_grouped[north_york_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Wilson Heights, Downsview North----
               venue  freq
0        Coffee Shop  0.10
1               Bank  0.10
2     Ice Cream Shop  0.05
3     Sandwich Place  0.05
4  Mobile Phone Shop  0.05


----Bayview Village----
                 venue  freq
0   Chinese Restaurant  0.25
1                 Café  0.25
2                 Bank  0.25
3  Japanese Restaurant  0.25
4    Accessories Store  0.00


----Bedford Park, Lawrence Manor East----
                venue  freq
0          Restaurant  0.08
1  Italian Restaurant  0.08
2         Coffee Shop  0.08
3      Sandwich Place  0.08
4    Greek Restaurant  0.04


----Don Mills North----
                  venue  freq
0                   Gym   0.2
1  Caribbean Restaurant   0.2
2                  Café   0.2
3   Japanese Restaurant   0.2
4          Dessert Shop   0.2


----Don Mills South----
            venue  freq
0     Coffee Shop  0.09
1      Restaurant  0.09
2             Gym  0.09
3  Sandwich Place  0.04
4       Bike Shop 

#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.


In [40]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.


In [41]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix beyond the third position
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = north_york_grouped['Neighborhood']

for ind in np.arange(north_york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(north_york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Ice Cream Shop,Sandwich Place,Mobile Phone Shop,Park,Diner,Pharmacy,Pizza Place,Restaurant
1,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Accessories Store,Lounge,Movie Theater,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
2,"Bedford Park, Lawrence Manor East",Restaurant,Italian Restaurant,Coffee Shop,Sandwich Place,Greek Restaurant,Grocery Store,Indian Restaurant,Juice Bar,Liquor Store,Locksmith
3,Don Mills North,Gym,Caribbean Restaurant,Café,Japanese Restaurant,Dessert Shop,Accessories Store,Lounge,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
4,Don Mills South,Coffee Shop,Restaurant,Gym,Sandwich Place,Bike Shop,Discount Store,Clothing Store,Chinese Restaurant,Bus Line,Bubble Tea Shop
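The `indicators` list with the `try`/`except` fallback produces correct column names for 1 through 10, but it would label position 21 as "21th". If the number of top venues ever grows, a small helper (hypothetical name `ordinal`) is one way to build the names, sketched here:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```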


<a id='item4'></a>


### 3.9. Cluster Neighborhoods


Run _k_-means to cluster the neighborhoods into 5 clusters.


In [42]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

north_york_grouped_clustering = north_york_grouped.drop(columns='Neighborhood')  # keep only the numeric frequency columns

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(north_york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 4, 1, 4, 1, 2, 0, 1, 1, 1], dtype=int32)
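The notebook fixes the number of clusters at 5. One common way to sanity-check such a choice is to compare the inertia (within-cluster sum of squares) for several values of _k_ and look for the point where it stops dropping sharply. The sketch below does this on synthetic data with three obvious groups; it is not run on the notebook's data and is not a claim about the best _k_ for North York.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three tight groups around three corners (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
X = np.vstack([c + rng.normal(0, 0.05, size=(8, 3)) for c in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 3))
```

On this toy data the inertia falls steeply up to `k=3` and only slightly afterwards. With `north_york_grouped_clustering` in place of `X`, the same loop would show whether 5 sits near such a bend.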

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.


In [43]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

north_york_merged = north_york_data

# merge north_york_grouped with north_york_data to add latitude/longitude for each neighborhood
north_york_merged = north_york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

north_york_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0.0,Park,Food & Drink Shop,Fast Food Restaurant,Liquor Store,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Metro Station,Mediterranean Restaurant,Massage Studio
1,M4A,North York,Victoria Village,43.725882,-79.315572,1.0,Coffee Shop,Hockey Arena,Pizza Place,Intersection,Financial or Legal Service,Portuguese Restaurant,Lounge,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1.0,Clothing Store,Furniture / Home Store,Accessories Store,Boutique,Vietnamese Restaurant,Miscellaneous Shop,Coffee Shop,Women's Store,Bank,Pharmacy
3,M3B,North York,Don Mills North,43.745906,-79.352188,4.0,Gym,Caribbean Restaurant,Café,Japanese Restaurant,Dessert Shop,Accessories Store,Lounge,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
4,M6B,North York,Glencairn,43.709577,-79.445073,4.0,Playground,Bakery,Italian Restaurant,Japanese Restaurant,Park,Accessories Store,Locksmith,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
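Notice that `Cluster Labels` shows up as `0.0`, `1.0`, ... rather than integers. The venue search ran over 24 neighborhoods but the grouped frame has only 22 rows, so two neighborhoods returned no venues and get no label in the join; pandas stores the resulting missing values as `NaN`, which turns the column into floats. Here is a toy sketch of the effect, and of one possible way to handle it by dropping the unlabeled rows (an alternative, not what the notebook does):

```python
import pandas as pd

# Toy frames: neighborhood "B" has no venues, so it has no cluster label.
hoods = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
labels = pd.DataFrame({"Neighborhood": ["A", "C"], "Cluster Labels": [0, 1]})

merged = hoods.join(labels.set_index("Neighborhood"), on="Neighborhood")
print(merged["Cluster Labels"].dtype)  # float, because of the NaN for "B"

# Keep only labeled neighborhoods and restore integer labels.
clustered = merged.dropna(subset=["Cluster Labels"]).copy()
clustered["Cluster Labels"] = clustered["Cluster Labels"].astype(int)
print(clustered)
```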


Finally, let's visualize the resulting clusters


In [44]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
new_clusters = north_york_merged['Cluster Labels'].replace(np.nan,0) # neighborhoods with no venues have NaN labels; map them to 0 so int(cluster) below works (they are drawn with the cluster 0 color)

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(north_york_merged['Latitude'], north_york_merged['Longitude'], north_york_merged['Neighborhood'], new_clusters):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color = rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>


### 3.10. Examine Clusters


Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.
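One possible way to approach this, sketched on toy data: average the category frequencies within each cluster and compare them with the overall average, so that categories over-represented in a cluster stand out. The frame `freq`, the labels, and the names `cluster_means` and `lift` are all made up for the illustration; in the notebook the inputs would be the frequency columns of `north_york_grouped` and `kmeans.labels_`.

```python
import pandas as pd

# Toy category frequencies for three neighborhoods (illustrative only).
freq = pd.DataFrame(
    {"Park": [0.5, 0.4, 0.0],
     "Coffee Shop": [0.0, 0.1, 0.6],
     "Bank": [0.5, 0.5, 0.4]},
    index=["A", "B", "C"],
)
labels = pd.Series([0, 0, 1], index=freq.index, name="Cluster Labels")

cluster_means = freq.groupby(labels).mean()
overall = freq.mean()

# Positive values mark categories that are over-represented in a cluster.
lift = cluster_means - overall
top = {c: lift.loc[c].nlargest(2).index.tolist() for c in lift.index}
print(top)
```

Here cluster 0 is distinguished mainly by parks and cluster 1 by coffee shops, which is the kind of observation a cluster name can be based on.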


#### Cluster 1


In [45]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 0, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,0.0,Park,Food & Drink Shop,Fast Food Restaurant,Liquor Store,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Metro Station,Mediterranean Restaurant,Massage Studio
11,North York,0.0,Park,Airport,Vietnamese Restaurant,Video Game Store,Miscellaneous Shop,Middle Eastern Restaurant,Metro Station,Mediterranean Restaurant,Massage Studio,Luggage Store
14,North York,0.0,Construction & Landscaping,Bakery,Basketball Court,Park,Accessories Store,Locksmith,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Metro Station
22,North York,0.0,Park,Convenience Store,Electronics Store,Liquor Store,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Metro Station,Mediterranean Restaurant,Massage Studio


#### Cluster 2


In [46]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 1, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,1.0,Coffee Shop,Hockey Arena,Pizza Place,Intersection,Financial or Legal Service,Portuguese Restaurant,Lounge,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
2,North York,1.0,Clothing Store,Furniture / Home Store,Accessories Store,Boutique,Vietnamese Restaurant,Miscellaneous Shop,Coffee Shop,Women's Store,Bank,Pharmacy
5,North York,1.0,Coffee Shop,Restaurant,Gym,Sandwich Place,Bike Shop,Discount Store,Clothing Store,Chinese Restaurant,Bus Line,Bubble Tea Shop
6,North York,1.0,Dog Run,Golf Course,Mediterranean Restaurant,Pool,Fast Food Restaurant,Korean Restaurant,Miscellaneous Shop,Middle Eastern Restaurant,Metro Station,Massage Studio
7,North York,1.0,Coffee Shop,Bank,Ice Cream Shop,Sandwich Place,Mobile Phone Shop,Park,Diner,Pharmacy,Pizza Place,Restaurant
8,North York,1.0,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Women's Store,Shoe Store,Food Court,Japanese Restaurant,Juice Bar,Cosmetics Shop
9,North York,1.0,Caribbean Restaurant,Furniture / Home Store,Miscellaneous Shop,Metro Station,Bar,Massage Studio,Coffee Shop,Accessories Store,Lounge,Movie Theater
13,North York,1.0,Grocery Store,Park,Bank,Shopping Mall,Liquor Store,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Metro Station,Mediterranean Restaurant
18,North York,1.0,Restaurant,Italian Restaurant,Coffee Shop,Sandwich Place,Greek Restaurant,Grocery Store,Indian Restaurant,Juice Bar,Liquor Store,Locksmith
20,North York,1.0,Ramen Restaurant,Sushi Restaurant,Pizza Place,Shopping Mall,Café,Coffee Shop,Liquor Store,Lounge,Fast Food Restaurant,Korean Restaurant


#### Cluster 3


In [47]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 2, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,North York,2.0,Food Truck,Baseball Field,Accessories Store,Liquor Store,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Metro Station,Mediterranean Restaurant,Massage Studio
19,North York,2.0,Baseball Field,Accessories Store,Korean Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Metro Station,Mediterranean Restaurant,Massage Studio,Luggage Store


#### Cluster 4


In [48]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 3, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,North York,3.0,Furniture / Home Store,Intersection,Shopping Mall,Accessories Store,Korean Restaurant,Miscellaneous Shop,Middle Eastern Restaurant,Metro Station,Mediterranean Restaurant,Massage Studio


#### Cluster 5


In [49]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 4, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,North York,4.0,Gym,Caribbean Restaurant,Café,Japanese Restaurant,Dessert Shop,Accessories Store,Lounge,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
4,North York,4.0,Playground,Bakery,Italian Restaurant,Japanese Restaurant,Park,Accessories Store,Locksmith,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
10,North York,4.0,Chinese Restaurant,Café,Bank,Japanese Restaurant,Accessories Store,Lounge,Movie Theater,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant


#### <font color='red'> ***** End of Part 3 *****</font>