# IBM Capstone Project

## Segmenting and Clustering Neighborhoods in Toronto

### Part 1

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

Start by creating a new Notebook for this assignment.
Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe:

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

import json 
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl
import csv



### Scrape Wikipedia data and create a clean dataframe

1. The dataframe should consist of three columns: PostalCode, Borough, and Neighborhood
2. Only process the cells that have an assigned borough. 
3. Ignore cells with a borough that is Not assigned.
4. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
5. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
6. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
7. In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
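
The cleaning rules listed above can be sketched on a few hand-written rows. The rows in `raw` are made up for illustration and are not scraped data; only the column names come from the list above.

```python
import pandas as pd

# Hypothetical rows, written by hand for illustration only.
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rules 2 and 3: keep only rows with an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 5: an unassigned neighborhood takes the borough name.
unassigned = clean["Neighborhood"] == "Not assigned"
clean.loc[unassigned, "Neighborhood"] = clean.loc[unassigned, "Borough"]

# Rule 4: one row per postal code, neighborhoods joined by a comma.
clean = (
    clean.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)

print(clean)
print(clean.shape)
```

Filtering before grouping keeps the unassigned rows out of the joined neighborhood strings.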

In [2]:
source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup = BeautifulSoup(source, 'lxml')

table = soup.find("table")
table_rows = table.tbody.find_all("tr")

res = []
for tr in table_rows:
    td = tr.find_all("td")
    row = [cell.text for cell in td]  # separate name so the loop variable tr is not shadowed
    
    # Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
    if row != [] and row[1] != "Not assigned":
        # If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.
        if "Not assigned" in row[2]: 
            row[2] = row[1]
        res.append(row)

# Dataframe with 3 columns
df = pd.DataFrame(res, columns = ["PostalCode", "Borough", "Neighborhood"])
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


In [3]:
df["Neighborhood"] = df["Neighborhood"].str.replace("\n","")
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned
1,M2A\n,Not assigned\n,Not assigned
2,M3A\n,North York\n,Parkwoods
3,M4A\n,North York\n,Victoria Village
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront"


In [4]:
df = df.groupby(["PostalCode", "Borough"])["Neighborhood"].apply(", ".join).reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned
1,M1B\n,Scarborough\n,"Malvern, Rouge"
2,M1C\n,Scarborough\n,"Rouge Hill, Port Union, Highland Creek"
3,M1E\n,Scarborough\n,"Guildwood, Morningside, West Hill"
4,M1G\n,Scarborough\n,Woburn


In [5]:
print("Shape: ", df.shape)

Shape:  (180, 3)


In [6]:
#drop unassigned values from dataframe
df.drop(df[df['Borough']=="Not assigned"].index,axis=0, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned
1,M1B\n,Scarborough\n,"Malvern, Rouge"
2,M1C\n,Scarborough\n,"Rouge Hill, Port Union, Highland Creek"
3,M1E\n,Scarborough\n,"Guildwood, Morningside, West Hill"
4,M1G\n,Scarborough\n,Woburn
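
The first row above is still `Not assigned` after the drop. The scraped cell text ends with a newline (`Not assigned\n`), so an equality test against `"Not assigned"` never matches. A minimal sketch of one way around this, using made-up rows, is to strip whitespace from every text column before filtering:

```python
import pandas as pd

# Hypothetical rows that mimic the trailing newlines in the scraped cells.
df_demo = pd.DataFrame(
    {
        "PostalCode": ["M1A\n", "M1B\n"],
        "Borough": ["Not assigned\n", "Scarborough\n"],
        "Neighborhood": ["Not assigned", "Malvern, Rouge"],
    }
)

# Strip leading and trailing whitespace from every column.
for col in df_demo.columns:
    df_demo[col] = df_demo[col].str.strip()

# The comparison now matches, so the unassigned row is removed.
df_demo = df_demo[df_demo["Borough"] != "Not assigned"].reset_index(drop=True)
print(df_demo)
```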


### Part 2

### Getting the geo-data needed for the Foursquare API

Now that we have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data.

To do this we could leverage the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The catch with this package is that it does not always return coordinates on the first try. A call for a given postal code can return None, and the same call made again can succeed. To make sure we get coordinates for every neighborhood, we can run a while loop for each postal code, repeating the call until a result comes back. This notebook takes a simpler route and reads the coordinates from a prepared CSV file, data/Geospatial_Coordinates.csv.
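
For reference, the retry loop described above might be sketched as follows. The helper `geocode_with_retry` and the stand-in `flaky_lookup` are hypothetical names introduced here, and the loop is capped at a fixed number of attempts so that it cannot run forever.

```python
def geocode_with_retry(postal_code, lookup, max_attempts=10):
    """Call `lookup` until it returns coordinates or attempts run out.

    `lookup` is any callable that takes a postal code and returns
    [latitude, longitude] or None. Both names are hypothetical helpers
    introduced for this sketch.
    """
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
    return None  # give up instead of looping forever


# Stand-in for a flaky geocoder: fails twice, then succeeds.
# The coordinates are placeholder values, not real geocoding results.
calls = {"count": 0}


def flaky_lookup(postal_code):
    calls["count"] += 1
    return None if calls["count"] < 3 else [0.0, 0.0]


coords = geocode_with_retry("M5G", flaky_lookup)
print(coords, "after", calls["count"], "attempts")
```

With the Geocoder package, `lookup` could be something like `lambda pc: geocoder.arcgis('{}, Toronto, Ontario'.format(pc)).latlng`, assuming the ArcGIS provider is reachable from your environment.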

In [7]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

#wikipedia_link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#raw_wikipedia_page= requests.get(wikipedia_link).text

#soup = BeautifulSoup(raw_wikipedia_page,'lxml')
#print(soup.prettify())

df_geo_coor = pd.read_csv("data/Geospatial_Coordinates.csv")
df_geo_coor.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


##### Merge the two dataframes:

In [17]:
df_toronto = pd.concat([df, df_geo_coor], axis=1)
df_toronto.head(20)

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1A\n,Not assigned\n,Not assigned,M1B,43.806686,-79.194353
1,M1B\n,Scarborough\n,"Malvern, Rouge",M1C,43.784535,-79.160497
2,M1C\n,Scarborough\n,"Rouge Hill, Port Union, Highland Creek",M1E,43.763573,-79.188711
3,M1E\n,Scarborough\n,"Guildwood, Morningside, West Hill",M1G,43.770992,-79.216917
4,M1G\n,Scarborough\n,Woburn,M1H,43.773136,-79.239476
5,M1H\n,Scarborough\n,Cedarbrae,M1J,43.744734,-79.239476
6,M1J\n,Scarborough\n,Scarborough Village,M1K,43.727929,-79.262029
7,M1K\n,Scarborough\n,"Kennedy Park, Ionview, East Birchmount Park",M1L,43.711112,-79.284577
8,M1L\n,Scarborough\n,"Golden Mile, Clairlea, Oakridge",M1M,43.716316,-79.239476
9,M1M\n,Scarborough\n,"Cliffside, Cliffcrest, Scarborough Village West",M1N,43.692657,-79.264848
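
In the table above the `PostalCode` and `Postal Code` columns disagree row by row (for example, `M1A` sits next to the coordinates for `M1B`), because `pd.concat(..., axis=1)` lines rows up by index rather than by postal code. A hedged sketch of a key-based merge, using a few rows copied from the outputs above:

```python
import pandas as pd

# A few rows copied from the outputs above (postal codes keep their "\n").
left = pd.DataFrame({
    "PostalCode": ["M1B\n", "M1C\n", "M1E\n"],
    "Borough": ["Scarborough\n"] * 3,
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# Normalize the key so that "M1B\n" and "M1B" compare equal.
left["PostalCode"] = left["PostalCode"].str.strip()

# Match rows on the postal code, then drop the duplicate key column.
merged = left.merge(right, how="left", left_on="PostalCode", right_on="Postal Code")
merged = merged.drop(columns="Postal Code")
print(merged)
```

A left merge keeps every row of the neighborhood dataframe and leaves the coordinates empty where no postal code matches, which makes missing matches easy to spot.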


In [9]:
#df_toronto = pd.merge(df, df_geo_coor, how='left', left_on = 'PostalCode', right_on = 'Postal Code')
# remove the "Postal Code" column
#df_toronto.drop("Postal Code", axis=1, inplace=True)
#df_toronto.head()

In [26]:
df_toronto.dropna(subset = ["Latitude", "Longitude", "Borough","Neighborhood"], inplace=True)

In [27]:
df_toronto.shape

(103, 6)

In [28]:
from  geopy.geocoders import Nominatim
import folium

In [29]:
address = "Toronto, ON"

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto city are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto city are 43.6534817, -79.3839347.


### Create several maps with the Toronto City data superimposed.

In [30]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
map_toronto

#### Let's now add markers to the map

In [35]:
for lat, lng, borough, neighborhood in zip(
        df_toronto['Latitude'], 
        df_toronto['Longitude'], 
        df_toronto['Borough'], 
        df_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

map_toronto

Create a dataframe that contains only the boroughs with Toronto in their name, e.g. "East Toronto" and "Central Toronto".

In [32]:
# Keep only the boroughs whose name contains "Toronto"
df_toronto_1 = df_toronto[df_toronto['Borough'].str.contains("Toronto")].reset_index(drop=True)
df_toronto_1.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M4E\n,East Toronto\n,The Beaches,M5N,43.711695,-79.416936
1,M4K\n,East Toronto\n,"The Danforth West, Riverdale",M5T,43.653206,-79.400049
2,M4L\n,East Toronto\n,"India Bazaar, The Beaches West",M5V,43.628947,-79.39442
3,M4M\n,East Toronto\n,Studio District,M5W,43.646435,-79.374846
4,M4N\n,Central Toronto\n,Lawrence Park,M5X,43.648429,-79.38228
5,M4P\n,Central Toronto\n,Davisville North,M6A,43.718518,-79.464763
6,M4R\n,Central Toronto\n,"North Toronto West, Lawrence Park",M6B,43.709577,-79.445073
7,M4S\n,Central Toronto\n,Davisville,M6C,43.693781,-79.428191
8,M4T\n,Central Toronto\n,"Moore Park, Summerhill East",M6E,43.689026,-79.453512
9,M4V\n,Central Toronto\n,"Summerhill West, Rathnelly, South Hill, Forest...",M6G,43.669542,-79.422564


In [37]:
map_toronto_1 = folium.Map(location=[latitude, longitude], zoom_start=12)
for lat, lng, borough, neighborhood in zip(
        df_toronto_1['Latitude'], 
        df_toronto_1['Longitude'], 
        df_toronto_1['Borough'], 
        df_toronto_1['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_1)  

map_toronto_1

## Exploring and Segmenting Toronto Neighborhood data

We will call the Foursquare API to get data on places of interest in Toronto. The API returns JSON, which needs cleaning before it can be loaded into a dataframe.

In [73]:
from pandas import json_normalize  # transform JSON into a pandas dataframe (pandas >= 1.0)

##### Define Foursquare credentials:

In [74]:
CLIENT_ID = 'K2D5QVKQ0QPUPIKECQU2JIB3CHVFT4ZMTJQC5WVTRNYTDD5Z'
CLIENT_SECRET = '5ZVFLDC04W0W5SHCBZ11MDRG0MELPIER3QIFUOAWFAIZKJNK'
VERSION = '20200808'
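
Hard-coding credentials in a notebook exposes them to anyone who can read it. A minimal sketch of reading them from environment variables instead follows; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions made for this example, not names required by Foursquare.

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping such as os.environ."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set")
    return client_id, client_secret


# Demonstration with a plain dict standing in for the environment.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
CLIENT_ID_DEMO, CLIENT_SECRET_DEMO = load_credentials(demo_env)
print(CLIENT_ID_DEMO)
```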

In [75]:
neighborhood_name = df_toronto_1.loc[0, 'Neighborhood']
print(f"The first neighborhood's name is '{neighborhood_name}'.")

The first neighborhood's name is 'The Beaches'.


In [76]:
neighborhood_latitude = df_toronto_1.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_toronto_1.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of The Beaches are 43.7116948, -79.41693559999999.


In [77]:
LIMIT = 300 # limit of number of venues returned by Foursquare API
radius = 1000 # define the radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

# send the request and parse the JSON response
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ed61520c546f3001b8ddc05'},
 'response': {'headerLocation': 'Lawrence Park South',
  'headerFullLocation': 'Lawrence Park South, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 25,
  'suggestedBounds': {'ne': {'lat': 43.72069480900001,
    'lng': -79.40450770787952},
   'sw': {'lat': 43.702694790999985, 'lng': -79.42936349212046}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4fdc0e98e4b05197cd14912b',
       'name': 'The Abbot',
       'location': {'address': '508 Eglinton Ave W',
        'crossStreet': 'Heddington Ave',
        'lat': 43.703687730512435,
        'lng': -79.41348481516249,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.703687730512435,
          'lng': -79.4134848151

Define a **get_category_type** function that returns the name of the first category listed for a venue, or None if the venue has no category.

In [78]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # the flattened dataframe uses the 'venue.categories' key
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
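
To see what the function returns, here is a short sketch that applies it to two hand-written rows shaped like the flattened Foursquare items. The second row, including the name "Unnamed place", is made up to show the empty-category case.

```python
def get_category_type(row):
    # Same lookup as above: try 'categories', then 'venue.categories'.
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Hand-written rows shaped like the flattened Foursquare items.
with_category = {'venue.name': 'The Abbot', 'venue.categories': [{'name': 'Gastropub'}]}
without_category = {'venue.name': 'Unnamed place', 'venue.categories': []}

print(get_category_type(with_category))     # Gastropub
print(get_category_type(without_category))  # None
```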

Clean data and pass it into a pandas dataframe.

In [79]:
venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues

  


Unnamed: 0,name,categories,lat,lng
0,The Abbot,Gastropub,43.703688,-79.413485
1,Hotel Gelato,Café,43.703478,-79.414311
2,The Mad Bean Coffee House,Coffee Shop,43.703529,-79.413698
3,7 Numbers,Italian Restaurant,43.70363,-79.413724
4,Ferraro,Italian Restaurant,43.703655,-79.413167
5,Tokyo Sushi,Sushi Restaurant,43.704146,-79.410631
6,Phipps Bakery Cafe,Bakery,43.704116,-79.411135
7,EDO,Japanese Restaurant,43.703754,-79.412802
8,Forest Hill Arena,Skating Rink,43.704289,-79.420367
9,Starbucks,Coffee Shop,43.704171,-79.411887


Next, write a function that repeats the steps above for every neighborhood, again using the Foursquare API. We will then combine its output with the earlier dataframe that contains only the boroughs with Toronto in their name.

In [80]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [82]:
toronto_1_venues = getNearbyVenues(names=df_toronto_1['Neighborhood'],
                                   latitudes=df_toronto_1['Latitude'],
                                   longitudes=df_toronto_1['Longitude']
                                  )
toronto_1_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.711695,-79.416936,Ceiling Champions,43.713891,-79.420702,Home Service
1,The Beaches,43.711695,-79.416936,Rosalind's Garden Oasis,43.712189,-79.411978,Garden
2,"The Danforth West, Riverdale",43.653206,-79.400049,Kid Icarus,43.653933,-79.401719,Arts & Crafts Store
3,"The Danforth West, Riverdale",43.653206,-79.400049,Essence of Life Organics,43.654111,-79.400431,Organic Grocery
4,"The Danforth West, Riverdale",43.653206,-79.400049,Blackbird Baking Co,43.654764,-79.400566,Bakery


How many venues were returned for each neighborhood?

In [83]:
toronto_1_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,14,14,14,14,14,14
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",8,8,8,8,8,8
Central Bay Street,36,36,36,36,36,36
Church and Wellesley,23,23,23,23,23,23
"Commerce Court, Victoria Hotel",12,12,12,12,12,12
Davisville,5,5,5,5,5,5
Davisville North,13,13,13,13,13,13
"First Canadian Place, Underground city",1,1,1,1,1,1
"Forest Hill North & West, Forest Hill Road Park",2,2,2,2,2,2
"Harbourfront East, Union Station, Toronto Islands",12,12,12,12,12,12


In [84]:
print('There are {} unique categories.'.format(len(toronto_1_venues['Venue Category'].unique())))

There are 183 unique categories.


#### Let's now analyze each neighborhood:

In [85]:
# one hot encoding
toronto_denc_onehot = pd.get_dummies(toronto_1_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_denc_onehot['Neighborhood'] = toronto_1_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [toronto_denc_onehot.columns[-1]] + list(toronto_denc_onehot.columns[:-1])
toronto_denc_onehot = toronto_denc_onehot[fixed_columns]

toronto_denc_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Art Gallery,...,Thrift / Vintage Store,Trail,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Group the rows by neighborhood, taking the mean of the frequency of occurrence of each category.

In [86]:
toronto_denc_grouped = toronto_denc_onehot.groupby('Neighborhood').mean().reset_index()
toronto_denc_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Art Gallery,...,Thrift / Vintage Store,Trail,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.027778
3,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
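
To make the one-hot encoding and the grouped mean concrete, here is a toy sketch. The neighborhoods `A` and `B` and their venues are invented for illustration.

```python
import pandas as pd

# Hypothetical venues: neighborhood A has 2 coffee shops and 1 park, B has 1 park.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Park"],
})

# One column per category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, so neighborhoods with very different venue counts become comparable.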


In [87]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # beyond 'st', 'nd', 'rd' every ordinal ends in 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_denc_grouped['Neighborhood']

for ind in np.arange(toronto_denc_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_denc_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Breakfast Spot,Gift Shop,Coffee Shop,Movie Theater,Eastern European Restaurant,Italian Restaurant,Bar,Dog Run,Restaurant,Dessert Shop
1,"CN Tower, King and Spadina, Railway Lands, Har...",Pet Store,Beer Store,Pharmacy,Pizza Place,Café,Liquor Store,Shopping Plaza,Coffee Shop,Gaming Cafe,Cuban Restaurant
2,Central Bay Street,Coffee Shop,Café,Pub,Italian Restaurant,Pizza Place,Sushi Restaurant,Bar,Smoothie Shop,IT Services,Falafel Restaurant
3,Church and Wellesley,Café,Coffee Shop,Breakfast Spot,Performing Arts Venue,Bakery,Convenience Store,Pet Store,Climbing Gym,Burrito Place,Restaurant
4,"Commerce Court, Victoria Hotel",Café,Liquor Store,Gym,Bakery,Mexican Restaurant,Pharmacy,American Restaurant,Pizza Place,Seafood Restaurant,Fast Food Restaurant


Now save the grouped Toronto data to a CSV file so that we can load it into a new notebook.

In [96]:
toronto_denc_grouped.to_csv('Toronto_data.csv',sep='\t')
toronto_denc_grouped.to_csv('/Users/Toronto_data.csv')

## Cluster the neighborhood data

In [98]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [110]:
# set number of clusters
kclusters = 5

toronto_denc_grouped_clustering = toronto_denc_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, n_init = 10, random_state=0).fit(toronto_denc_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0], dtype=int32)

In [106]:
kmeans.cluster_centers_

array([[ 7.32600733e-03,  2.80112045e-03,  2.80112045e-03,
         5.60224090e-03,  8.40336134e-03,  5.60224090e-03,
         9.87711214e-03,  5.12032770e-04,  2.58250710e-03,
         4.11398081e-03,  4.67532468e-03,  2.97619048e-03,
         5.12032770e-04,  2.97619048e-03,  5.12032770e-04,
         2.15597225e-02,  5.62169312e-03,  3.33803454e-02,
        -1.38777878e-17,  5.12032770e-04,  8.07102502e-04,
         3.81123059e-03,  7.03463203e-03,  8.07102502e-04,
         5.12032770e-04,  2.80112045e-03,  7.69646210e-03,
         6.46412411e-03,  4.76190476e-04,  1.32508664e-02,
         7.03463203e-03,  4.76190476e-04,  6.26904796e-03,
         1.61304466e-02,  6.93889390e-18,  4.93535058e-02,
         1.98412698e-03,  2.97619048e-03,  8.07102502e-04,
         1.83116804e-03,  5.12032770e-04,  2.07039337e-03,
         7.83804010e-03,  3.42545189e-03,  6.01847272e-02,
         1.32275132e-03,  4.76190476e-04,  1.31913527e-03,
         2.97619048e-03,  1.46441372e-03,  5.73339704e-0

In [111]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_denc_merged = df_toronto_1

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_denc_merged = toronto_denc_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_denc_merged.head() # check the last columns!


Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E\n,East Toronto\n,The Beaches,M5N,43.711695,-79.416936,4.0,Home Service,Garden,Yoga Studio,Dessert Shop,Event Space,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
1,M4K\n,East Toronto\n,"The Danforth West, Riverdale",M5T,43.653206,-79.400049,0.0,Café,Coffee Shop,Vietnamese Restaurant,Bakery,Mexican Restaurant,Vegetarian / Vegan Restaurant,Dessert Shop,Grocery Store,Bar,Gaming Cafe
2,M4L\n,East Toronto\n,"India Bazaar, The Beaches West",M5V,43.628947,-79.39442,0.0,Airport Service,Airport Lounge,Airport Terminal,Boat or Ferry,Coffee Shop,Boutique,Rental Car Location,Bar,Plane,Sculpture Garden
3,M4M\n,East Toronto\n,Studio District,M5W,43.646435,-79.374846,0.0,Coffee Shop,Café,Japanese Restaurant,Pub,Seafood Restaurant,Beer Bar,Cocktail Bar,Restaurant,Italian Restaurant,Gym
4,M4N\n,Central Toronto\n,Lawrence Park,M5X,43.648429,-79.38228,0.0,Coffee Shop,Café,Japanese Restaurant,Restaurant,Gym,Hotel,Salad Place,Seafood Restaurant,Deli / Bodega,American Restaurant
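
The `Cluster Labels` column above holds floats (`4.0`, `0.0`) rather than integers. That is typically what pandas produces when a join leaves some rows without a match: the gaps are filled with `NaN`, which forces a float column. A small sketch on made-up data of dropping those rows and restoring integer labels:

```python
import pandas as pd

# Hypothetical data: neighborhood "C" has no venues, hence no cluster label.
neighborhoods = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
labels = pd.DataFrame({"Neighborhood": ["A", "B"], "Cluster Labels": [4, 0]})

merged = neighborhoods.join(labels.set_index("Neighborhood"), on="Neighborhood")
print(merged["Cluster Labels"].tolist())  # [4.0, 0.0, nan]

# Keep only labeled rows, then convert the labels back to integers.
merged = merged.dropna(subset=["Cluster Labels"])
merged["Cluster Labels"] = merged["Cluster Labels"].astype(int)
print(merged["Cluster Labels"].tolist())  # [4, 0]
```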


In [128]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label
for lat, lon, poi, cluster in zip(
        toronto_denc_merged['Latitude'], 
        toronto_denc_merged['Longitude'], 
        toronto_denc_merged['Neighborhood'], 
        toronto_denc_merged['Cluster Labels']):
    if pd.isna(cluster):
        continue  # no venues were returned, so there is no cluster label
    cluster = int(cluster)  # the join turned the labels into floats
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [125]:
toronto_denc_merged.loc[toronto_denc_merged['Cluster Labels'] == 0, toronto_denc_merged.columns[[1] + list(range(5, toronto_denc_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,East Toronto\n,-79.400049,0.0,Café,Coffee Shop,Vietnamese Restaurant,Bakery,Mexican Restaurant,Vegetarian / Vegan Restaurant,Dessert Shop,Grocery Store,Bar,Gaming Cafe
2,East Toronto\n,-79.39442,0.0,Airport Service,Airport Lounge,Airport Terminal,Boat or Ferry,Coffee Shop,Boutique,Rental Car Location,Bar,Plane,Sculpture Garden
3,East Toronto\n,-79.374846,0.0,Coffee Shop,Café,Japanese Restaurant,Pub,Seafood Restaurant,Beer Bar,Cocktail Bar,Restaurant,Italian Restaurant,Gym
4,Central Toronto\n,-79.38228,0.0,Coffee Shop,Café,Japanese Restaurant,Restaurant,Gym,Hotel,Salad Place,Seafood Restaurant,Deli / Bodega,American Restaurant
5,Central Toronto\n,-79.464763,0.0,Furniture / Home Store,Accessories Store,Clothing Store,Boutique,Women's Store,Event Space,Miscellaneous Shop,Coffee Shop,Vietnamese Restaurant,American Restaurant
6,Central Toronto\n,-79.445073,0.0,Park,Pizza Place,Japanese Restaurant,Pub,Deli / Bodega,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
7,Central Toronto\n,-79.428191,0.0,Field,Playground,Hockey Arena,Trail,Tennis Court,Yoga Studio,Deli / Bodega,Dumpling Restaurant,Donut Shop,Doner Restaurant
9,Central Toronto\n,-79.422564,0.0,Grocery Store,Café,Park,Candy Store,Coffee Shop,Nightclub,Restaurant,Italian Restaurant,Diner,Baby Store
10,Downtown Toronto\n,-79.442259,0.0,Bakery,Pharmacy,Pool,Pizza Place,Bank,Supermarket,Middle Eastern Restaurant,Café,Bar,Furniture / Home Store
11,Downtown Toronto\n,-79.41975,0.0,Bar,Asian Restaurant,Men's Store,Restaurant,Vegetarian / Vegan Restaurant,Coffee Shop,Café,Yoga Studio,Beer Store,Pizza Place
