# Applied Data Science Capstone Project

### Introduction

This project explores and groups the neighborhoods of the city of Toronto, Canada. The data was retrieved from a webpage and then cleaned. The objective of the study is to form clusters of selected neighborhoods based on their locations. Mapping those clusters gives an idea of the main venues found in each cluster or neighborhood. Businesses or individuals might use this information to pinpoint the location of their future projects.

### Step 1: Scraping the dataset with the BeautifulSoup and Requests libraries

This first step retrieves a table from a webpage using the BeautifulSoup library. The key is to identify the HTML element or attribute ('table', 'div', 'class', 'id', ...) that contains the table (or whatever information is to be retrieved).
For this webpage (URL given in the code below), the data sits in a 'table' element.
The table was written out in text format and then loaded into a dataframe using the pandas library.
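
As a minimal sketch of this idea, the snippet below parses a small inline HTML table and builds the dataframe directly from the extracted rows. The HTML string is a made-up stand-in for the Wikipedia page (so no network access is needed), and the `table_to_frame` helper is a hypothetical name, not part of the notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the downloaded page content
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""


def table_to_frame(html, columns):
    """Return the first <table> found in `html` as a DataFrame with the given column names."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row only has <th> cells, so it is skipped
            rows.append(cells)
    return pd.DataFrame(rows, columns=columns)


sample_df = table_to_frame(SAMPLE_HTML, ["Postcode", "Borough", "Neighborhood"])
print(sample_df)
```

Because the rows go straight into pandas, no fixed column width is involved and long neighborhood names are kept whole.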

In [2]:
# import libraries
from bs4 import BeautifulSoup #use to scrape data online
import requests #use to scrape data online
import pandas as pd # read data into a dataframe format
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library
from sklearn.cluster import KMeans # import k-means from clustering stage
import numpy as np
import matplotlib.cm as cm  # Matplotlib and associated plotting modules
import matplotlib.colors as colors

In [3]:
# set the URL of the webpage and a User-Agent header (taken from Firefox here)
# the browser's User-Agent string can be looked up at www.whoishostingthis.com

url ='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

headers= {'User-Agent': 'Mozilla/5.0'}        

In [4]:
response = requests.get(url, headers=headers)  # send the User-Agent header defined above

In [5]:
# confirm that the request succeeded; the status code should be 200

response.status_code

200

In [6]:
soup = BeautifulSoup(response.content, 'html.parser')

In [7]:
# open the webpage, right-click somewhere in the document and choose 'Inspect Element'
# look for the element or attribute (table, class, div, id, ...) that holds the table
# in this case it is the 'table' element

stat_table = soup.find('table')

In [8]:
# len() of a tag counts its direct children; a value of at least 1 means some data was captured
len(stat_table)

2

In [9]:
# loop over all rows ("tr") and cells ("td") and print the text of each cell
# the "tr" and "td" elements can be seen on the webpage by right-clicking and choosing 'Inspect Element'

for row in stat_table.find_all('tr'):
    for cell in row.find_all('td'):
        print(cell.text)

M1A
Not assigned
Not assigned

M2A
Not assigned
Not assigned

M3A
North York
Parkwoods

M4A
North York
Victoria Village

M5A
Downtown Toronto
Harbourfront

M6A
North York
Lawrence Heights

M6A
North York
Lawrence Manor

M7A
Downtown Toronto
Queen's Park

M8A
Not assigned
Not assigned

M9A
Etobicoke
Islington Avenue

M1B
Scarborough
Rouge

M1B
Scarborough
Malvern

M2B
Not assigned
Not assigned

M3B
North York
Don Mills North

M4B
East York
Woodbine Gardens

M4B
East York
Parkview Hill

M5B
Downtown Toronto
Ryerson

M5B
Downtown Toronto
Garden District

M6B
North York
Glencairn

M7B
Not assigned
Not assigned

M8B
Not assigned
Not assigned

M9B
Etobicoke
Cloverdale

M9B
Etobicoke
Islington

M9B
Etobicoke
Martin Grove

M9B
Etobicoke
Princess Gardens

M9B
Etobicoke
West Deane Park

M1C
Scarborough
Highland Creek

M1C
Scarborough
Rouge Hill

M1C
Scarborough
Port Union

M2C
Not assigned
Not assigned

M3C
North York
Flemingdon Park

M3C
North York
Don Mills South

M4C
East York
Woodbine Height

In [10]:
# write the table to a text file in the working directory
# adjust the column width by changing the number inside ljust() and checking the output text file

with open ('Toronto_stats.txt', 'w') as r:
    for row in stat_table.find_all('tr'):
        for cell in row.find_all('td'):
            r.write(cell.text.ljust(30))
        r.write('\n')

In [11]:
# Read the output text file with pandas library

path=r'C:\Users\HP\Desktop\Skillshare\IBM Data Sciences\Machine Learning\Toronto_stats.txt'
df = pd.read_fwf(path)
df.columns = ["Postcode", "Borough", "Neighborhood"]
df

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,,,
2,M2A,Not assigned,Not assigned
3,,,
4,M3A,North York,Parkwoods
...,...,...,...
569,,,
570,M8Z,Etobicoke,South of Bloor
571,,,
572,M9Z,Not assigned,Not assigned


### Step 2: Data Preprocessing

The dataset was retrieved with many missing elements. The following cells
clean the data and explore it a little before moving forward.

In [12]:
# drop all the rows with NaN
df.dropna(how='all', axis=0, inplace=True)
df

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
4,M3A,North York,Parkwoods
6,M4A,North York,Victoria Village
8,M5A,Downtown Toronto,Harbourfront
...,...,...,...
564,M8Z,Etobicoke,Mimico NW
566,M8Z,Etobicoke,The Queensway West
568,M8Z,Etobicoke,Royal York South We
570,M8Z,Etobicoke,South of Bloor


In [13]:
# drop all rows having 'Not assigned' values in the 'Borough' and 'Neighborhood' column

NA = df[(df['Borough'] == 'Not assigned') & (df['Neighborhood'] == 'Not assigned')].index
df.drop(NA , inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
4,M3A,North York,Parkwoods
6,M4A,North York,Victoria Village
8,M5A,Downtown Toronto,Harbourfront
10,M6A,North York,Lawrence Heights
12,M6A,North York,Lawrence Manor


In [14]:
# drop all rows with 'Not assigned' values in the 'Borough' column
NA1 = df[(df['Borough'] == 'Not assigned')].index
df.drop(NA1, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
4,M3A,North York,Parkwoods
6,M4A,North York,Victoria Village
8,M5A,Downtown Toronto,Harbourfront
10,M6A,North York,Lawrence Heights
12,M6A,North York,Lawrence Manor


In [15]:
# Combine the neighborhoods that share a postcode into a single row, separated by commas

df = df.groupby('Postcode').agg({'Borough':'first', 
                             'Neighborhood': ', '.join}).reset_index()
print (df[['Postcode','Borough','Neighborhood']])  

    Postcode      Borough                                       Neighborhood
0        M1B  Scarborough                                     Rouge, Malvern
1        M1C  Scarborough             Highland Creek, Rouge Hill, Port Union
2        M1E  Scarborough                  Guildwood, Morningside, West Hill
3        M1G  Scarborough                                             Woburn
4        M1H  Scarborough                                          Cedarbrae
..       ...          ...                                                ...
98       M9N         York                                             Weston
99       M9P    Etobicoke                                          Westmount
100      M9R    Etobicoke  Kingsview Village, Martin Grove Garden, Richvi...
101      M9V    Etobicoke  Albion Gardens, Beaumond Heights, Humbergate, ...
102      M9W    Etobicoke                                          Northwest

[103 rows x 3 columns]
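
The cleaning steps above can be condensed into one chained expression. The following is a minimal sketch on a toy dataframe; the rows are a small made-up sample shaped like the scraped table, and the names `raw` and `clean` are illustrative only.

```python
import pandas as pd

# Toy rows shaped like the scraped table: one empty row, one unassigned postcode,
# and one postcode (M1B) that appears twice
raw = pd.DataFrame(
    {
        "Postcode": ["M1A", None, "M1B", "M1B", "M3A"],
        "Borough": ["Not assigned", None, "Scarborough", "Scarborough", "North York"],
        "Neighborhood": ["Not assigned", None, "Rouge", "Malvern", "Parkwoods"],
    }
)

clean = (
    raw.dropna(how="all")                                   # remove fully empty rows
    .loc[lambda d: d["Borough"] != "Not assigned"]          # remove unassigned boroughs
    .groupby("Postcode", as_index=False)                    # one row per postcode
    .agg({"Borough": "first", "Neighborhood": ", ".join})   # join neighborhood names
)
print(clean)
```

Keeping the steps in one chain avoids the repeated `inplace=True` calls and leaves the original dataframe untouched.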


In [16]:
# print the 10 first rows for a quick look

df.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Par, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [17]:
# check the dimension of the dataframe
df.shape

(103, 3)

In [18]:
# check if the column 'Neighborhood' has a cell with 'Not assigned' value
df[(df['Neighborhood'] == 'Not assigned')].count()

Postcode        0
Borough         0
Neighborhood    0
dtype: int64

In [19]:
# check whether there are any missing values ('NaN') left in the data

missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")  

Postcode
False    103
Name: Postcode, dtype: int64

Borough
False    103
Name: Borough, dtype: int64

Neighborhood
False    103
Name: Neighborhood, dtype: int64



In [20]:
# summary statistics 

df.describe()

Unnamed: 0,Postcode,Borough,Neighborhood
count,103,103,103
unique,103,10,103
top,M5S,North York,York Mills West
freq,1,24,1


In [21]:
# get broad information about the dataset (info is a method, so it must be called with parentheses)

df.info()

### Step 3: Data cleaning and wrangling

In [22]:
# import the Neighborhood longitude and latitude values
path_coord = r'C:\Users\HP\Desktop\Skillshare\IBM Data Sciences\Capstone\Geospatial coordinates.csv'
df_coord = pd.read_csv(path_coord)    
df_coord

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [23]:
# Change the 'postal code' name into 'Postcode'
df_coord.rename(columns={'Postal Code':'Postcode'}, inplace = True)
df_coord

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [24]:
# Merge the two dataframes to get a single and final dataset
df_toronto = pd.merge(df, df_coord, how='right', on=['Postcode'])
df_toronto

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Garden, Richvi...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
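
A right merge keeps every row of the coordinates file, including postcodes that have no match in the cleaned table. As a hedged sketch, the `indicator` and `validate` options of `pd.merge` can make such rows visible; the two small dataframes below are toy data, and the postcode `M0X` with zero coordinates is invented for illustration.

```python
import pandas as pd

left = pd.DataFrame(
    {"Postcode": ["M1B", "M1C"], "Borough": ["Scarborough", "Scarborough"]}
)
right = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C", "M0X"],  # M0X is a made-up code with no match on the left
        "Latitude": [43.806686, 43.784535, 0.0],
        "Longitude": [-79.194353, -79.160497, 0.0],
    }
)

merged = pd.merge(
    left,
    right,
    how="right",
    on="Postcode",
    validate="one_to_one",  # raise if a postcode is duplicated on either side
    indicator=True,         # add a _merge column: 'both', 'left_only' or 'right_only'
)
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

An empty `unmatched` frame would indicate that every coordinate row found its borough and neighborhood.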


In [25]:
# check if the Latitude and Longitude values correspond to the dataframe df_coord
df_toronto.loc[df_toronto['Postcode']=='M4X', 'Latitude':'Longitude']

Unnamed: 0,Latitude,Longitude
51,43.667967,-79.367675


In [26]:
# Check if there are any values in the 'Borough' column that start with the word 'Toronto'
df_toronto[df_toronto['Borough'].str.startswith("Toronto")]

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude


In [27]:
# Check if there are any values in the 'Borough' that end with the word 'Toronto'
df_toronto[df_toronto['Borough'].str.endswith("Toronto")]

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049


In [32]:
# let's name the resulting dataframe df_tor
df_tor = df_toronto[df_toronto['Borough'].str.endswith('Toronto')]
df_tor.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [33]:
df_tor.dtypes

Postcode         object
Borough          object
Neighborhood     object
Latitude        float64
Longitude       float64
dtype: object

In [34]:
# explore the new dataframe df_tor a little
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_tor['Borough'].unique()),
        df_tor.shape[0]
    )
)

The dataframe has 4 boroughs and 39 neighborhoods.


In [35]:
# install the geopy library to fetch the geographical coordinates of the city of Toronto
!pip install geopy



In [36]:
# Grab the coordinates of Toronto

user_agent= "Tr_explorer"
address = 'Toronto City, TOR'

geolocator = Nominatim(user_agent=user_agent)
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.7627912, -79.4064452.
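
geopy's `geocode` returns `None` when it finds no match, in which case `location.latitude` would fail. The sketch below shows one possible guard: fall back to the mean of the neighborhood coordinates. The helper name `resolve_center` and the two stub geocoders are hypothetical and only stand in for `geolocator.geocode` so that the example runs offline.

```python
from types import SimpleNamespace

import pandas as pd


def resolve_center(geocode, address, fallback_points):
    """Return (lat, lon) from geocode(address), or the mean of fallback_points if nothing is found."""
    location = geocode(address)
    if location is not None:
        return location.latitude, location.longitude
    return (
        float(fallback_points["Latitude"].mean()),
        float(fallback_points["Longitude"].mean()),
    )


# Two neighborhood coordinates taken from the dataframe above
points = pd.DataFrame(
    {"Latitude": [43.676357, 43.679557], "Longitude": [-79.293031, -79.352188]}
)


def no_match(address):
    """Stub geocoder that never finds anything."""
    return None


def fixed_match(address):
    """Stub geocoder that always returns the same made-up location."""
    return SimpleNamespace(latitude=43.7, longitude=-79.4)


center_fallback = resolve_center(no_match, "Toronto City, TOR", points)
center_found = resolve_center(fixed_match, "Toronto City, TOR", points)
print(center_fallback, center_found)
```

In the notebook this would be called as `resolve_center(geolocator.geocode, address, df_tor)`.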


#### Create a map of Toronto with neighborhoods superimposed on top

In [37]:
# create map of Toronto using latitude and longitude values
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_tor['Latitude'], df_tor['Longitude'], df_tor['Borough'], df_tor['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tor)  
    
map_tor

#### Next, let us start utilizing the Foursquare API to explore the neighborhoods and segment them.

### Define Foursquare Credentials and Version

In [38]:
# keep real credentials out of a shared notebook; fill in your own values before running
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180604'


### Step 4: Explore the neighborhoods in Toronto

Let's create a function that retrieves up to 50 venues in the
vicinity of each neighborhood, within a radius of 500 meters.

In [39]:
LIMIT = 50 
radius = 500 # define radius
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [41]:
Toronto_venues = getNearbyVenues(names=df_tor['Neighborhood'],
                                   latitudes=df_tor['Latitude'],
                                   longitudes=df_tor['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesle
Harbourfront
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Ce
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toron
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 T
First Canadian Plac, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The Junction South
Parkdale, Roncesva
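
The request URL and the response parsing inside `getNearbyVenues` can also be written a little more defensively. The following is only a sketch: `build_explore_url` and `extract_venues` are hypothetical helper names, and the sample payload is hand-written to match the keys the function above reads (`response`, `groups`, `items`, `venue`), not a real API response.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"  # same endpoint as above


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=50):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


def extract_venues(payload, name, lat, lng):
    """Flatten one explore response into rows; return [] if the expected keys are missing."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None  # some venues may lack a category
        rows.append(
            (name, lat, lng, venue["name"],
             venue["location"]["lat"], venue["location"]["lng"], category)
        )
    return rows


# Hand-written payload with the same shape as the keys read above
sample_payload = {
    "response": {"groups": [{"items": [{"venue": {
        "name": "Glen Manor Ravine",
        "location": {"lat": 43.676821, "lng": -79.293942},
        "categories": [{"name": "Trail"}],
    }}]}]}
}

url = build_explore_url("ID", "SECRET", "20180604", 43.676357, -79.293031)
rows = extract_venues(sample_payload, "The Beaches", 43.676357, -79.293031)
print(url)
print(rows)
```

Returning an empty list for an unexpected payload lets the loop continue with the next neighborhood instead of stopping on a `KeyError`.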

#### Let's check the size of the resulting dataframe

In [42]:
print(Toronto_venues.shape)
Toronto_venues.head()

(1221, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,The Beaches,43.676357,-79.293031,Dip 'n Sip,43.678897,-79.297745,Coffee Shop


#### Let's check how many venues were returned for each neighborhood

In [43]:
Toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",50,50,50,50,50,50
Berczy Park,50,50,50,50,50,50
"Brockton, Exhibition Place, Parkdale Village",23,23,23,23,23,23
Business Reply Mail,18,18,18,18,18,18
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",17,17,17,17,17,17
"Cabbagetown, St. James Town",49,49,49,49,49,49
Central Bay Street,50,50,50,50,50,50
"Chinatown, Grange Park, Kensington Market",50,50,50,50,50,50
Christie,18,18,18,18,18,18
Church and Wellesle,50,50,50,50,50,50


#### Let's find out how many unique categories can be curated from all the returned venues

In [44]:
print('There are {} unique categories.'.format(len(Toronto_venues['Venue Category'].unique())))

There are 220 unique categories.


In [45]:
# the most widespread venue category

Toronto_venues['Venue Category'].value_counts().idxmax()

'Coffee Shop'

In [46]:
# Group the Toronto_venues dataframe by neighborhood to prepare for the clustering work
# numeric_only=True keeps only the numeric columns, so "Venue" and "Venue Category" are left out of the mean

tor_grouped = Toronto_venues.groupby('Neighborhood').mean(numeric_only=True).reset_index()
tor_grouped

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Latitude,Venue Longitude
0,"Adelaide, King, Richmond",43.650571,-79.384568,43.649827,-79.384901
1,Berczy Park,43.644771,-79.373306,43.647211,-79.373872
2,"Brockton, Exhibition Place, Parkdale Village",43.636847,-79.428191,43.637283,-79.42584
3,Business Reply Mail,43.662744,-79.321558,43.664212,-79.323205
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.628947,-79.39442,43.63104,-79.395489
5,"Cabbagetown, St. James Town",43.667967,-79.367675,43.665966,-79.368732
6,Central Bay Street,43.657952,-79.387383,43.657708,-79.385135
7,"Chinatown, Grange Park, Kensington Market",43.653206,-79.400049,43.654085,-79.4006
8,Christie,43.669542,-79.422564,43.670515,-79.42399
9,Church and Wellesle,43.66586,-79.38316,43.665645,-79.382891


In [47]:
# check the shape of the new dataframe

tor_grouped.shape

(39, 5)

### Step 5: Clustering Neighborhoods

#### Run k-means to cluster the neighborhood into 5 clusters

In [48]:
# set number of clusters
kclusters = 5

tor_grouped_clustering = tor_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(tor_grouped_clustering)

# check cluster labels generated for each row in the dataframe

labels = kmeans.labels_
print(labels[0:10])

[2 2 4 1 2 2 2 4 4 2]
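
To make the meaning of these labels concrete, here is a small self-contained sketch that runs `KMeans` on six made-up coordinate pairs forming two obvious groups. The points are invented for illustration, and `n_init` is set explicitly so the run does not depend on the library's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up (latitude, longitude) points in two clearly separated groups
points = np.array(
    [
        [43.65, -79.38], [43.66, -79.39], [43.65, -79.40],  # first group
        [43.71, -79.29], [43.72, -79.30], [43.71, -79.31],  # second group
    ]
)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
toy_labels = model.labels_          # one cluster index per input row
toy_centers = model.cluster_centers_  # one (lat, lon) centroid per cluster

print(toy_labels)
print(toy_centers)
```

The label numbers themselves are arbitrary identifiers; only the grouping they describe is meaningful.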


In [49]:
# add clustering labels
tor_grouped["Cluster Labels"] = labels
tor_grouped.head(5)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Latitude,Venue Longitude,Cluster Labels
0,"Adelaide, King, Richmond",43.650571,-79.384568,43.649827,-79.384901,2
1,Berczy Park,43.644771,-79.373306,43.647211,-79.373872,2
2,"Brockton, Exhibition Place, Parkdale Village",43.636847,-79.428191,43.637283,-79.42584,4
3,Business Reply Mail,43.662744,-79.321558,43.664212,-79.323205,1
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.628947,-79.39442,43.63104,-79.395489,2


#### As the saying goes, a picture is worth a thousand words. Data science is no exception. Let's visualize the clusters

In [50]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tor_grouped['Neighborhood Latitude'], tor_grouped['Neighborhood Longitude'], tor_grouped['Neighborhood'], tor_grouped['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
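
The color scheme only needs one distinct color per cluster. A shorter way to build it, sketched below, samples the rainbow colormap at evenly spaced positions; `cluster_palette` is a hypothetical helper name.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_palette(n_clusters):
    """Return one hex color per cluster, evenly spaced along the rainbow colormap."""
    rgba = cm.rainbow(np.linspace(0, 1, n_clusters))
    return [colors.rgb2hex(c) for c in rgba]


palette = cluster_palette(5)
print(palette)
```

With this list, `palette[cluster]` indexes by label directly. The `rainbow[cluster-1]` lookup used in the cell above also yields distinct colors, with label 0 wrapping around to the last entry.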

In [51]:
# Select the neighborhoods that belong to cluster 0
tor_group_L = tor_grouped[tor_grouped['Cluster Labels'] == 0]
tor_group_L

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Latitude,Venue Longitude,Cluster Labels
11,Davisville,43.704324,-79.38879,43.704589,-79.388797,0
12,Davisville North,43.712751,-79.390197,43.712353,-79.39163,0
13,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049,43.687617,-79.396615,0
17,"Forest Hill North, Forest Hill West",43.696948,-79.411307,43.699942,-79.408516,0
22,Lawrence Park,43.72802,-79.38879,43.72784,-79.386683,0
24,"Moore Park, Summerhill East",43.689574,-79.38316,43.692518,-79.384427,0
25,North Toronto West,43.715383,-79.405678,43.715045,-79.400376,0
29,Roselawn,43.711695,-79.416936,43.712189,-79.411978,0


In [52]:
# Determine the venue categories of a specific neighborhood from the list above

VC_Davisville = Toronto_venues[Toronto_venues['Neighborhood']=='Davisville']
VC_Davisville['Venue Category']

143           Dessert Shop
144                   Café
145            Pizza Place
146      Indian Restaurant
147           Dessert Shop
148     Seafood Restaurant
149     Italian Restaurant
150            Coffee Shop
151                   Park
152       Sushi Restaurant
153     Italian Restaurant
154           Dessert Shop
155        Thai Restaurant
156                    Gym
157       Sushi Restaurant
158                Brewery
159         Sandwich Place
160                  Diner
161                    Gym
162       Toy / Game Store
163             Restaurant
164       Greek Restaurant
165            Coffee Shop
166            Gas Station
167         Farmers Market
168           Gourmet Shop
169         Sandwich Place
170                   Café
171            Pizza Place
172               Pharmacy
173                    Spa
174         Sandwich Place
175       Indoor Play Area
176           Optical Shop
177    Japanese Restaurant
Name: Venue Category, dtype: object

In [53]:
# the number of venues in each category in the Davisville neighborhood
VC_Davisville['Venue Category'].value_counts()

Sandwich Place         3
Dessert Shop           3
Coffee Shop            2
Café                   2
Sushi Restaurant       2
Italian Restaurant     2
Pizza Place            2
Gym                    2
Greek Restaurant       1
Spa                    1
Japanese Restaurant    1
Park                   1
Brewery                1
Indian Restaurant      1
Thai Restaurant        1
Seafood Restaurant     1
Restaurant             1
Gourmet Shop           1
Gas Station            1
Diner                  1
Optical Shop           1
Indoor Play Area       1
Farmers Market         1
Toy / Game Store       1
Pharmacy               1
Name: Venue Category, dtype: int64

#### Let's try to cluster the neighborhoods by venue category

In [54]:
# check the different types of venues available in the selected Toronto neighborhoods

Toronto_venues['Venue Category'].value_counts()

Coffee Shop         100
Café                 75
Restaurant           44
Park                 31
Bakery               31
                   ... 
Optical Shop          1
Opera House           1
Theme Restaurant      1
Tennis Court          1
Bus Stop              1
Name: Venue Category, Length: 220, dtype: int64
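
One common way to describe each neighborhood by its venue categories, which is not the approach taken in the cells that follow, is to one-hot encode the categories and average them per neighborhood. The sketch below uses a made-up four-row table; the neighborhood names and the variable names are illustrative assumptions.

```python
import pandas as pd

# Made-up venues for two hypothetical neighborhoods
venues = pd.DataFrame(
    {
        "Neighborhood": ["Neighborhood A", "Neighborhood A", "Neighborhood A", "Neighborhood B"],
        "Venue Category": ["Coffee Shop", "Bakery", "Coffee Shop", "Bakery"],
    }
)

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Share of each category among the venues of a neighborhood
profile = onehot.groupby("Neighborhood").mean()
print(profile)
```

The resulting table has one numeric row per neighborhood and could be passed to `KMeans` in the same way as the coordinate columns.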

In [55]:
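# Select the rows of Toronto_venues whose Venue Category is 'Bakery'
# .copy() gives an independent dataframe, so columns can be added to it later without warnings

tor_grouped_VC = Toronto_venues.groupby('Venue Category').get_group('Bakery').copy()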
tor_grouped_VC

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
34,"The Danforth West, Riverdale",43.679557,-79.352188,Dough Bakeshop,43.676643,-79.356846,Bakery
76,Studio District,43.659526,-79.340923,Brick Street Breads,43.660685,-79.342501,Bakery
90,Studio District,43.659526,-79.340923,Bonjour Brioche,43.659734,-79.346266,Bakery
208,"Cabbagetown, St. James Town",43.667967,-79.367675,Absolute Bakery & Café,43.667469,-79.369277,Bakery
231,"Cabbagetown, St. James Town",43.667967,-79.367675,Daniel et Daniel Event Creation & Catering,43.664217,-79.368269,Bakery
242,"Cabbagetown, St. James Town",43.667967,-79.367675,Tasso Baking Co,43.666571,-79.36878,Bakery
300,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
319,Harbourfront,43.65426,-79.360636,Brick Street Bakery,43.650574,-79.359539,Bakery
326,Harbourfront,43.65426,-79.360636,The Sweet Escape Patisserie,43.650632,-79.358709,Bakery
376,"Ryerson, Garden District",43.657162,-79.378937,Danish Pastry House,43.654574,-79.38074,Bakery


#### A young entrepreneur would like to open a new bakery in Toronto. Clustering the existing bakeries can help pinpoint a location that maximizes profit and limits the fierce competition. 31 bakeries are already listed in the selected neighborhoods.

**Let's run K means to cluster the Bakeries in Toronto**

In [56]:
# let's drop the columns having non-numerical values for the K-means algorithm
tor_grouped_bk = tor_grouped_VC.drop(['Neighborhood', 'Venue', 'Venue Category'], axis=1)
tor_grouped_bk

Unnamed: 0,Neighborhood Latitude,Neighborhood Longitude,Venue Latitude,Venue Longitude
34,43.679557,-79.352188,43.676643,-79.356846
76,43.659526,-79.340923,43.660685,-79.342501
90,43.659526,-79.340923,43.659734,-79.346266
208,43.667967,-79.367675,43.667469,-79.369277
231,43.667967,-79.367675,43.664217,-79.368269
242,43.667967,-79.367675,43.666571,-79.36878
300,43.65426,-79.360636,43.653447,-79.362017
319,43.65426,-79.360636,43.650574,-79.359539
326,43.65426,-79.360636,43.650632,-79.358709
376,43.657162,-79.378937,43.654574,-79.38074


In [57]:
# set number of clusters
kclusters_bk = 5

tor_grouped_bk_clustering = tor_grouped_VC.drop(['Neighborhood', 'Venue', 'Venue Category'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters_bk, random_state=0).fit(tor_grouped_bk_clustering)

# check cluster labels generated for each row in the dataframe

labels_bk = kmeans.labels_
print(labels_bk[0:10])

In [58]:
# add clustering labels
tor_grouped_VC = tor_grouped_VC.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
tor_grouped_VC["Cluster Labels"] = labels_bk
tor_grouped_VC.head(5)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Cluster Labels
34,"The Danforth West, Riverdale",43.679557,-79.352188,Dough Bakeshop,43.676643,-79.356846,Bakery,1
76,Studio District,43.659526,-79.340923,Brick Street Breads,43.660685,-79.342501,Bakery,1
90,Studio District,43.659526,-79.340923,Bonjour Brioche,43.659734,-79.346266,Bakery,1
208,"Cabbagetown, St. James Town",43.667967,-79.367675,Absolute Bakery & Café,43.667469,-79.369277,Bakery,1
231,"Cabbagetown, St. James Town",43.667967,-79.367675,Daniel et Daniel Event Creation & Catering,43.664217,-79.368269,Bakery,1
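
Assigning a new column to a dataframe that was obtained by filtering another one can trigger pandas' SettingWithCopyWarning. A minimal sketch of the usual remedy, taking an explicit `.copy()` of the subset first, is shown below on made-up rows.

```python
import pandas as pd

# Made-up venues; only the bakeries are kept
venues = pd.DataFrame(
    {
        "Venue": ["Shop 1", "Shop 2", "Shop 3"],
        "Venue Category": ["Bakery", "Pub", "Bakery"],
    }
)

# .copy() makes the subset an independent dataframe
bakeries = venues[venues["Venue Category"] == "Bakery"].copy()
bakeries["Cluster Labels"] = [0, 1]  # safe: modifies only the copy

print(bakeries)
```

The original `venues` frame is left unchanged, and the subset keeps its original row index (0 and 2 here).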


#### Let's visualize the clusters of the bakeries in Toronto

In [59]:
# create map
map_clusters_bk = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters_bk)
ys = [i + x + (i*x)**2 for i in range(kclusters_bk)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tor_grouped_VC['Neighborhood Latitude'], tor_grouped_VC['Neighborhood Longitude'], tor_grouped_VC['Neighborhood'], tor_grouped_VC['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters_bk)
       
map_clusters_bk

### Clusters 0 and 2 appear to be the best locations. They need further investigation before the final bakery location is chosen.
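
One way to support this kind of conclusion with numbers is to count the bakeries per cluster and look for the least crowded ones. The sketch below uses invented labels for five hypothetical bakeries; it does not reproduce the notebook's actual counts.

```python
import pandas as pd

# Invented cluster labels for five hypothetical bakeries
bakeries = pd.DataFrame(
    {
        "Venue": ["Bakery 1", "Bakery 2", "Bakery 3", "Bakery 4", "Bakery 5"],
        "Cluster Labels": [1, 1, 1, 0, 2],
    }
)

# Number of bakeries in each cluster, ordered by cluster label
counts = bakeries["Cluster Labels"].value_counts().sort_index()

# Clusters with the fewest existing bakeries
least_crowded = counts[counts == counts.min()].index.tolist()

print(counts)
print(least_crowded)
```

A low count only signals less direct competition; demand, rent and foot traffic would still need to be checked separately.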