# IBM DATA SCIENCE PROFESSIONAL CERTIFICATE CAPSTONE PROJECT
### This is Navie Huynh's capstone project for IBM's Data Science Professional Certificate course

## Download packages 

In [1]:
!conda install beautifulsoup4 --yes
!conda install lxml --yes
!conda install html5lib --yes
!conda install -c conda-forge geopy --yes

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py36_0         156 KB
    beautifulsoup4-4.8.1       |           py36_0         153 KB
    soupsieve-1.9.5            |           py36_0          61 KB
    openssl-1.1.1d             |       h7b6447c_3         3.7 MB
    ------------------------------------------------------------
                                           Total:         4.1 MB

The following NEW packages will be INSTALLED:

    soupsieve:      1.9.5-py36_0                 

The following packages will be UPDATED:

    beautifulsoup4: 4.6.3-py37_0      

## Import packages

In [73]:
import pandas as pd
import numpy as np
import requests # retrieve web data
from bs4 import BeautifulSoup # parse HTML documents
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # create map visualizations
from pandas.io.json import json_normalize # transform JSON into a pandas dataframe
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors


## Webscraping with BeautifulSoup and Pandas

We will scrape Canadian postal code, borough, and neighborhood data from Wikipedia:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


First, we use the requests package to fetch the page and store the response in a Python object.

In [3]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

Next, we use *BeautifulSoup* to parse the HTML and pick out the desired Wikipedia table, which is the first table on the page.

In [4]:
soup = BeautifulSoup(res.content,'html.parser')
table = soup.find_all('table')[0]

## Data preparation

Create a pandas dataframe from the Wikipedia table above.  
Drop rows where *Borough* is 'Not assigned'.

In [5]:
df = pd.read_html(str(table),flavor='html5lib',header=0)[0]
df = df[df.Borough != 'Not assigned']
df

Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
...,...,...,...
281,M8Z,Etobicoke,Kingsway Park South West
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West


Filtering the rows keeps the original index from pd.read_html, so we call reset_index to renumber the rows of the filtered dataframe.

In [6]:
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
...,...,...,...
205,M8Z,Etobicoke,Kingsway Park South West
206,M8Z,Etobicoke,Mimico NW
207,M8Z,Etobicoke,The Queensway West
208,M8Z,Etobicoke,Royal York South West


Find all occurrences where Neighborhood is 'Not assigned', so that they can be given the Borough value.

In [7]:
df[df['Neighborhood'] == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighborhood
6,M9A,Queen's Park,Not assigned


Use the replace method to swap any 'Not assigned' value in Neighborhood for the value of 'Borough' in the same row.  
Set inplace to *True* so the change is applied to the same df object.

In [8]:
df.Neighborhood.replace('Not assigned',df['Borough'], inplace=True)
df[df['Neighborhood'] == 'Not assigned']
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
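
A more explicit way to do the same fill is a boolean mask with `.loc`, which avoids calling `replace` with `inplace=True` on a column accessor. The sketch below is only an illustration on a small made-up frame (`toy` is a hypothetical name, not the scraped data):

```python
import pandas as pd

# Toy frame standing in for the scraped table (values are illustrative only).
toy = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

# Rows whose neighborhood is missing.
mask = toy["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only.
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]

print(toy)
```

The same two lines (`mask = ...` and the `.loc` assignment) should work on `df` itself, assuming the column names shown above.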


### Adding Latitude and Longitude to Dataframe

Use geospatial data that maps each Toronto postal code to a latitude and longitude.

In [9]:
df_lat_long = pd.read_csv('https://cocl.us/Geospatial_data')
df_lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


**Here we will create a new dataframe with the neighborhood data from wikipedia and the latitude/longitude data from the dataframe above** 


In [10]:
df_merge = pd.merge(df,df_lat_long,left_on="Postcode",right_on="Postal Code")
df_merge.drop(['Postal Code'], axis=1, inplace=True)
df_merge.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,Lawrence Heights,43.718518,-79.464763
4,M6A,North York,Lawrence Manor,43.718518,-79.464763
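
Note that `pd.merge` performs an inner join by default, so any postcode without coordinates would be dropped silently. A minimal sketch of how one might check for that, using made-up data (`left`, `right`, and the postcode `M0X` are hypothetical):

```python
import pandas as pd

# Illustrative data: one postcode ("M0X") deliberately has no coordinates.
left = pd.DataFrame({"Postcode": ["M3A", "M4A", "M0X"]})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every row from `left`; indicator=True adds a `_merge`
# column that records whether a match was found in `right`.
checked = pd.merge(left, right, left_on="Postcode", right_on="Postal Code",
                   how="left", indicator=True)

# Postcodes that found no partner in the coordinates table.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would suggest that the inner join above lost nothing.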


## Exploratory Data Analysis

We will first look at the unique boroughs and number of entries in the dataframe

In [11]:
print(df_merge['Borough'].unique())

['North York' 'Downtown Toronto' "Queen's Park" 'Scarborough' 'East York'
 'Etobicoke' 'York' 'East Toronto' 'West Toronto' 'Central Toronto'
 'Mississauga']


In [40]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_merge['Borough'].unique()),
        df_merge.shape[0]
    )
)

The dataframe has 11 boroughs and 210 neighborhoods.


Limit the exploration to boroughs containing 'Toronto'

In [13]:
toronto_borough = ['Downtown Toronto', 'Central Toronto', 'West Toronto', 'East Toronto']
toronto_data=df_merge[df_merge['Borough'].isin(toronto_borough)]
toronto_data.reset_index(inplace=True)
toronto_data.head()

Unnamed: 0,index,Postcode,Borough,Neighborhood,Latitude,Longitude
0,2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,5,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
2,12,M5B,Downtown Toronto,Ryerson,43.657162,-79.378937
3,13,M5B,Downtown Toronto,Garden District,43.657162,-79.378937
4,26,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418


Let's see how many unique neighborhoods are in the Toronto boroughs.

In [43]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(toronto_data['Borough'].unique()),
        len(toronto_data['Neighborhood'].unique())
    )
)


### Exploring Toronto Neighborhoods

Use a Nominatim geocoder (with the user agent *toronto_explorer*) to get the latitude and longitude of Toronto.

In [14]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinate of Toronto are {}, {}'.format(latitude,longitude))

The geographical coordinate of Toronto are 43.653963, -79.387207


Create a map of Toronto with the neighborhoods marked on it.

In [15]:
#Initialize map centered in Toronto
map_toronto = folium.Map(location=[latitude,longitude], zoom_start=11)

#add markers to map
for lat, lng, label in zip(toronto_data['Latitude'],toronto_data['Longitude'],toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186aa',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

map_toronto

### Define Foursquare Credentials and Version
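
The credentials cell is not shown in this notebook. One possible way to define `CLIENT_ID`, `CLIENT_SECRET`, and `VERSION` without hard-coding secrets is to read them from environment variables. This is a sketch only: the variable names (`FOURSQUARE_CLIENT_ID` and so on), the helper name, and the default version string are assumptions, not values taken from the original project.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read Foursquare settings from a mapping (defaults to the process environment).

    The variable names below are hypothetical; use whatever names you export.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set")
    # The v2 API expects a date-like version string (YYYYMMDD);
    # this default is a placeholder, not the value used by the author.
    version = env.get("FOURSQUARE_VERSION", "20180605")
    return client_id, client_secret, version
```

With the variables exported, `CLIENT_ID, CLIENT_SECRET, VERSION = load_foursquare_credentials()` would provide the names used in the cells below.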

### Exploring the first neighborhood in dataframe

In [17]:
toronto_data.loc[0,'Neighborhood']

'Harbourfront'

Store the neighborhood name, latitude, and longitude for future use.

In [18]:
neighborhood_latitude = toronto_data.loc[0, 'Latitude']
neighborhood_longitude = toronto_data.loc[0,'Longitude']
neighborhood_name = toronto_data.loc[0,'Neighborhood']

print('Latitude and Longitude of {} are {}, {}'.format(neighborhood_name,neighborhood_latitude,neighborhood_longitude))

Latitude and Longitude of Harbourfront are 43.6542599, -79.3606359


We retrieve up to 100 venues in Harbourfront within a radius of 500 meters and store the response in *results*.

In [19]:
radius = 500
limit= 100
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    neighborhood_latitude,
    neighborhood_longitude,
    radius,
    limit
    )
results = requests.get(url).json()
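
As an alternative to formatting the query string by hand, the same URL can be assembled from a dictionary of parameters. The helper below is a sketch (`build_explore_url` is a hypothetical name, and the credential values in the example call are placeholders); it uses only the standard library and makes no network request.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build a venues/explore URL from named parameters."""
    base = "https://api.foursquare.com/v2/venues/explore"
    query = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll value unescaped, as in the cell above.
    return base + "?" + urlencode(query, safe=",")


# Placeholder credentials, Harbourfront coordinates from the output above.
print(build_explore_url("ID", "SECRET", "20180605", 43.65426, -79.360636))
```

Keeping the parameters in a dictionary makes it harder to swap two positional `{}` placeholders by accident.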



In [20]:
# Borrowed from the Foursquare lab: extracts the category name of a venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # fall back to the flattened column name produced by json_normalize
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Store the results in a dataframe.

In [21]:
venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues) #dataframe with json information
nearby_venues.columns

Index(['referralId', 'reasons.count', 'reasons.items', 'venue.id',
       'venue.name', 'venue.location.address', 'venue.location.crossStreet',
       'venue.location.lat', 'venue.location.lng',
       'venue.location.labeledLatLngs', 'venue.location.distance',
       'venue.location.postalCode', 'venue.location.cc', 'venue.location.city',
       'venue.location.state', 'venue.location.country',
       'venue.location.formattedAddress', 'venue.categories',
       'venue.photos.count', 'venue.photos.groups', 'venue.venuePage.id',
       'venue.location.neighborhood'],
      dtype='object')

That is far more columns than we need. Let's simplify the dataframe to contain only the venue name, category, latitude, and longitude.

In [22]:
#filter dataframe to name, categories, latitude, longitude
filtered_columns = ['venue.name','venue.categories','venue.location.lat','venue.location.lng']
nearby_venues=nearby_venues.loc[:,filtered_columns]

nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type,axis=1)
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Cooper Koo Family YMCA,Gym / Fitness Center,43.653191,-79.357947
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Impact Kitchen,Restaurant,43.656369,-79.35698


Create a function that applies the same steps to the other neighborhoods.

In [24]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    venues_list =[]
    
    for name, lat, lng in zip(names,latitudes,longitudes):
        print(name)
        
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            limit
            )
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)    
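
The list comprehension in `getNearbyVenues` indexes `v['venue']['categories'][0]`, which would raise an `IndexError` for a venue with an empty category list. A small guard, sketched here with made-up venue dicts (`first_category_name` and the default label are hypothetical), mirrors the empty-list check in `get_category_type`:

```python
def first_category_name(venue, default="Uncategorized"):
    """Return the first category name of a venue dict, or `default` if it has none."""
    categories = venue.get("categories") or []
    return categories[0]["name"] if categories else default


# Minimal dicts shaped like the v['venue'] objects used above (contents are made up).
with_category = {"name": "Tandem Coffee", "categories": [{"name": "Coffee Shop"}]}
without_category = {"name": "Unnamed venue", "categories": []}

print(first_category_name(with_category))
print(first_category_name(without_category))
```

Inside the function, `first_category_name(v['venue'])` could stand in for the direct index.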

In [25]:
toronto_venues = getNearbyVenues(
    names=toronto_data['Neighborhood'],
    latitudes=toronto_data['Latitude'],
    longitudes=toronto_data['Longitude']
    )

Harbourfront
Queen's Park
Ryerson
Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide
King
Richmond
Dovercourt Village
Dufferin
Harbourfront East
Toronto Islands
Union Station
Little Portugal
Trinity
The Danforth West
Riverdale
Design Exchange
Toronto Dominion Centre
Brockton
Exhibition Place
Parkdale Village
The Beaches West
India Bazaar
Commerce Court
Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North
Forest Hill West
High Park
The Junction South
North Toronto West
The Annex
North Midtown
Yorkville
Parkdale
Roncesvalles
Davisville
Harbord
University of Toronto
Runnymede
Swansea
Moore Park
Summerhill East
Chinatown
Grange Park
Kensington Market
Deer Park
Forest Hill SE
Rathnelly
South Hill
Summerhill West
CN Tower
Bathurst Quay
Island airport
Harbourfront West
King and Spadina
Railway Lands
South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown
St. James Town
First Canadian Place
Underground city

In [47]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653191,-79.357947,Gym / Fitness Center
3,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,Harbourfront,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


Let's see how many venues were returned in each neighborhood.

In [27]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Adelaide,100,100,100,100,100,100
Bathurst Quay,17,17,17,17,17,17
Berczy Park,55,55,55,55,55,55
Brockton,23,23,23,23,23,23
Business Reply Mail Processing Centre 969 Eastern,15,15,15,15,15,15
...,...,...,...,...,...,...
Underground city,100,100,100,100,100,100
Union Station,100,100,100,100,100,100
University of Toronto,36,36,36,36,36,36
Victoria Hotel,100,100,100,100,100,100


Determine how many unique venue categories there are.

In [29]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 240 uniques categories.


### Analyzing the venues within the Neighborhoods

In [49]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
toronto_onehot.shape

(3260, 241)

Group the rows by neighborhood, taking the mean frequency for each category

In [50]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.01
1,Bathurst Quay,0.0,0.058824,0.058824,0.058824,0.058824,0.176471,0.117647,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0
3,Brockton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
toronto_grouped.shape

(73, 241)
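
To see what the one-hot encoding and the grouped mean produce, here is a toy version with two neighborhoods and three categories. The data and the names `toy_venues`, `onehot`, and `grouped` are made up for illustration.

```python
import pandas as pd

# Six made-up venues in two made-up neighborhoods.
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Park", "Bar", "Park", "Park"],
})

# One indicator column per category (cast to float so the mean is a share).
onehot = pd.get_dummies(toy_venues["Venue Category"]).astype(float)

# Put the neighborhood name in front of the indicator columns.
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# The mean of each indicator column is the share of venues in that category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Neighborhood A has four venues, two of which are cafés, so its `Café` share is 0.5. Using `insert` here is a design choice: it raises an error if a category is itself called "Neighborhood", instead of overwriting that column.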

### Print the top 5 venue categories for each neighborhood

In [51]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide----
         venue  freq
0  Coffee Shop  0.08
1         Café  0.04
2   Steakhouse  0.04
3          Bar  0.04
4   Restaurant  0.03


----Bathurst Quay----
              venue  freq
0   Airport Service  0.18
1  Airport Terminal  0.12
2   Harbor / Marina  0.06
3           Airport  0.06
4          Boutique  0.06


----Berczy Park----
          venue  freq
0   Coffee Shop  0.09
1  Cocktail Bar  0.05
2   Cheese Shop  0.04
3        Bakery  0.04
4    Steakhouse  0.04


----Brockton----
            venue  freq
0            Café  0.13
1       Nightclub  0.09
2  Breakfast Spot  0.09
3     Coffee Shop  0.09
4   Burrito Place  0.04


----Business Reply Mail Processing Centre 969 Eastern----
           venue  freq
0    Pizza Place  0.07
1        Butcher  0.07
2  Garden Center  0.07
3        Brewery  0.07
4  Auto Workshop  0.07


----CN Tower----
              venue  freq
0   Airport Service  0.18
1  Airport Terminal  0.12
2   Harbor / Marina  0.06
3           Airport  0.06
4          Bo

Store the most common venue categories of each neighborhood in a dataframe.

In [52]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [59]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        # 1st, 2nd, 3rd use the suffixes in `indicators`
        columns.append('{}{} Most Common Venue'.format(ind+1,indicators[ind]))
    except IndexError:
        # everything from 4 onwards gets 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))
        
neighborhood_top_venues = pd.DataFrame(columns=columns)
neighborhood_top_venues['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhood_top_venues.iloc[ind,1:] = return_most_common_venues(toronto_grouped.iloc[ind,:], num_top_venues)
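
The try/except above builds the column names correctly for the ten venues used here, but it would label a 21st column as "21th". If more columns were ever needed, a small helper could handle the suffix rules in general. This is a sketch, and `ordinal` is a hypothetical name:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens all take 'th'
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Same column names as the cell above, built without exception handling.
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns[1], "|", columns[4], "|", columns[10])
```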
    

In [60]:
neighborhood_top_venues.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Adelaide,Coffee Shop,Café,Bar,Steakhouse,Restaurant,Burger Joint,Sushi Restaurant,Asian Restaurant,Hotel,Thai Restaurant
1,Bathurst Quay,Airport Service,Airport Terminal,Harbor / Marina,Bar,Coffee Shop,Rental Car Location,Sculpture Garden,Boutique,Boat or Ferry,Plane
2,Berczy Park,Coffee Shop,Cocktail Bar,Steakhouse,Beer Bar,Bakery,Cheese Shop,Farmers Market,Café,Seafood Restaurant,Japanese Restaurant
3,Brockton,Café,Coffee Shop,Breakfast Spot,Nightclub,Intersection,Bar,Italian Restaurant,Stadium,Climbing Gym,Furniture / Home Store
4,Business Reply Mail Processing Centre 969 Eastern,Pizza Place,Auto Workshop,Skate Park,Park,Light Rail Station,Farmers Market,Fast Food Restaurant,Burrito Place,Butcher,Restaurant


## Machine Learning Model

We cluster the neighborhoods with the K-means algorithm, using k=5.

In [63]:
k = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

kmeans= KMeans(n_clusters=k,random_state=2).fit(toronto_grouped_clustering)
kmeans.labels_[0:10]

array([1, 4, 1, 1, 1, 4, 1, 1, 1, 1], dtype=int32)
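
A quick way to judge how balanced the clusters are is to count the members of each label. The sketch below uses only the ten labels printed above, so it says nothing about the remaining neighborhoods; on the full result one would pass `kmeans.labels_` instead.

```python
import numpy as np

# First ten labels, copied from the output above.
labels = np.array([1, 4, 1, 1, 1, 4, 1, 1, 1, 1])
k = 5

# bincount counts how often each integer label occurs;
# minlength=k keeps a slot for clusters that do not appear.
sizes = np.bincount(labels, minlength=k)

for cluster, size in enumerate(sizes):
    print("cluster {}: {} neighborhoods".format(cluster, size))
```

A count that is heavily concentrated in one label is a hint that a different k, or different features, may be worth trying.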

Combine the cluster labels and the top 10 venues into the *toronto_merged* dataframe.

In [64]:
neighborhood_top_venues.insert(0,'Cluster labels', kmeans.labels_)
toronto_merged = toronto_data

toronto_merged = toronto_merged.join(neighborhood_top_venues.set_index('Neighborhood'), on='Neighborhood')

In [67]:
toronto_merged.drop('index',inplace=True,axis=1)

In [68]:
toronto_merged.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,1,Coffee Shop,Park,Bakery,Pub,Restaurant,Café,Mexican Restaurant,Breakfast Spot,Yoga Studio,Chocolate Shop
1,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494,1,Coffee Shop,Park,Gym,Sushi Restaurant,Yoga Studio,Smoothie Shop,Burger Joint,Sandwich Place,Burrito Place,Café
2,M5B,Downtown Toronto,Ryerson,43.657162,-79.378937,1,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Bakery,Middle Eastern Restaurant,Pizza Place,Bubble Tea Shop,Plaza,Sporting Goods Shop
3,M5B,Downtown Toronto,Garden District,43.657162,-79.378937,1,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Bakery,Middle Eastern Restaurant,Pizza Place,Bubble Tea Shop,Plaza,Sporting Goods Shop
4,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Café,Restaurant,Bakery,Italian Restaurant,Breakfast Spot,Clothing Store,Diner,Cosmetics Shop,Beer Bar


Visualize the clusters on a map.

In [74]:
map_clusters = folium.Map(location=[latitude,longitude], zoom_start=11)

x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0,1,len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'],toronto_merged['Longitude'],toronto_merged['Neighborhood'],toronto_merged['Cluster labels']):
    label = folium.Popup(str(poi)+' Cluster '+str(cluster), parse_html=True )
    folium.CircleMarker(
        [lat,lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
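
Two details of the colour lookup in the loop above are worth noting. `rainbow[cluster-1]` sends label 0 to `rainbow[-1]`, the last colour, so each cluster still gets its own colour, only shifted by one. And because `join` is a left join by default, a neighborhood with no row in `neighborhood_top_venues` would carry a NaN label, which cannot be used as a list index. A defensive lookup could look like the sketch below; `color_for_cluster`, the fallback grey, and the hex values in `palette` are all illustrative assumptions.

```python
import math


def color_for_cluster(cluster, palette, fallback="#808080"):
    """Pick a palette colour for a cluster label; use `fallback` for missing labels."""
    if cluster is None or (isinstance(cluster, float) and math.isnan(cluster)):
        return fallback
    # Index directly by label, so cluster 0 gets the first colour.
    return palette[int(cluster)]


# Illustrative hex colours, one per cluster (not the values produced by cm.rainbow).
palette = ["#8000ff", "#00b5eb", "#80ffb4", "#ffb360", "#ff0000"]

print(color_for_cluster(0, palette))
print(color_for_cluster(float("nan"), palette))
```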
    

In [75]:
map_clusters

This does not look like a good clustering: in the labels and rows shown above, most neighborhoods land in cluster 1.