# Applied Data Science Capstone
## Capstone notebook project
### Week 03 - Assignment 01:
#### Segmenting and Clustering Neighborhoods in Toronto

<br><br><br><br><br><br>

---

## Gather the data

Let's start by importing some basic libs

In [1]:
import numpy as np
import pandas as pd

We need to install and import the beautifulsoup4 lib for web scraping:

In [179]:
# Uncomment the line if you don't have the lib already installed:
# With pip the package name is beautifulsoup4 (python3-bs4 is the Debian/Ubuntu apt package name)
#!conda install -c conda-forge beautifulsoup4 lxml --yes

In [180]:
from bs4 import BeautifulSoup as bs
import requests

I have created a small helper to render HTML code, or to display the HTML source.  
It is called **print_html**

In [182]:
from IPython.display import display, HTML  # public import path; IPython.core.display is deprecated for this
def print_html(html_source, view_source=False, nlines_max=30):  # NOTE: we could have used <xmp> tag instead of <textarea> but it is deprecated in some browsers
    html_source = str(html_source); nlines = html_source.count('\n')+1;
    textarea_style = 'width:99.5%;font-size:0.8em;font-family:monospace;line-height:1em;white-space:nowrap;border:none;background-color:rgb(245,245,245)'
    display(HTML( '<textarea rows="'+str(nlines_max if nlines>nlines_max else nlines)+'" style="'+textarea_style+'">'+html_source.replace('</textarea>','&lt;/textarea&gt;')+'</textarea>' if view_source else html_source ))
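
The `</textarea>` replacement above guards against one specific tag. A slightly more general sketch, assuming we only need a string that is safe to place inside a `<textarea>`, is to escape the whole source with the standard library's `html.escape`. The helper name `source_to_textarea` is hypothetical and not part of this notebook:

```python
import html

def source_to_textarea(html_source, nlines_max=30):
    """Return a <textarea> element that shows html_source as plain text."""
    text = str(html_source)
    nlines = text.count('\n') + 1
    rows = min(nlines, nlines_max)
    # html.escape converts &, < and > (and quotes), so no tag inside
    # the source can close the textarea early
    return f'<textarea rows="{rows}">{html.escape(text)}</textarea>'

snippet = source_to_textarea('<p>Hi</p>\n</textarea><b>x</b>')
print(snippet)
```

The returned string could then be passed to `display(HTML(...))` in the same way `print_html` does.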

Now let's import Postal Codes of Canada from a Wikipedia page:

In [183]:
html_source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
html        = bs(html_source, 'lxml')
print_html(html.prettify(), view_source=True)

Let's locate the table we need and gather the data from it:

In [184]:
# Let's look for this node:
#   <th>Postcode</th>
# ... and then from it, we will look for its nearest <table> ancestor.
table = html.find('th', string='Postcode').find_parent('table')
print_html(table.prettify(), view_source=True)

Now that we have located the `<table>` node, we need to parse it and load its rows into a DataFrame:

In [7]:
print('Reading table from html into DataFrame...')
rows = []  # Collect plain dicts and build the DataFrame once (DataFrame.append was removed in pandas 2.0)
trs = table.find_all('tr')
max_ntrs_to_display = 4
for irow, tr in enumerate(trs):
    if   (irow<max_ntrs_to_display):  print(f'  [{irow}]')
    elif (irow==max_ntrs_to_display): print( '  ... etc ...')
    skip_row = False
    row = {'PostalCode':None, 'Borough':None, 'Neighborhood':None}
    tds = tr.find_all('td')
    for icol, td in enumerate(tds):
        if skip_row: continue
        col_text = td.get_text().strip()
        if (irow<max_ntrs_to_display): print(f'    [{icol}] - {col_text}')
        if (icol==0):    # PostalCode
            row['PostalCode'] = col_text
        elif (icol==1):  # Borough
            if (col_text.lower()=='not assigned'): skip_row = True  # Rows without a borough are dropped
            row['Borough'] = col_text
        elif (icol==2):  # Neighborhood
            if (col_text.lower()=='not assigned'): col_text = row['Borough']  # Fall back to the borough name
            row['Neighborhood'] = col_text
    # Keep only complete rows (the header row has no <td> cells, so all its values stay None)
    if (not skip_row) and all(value is not None for value in row.values()):
        rows.append(row)

df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])

print(f'\nTable shape: {df.shape}\n')
display(df.head(10))

Reading table from html into DataFrame...
  [0]
  [1]
    [0] - M1A
    [1] - Not assigned
  [2]
    [0] - M2A
    [1] - Not assigned
  [3]
    [0] - M3A
    [1] - North York
    [2] - Parkwoods
  ... etc ...

Table shape: (211, 3)



Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
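
The filtering rules used above can be tried offline on a small inline table. The sketch below is illustrative only: the HTML is made up (it is not the Wikipedia page), `html.parser` is used so that `lxml` is not required, and the rows are collected in a plain list so the DataFrame is built once at the end:

```python
import pandas as pd
from bs4 import BeautifulSoup

sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

rows = []
sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
for tr in sample_table.find_all('tr'):
    cells = [td.get_text().strip() for td in tr.find_all('td')]
    if len(cells) != 3:
        continue  # header row (only <th> cells) or malformed row
    postal_code, borough, neighborhood = cells
    if borough.lower() == 'not assigned':
        continue  # skip postal codes without a borough
    if neighborhood.lower() == 'not assigned':
        neighborhood = borough  # fall back to the borough name
    rows.append({'PostalCode': postal_code, 'Borough': borough, 'Neighborhood': neighborhood})

df_sample = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(df_sample)
```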


Now we need to find all rows with the same Postal Code and concatenate their Neighborhood values into a string separated with commas.  
We can do this task using 2 different approaches.  
FIRST approach: **groupby+apply**

In [8]:
dfg = df.groupby(['PostalCode'], as_index=False) # ,'Borough'
igroup = 0 - 1
def process_df_group(df_group):
    # IMPORTANT: older pandas versions call func twice on the first group, so func must not modify its input
    # (otherwise the first group ends up with duplicated values such as "Rouge,Malvern,Malvern")
    global igroup; igroup += 1
    #if (igroup<4):  display( df_group.head() )
    sr_group_1st_row = df_group.iloc[0, :].copy()  # Copy the 1st row of the group so the assignment below does not alter the group
    if (df_group.shape[0]<=1):  return sr_group_1st_row
    sr_group_1st_row['Neighborhood'] = df_group['Neighborhood'].str.cat(sep=',')  # Same as:  ','.join(df_group['Neighborhood'])
    return sr_group_1st_row  # We can return a series or a dataframe

df2 = dfg.apply(process_df_group)
#df2 = df2.reset_index(drop=True)  # No need since we specified as_index=False while grouping 
df2.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


SECOND approach: **groupby+agg**

In [9]:
def keep_first_in_group(sr_group): return sr_group.iloc[0]
df3 = dfg.agg({'Borough':keep_first_in_group, 'Neighborhood': lambda x:','.join(x)})  # No need to include the columns we groupedby since they are already included and cannot be modified
df3.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [10]:
print( 'Shape of final df using groupby+apply method:', df2.shape )
print( 'Shape of final df using groupby+agg   method:', df3.shape )

Shape of final df using groupby+apply method: (103, 3)
Shape of final df using groupby+agg   method: (103, 3)
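
As a compact variant of the second approach, pandas also accepts the built-in aggregation name `'first'` in place of a custom function. The rows below are copied from the scraped table shown earlier; the names `df_demo` and `df_grouped` are used only for this illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'PostalCode':   ['M5A', 'M5A', 'M6A', 'M6A', 'M7A'],
    'Borough':      ['Downtown Toronto', 'Downtown Toronto', 'North York', 'North York', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights', 'Lawrence Manor', "Queen's Park"],
})

# 'first' keeps the first borough of each group; ','.join concatenates the neighborhoods
df_grouped = (
    df_demo.groupby('PostalCode', as_index=False)
           .agg({'Borough': 'first', 'Neighborhood': ','.join})
)
print(df_grouped)
```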


<br><br><br><br><br><br>

---

## Geolocate the data

Let's install the geolocation lib:

In [12]:
# Let's install a geolocation API.
# We can use 'geocoder' (uncomment the following line if you don't have the lib already installed)
#!conda install -c conda-forge geocoder --yes

## NOTE:
## We could also use the 'geopy' API
## !conda install -c conda-forge geopy --yes
## from geopy.geocoders import Nominatim
## geolocator = Nominatim(user_agent="ny_explorer")
## location = geolocator.geocode(address) # location.latitude, location.longitude


In [13]:
import geocoder

Now we can try the API and check if we are getting correct values from it when geolocating our neighborhoods...

In [14]:
# Let's try the API
# loop until you get the coordinates
import time
coords = None
i = 0
while(coords is None):
    i += 1;
    if (i>=5): break
    g = geocoder.google('Madrid, Spain')
    coords = g.latlng
    print(coords)
    if (coords is None): time.sleep(2);

print('Retrieved coords:', coords)

None
None
None
None
Retrieved coords: None


<br>

As we can see, we are **NOT** getting any values from the geocoder lib.  
So we need to try a different approach.  
Let's use a CSV table that details all coordinates of the different Postal Codes in Canada:

In [187]:
# Since the geocoder API doesn't seem to be working, let's retrieve the coordinates from the csv file
df_coords = pd.read_csv('https://cocl.us/Geospatial_data', delimiter=",")    
# In older versions of Pandas we had to import io, and do this:
# my_data = pd.read_csv(io.StringIO(requests.get('https://cocl.us/Geospatial_data').content.decode('utf-8')), delimiter=",")    
df_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now that the CSV is loaded into a DataFrame, we can rename its postal code column and perform a **LEFT JOIN** merge with the neighborhoods DataFrame:  
(NOTE: we merge on the **PostalCode** column, which is common to both tables)

In [189]:
df_coords.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df4 = pd.merge(df3, df_coords, on='PostalCode', how='left')
df4.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
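
A left join keeps every neighborhood row even when no coordinates match, leaving `NaN` in `Latitude` and `Longitude`. One way to check for that, sketched here on made-up rows (the `M9Z` code and the `Nowhere` borough are invented to force a miss), is the `indicator=True` option of `pd.merge`:

```python
import pandas as pd

left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Nowhere']})
right = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

merged = pd.merge(left, right, on='PostalCode', how='left', indicator=True)
# '_merge' is 'both' when coordinates were found and 'left_only' when they were not
missing = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```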


In [191]:
print( 'Shape of merged df (neighborhoods + coordinates):', df4.shape )

Shape of merged df (neighborhoods + coordinates): (103, 5)


<br><br><br><br><br><br>

---

## Display data on a map

We are going to use **Folium** for this task.
Let's import the lib:

In [192]:
# Uncomment the following line if you don't have the lib already installed
#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
from branca.element import Figure   # See: https://github.com/python-visualization/folium/blob/master/examples/WidthHeight.ipynb

I have created a small helper function that displays on a map all the items in the passed dataframe (df):

In [193]:
def display_map(df, location=[43.714915, -79.343565], zoom_start=10):
    fig = Figure(width=1000, height=300)
    map = folium.Map(location=location, zoom_start=zoom_start, width='100%',height='100%')
    for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
        folium.CircleMarker(
            [lat, lng],
            radius=6,
            color='blue',
            weight=1,
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            popup=folium.Popup(f'{neighborhood}, {borough}', parse_html=True),
            parse_html=False).add_to(map)  
    fig.add_child(map)
    display(fig)

In [194]:
display_map(df4)

Now let's create a subset of our dataset in order to reduce the data to work with.  
Our subset will ONLY have the postal codes in **Toronto**:

In [195]:
# Keep only the records whose 'Borough' column contains the word 'Toronto'
# Series.str.contains treats the pattern as a regular expression by default; (?i) makes the match case-insensitive
df_neighborhoods = df4[ df4['Borough'].str.contains('(?i)toronto') ]
print( df_neighborhoods.shape )
display( df_neighborhoods.head() )

(38, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
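
The `(?i)` prefix is an inline regex flag. An equivalent spelling, shown here on a few borough names taken from the tables above, is the `case=False` argument of `Series.str.contains`:

```python
import pandas as pd

boroughs = pd.Series(['East Toronto', 'North York', 'Downtown Toronto', 'Scarborough'])

mask_flag = boroughs.str.contains('(?i)toronto')          # inline regex flag
mask_arg  = boroughs.str.contains('toronto', case=False)  # keyword argument

print(mask_arg.tolist())
```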


In [196]:
display_map(df_neighborhoods, location=[43.667814, -79.393083], zoom_start=11)

<br><br><br><br><br><br>

---

## Use FourSquare to get top venues in each neighborhood:

Enter your FourSquare credentials below:

In [197]:
CLIENT_ID = 'WWWWWWWWWWWWWWWW' # your Foursquare ID
CLIENT_SECRET = 'WWWWWWWWWWWWWWWW' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: WWWWWWWWWWWWWWWW
CLIENT_SECRET: WWWWWWWWWWWWWWWW


<br>

I have created a helper function to go through a list of locations and gather FourSquare top venues around them.  
The helper function returns a dataframe.

In [198]:
def get_top_venues_in_locations(names, latitudes, longitudes, radius=500, limit=100):
    
    ls_neighborhoods = []
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        # ls_neighborhoods is a list (one entry per neighborhood) of lists (venues) of tuples (venue props)
        ls_neighborhoods.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # We need to flatten the list (neighborhoods) of lists (venues) of tuples (venue props) into a single list of venues.
    # To do so we use a nested comprehension, which reads left to right:
    #   [tp_venue    for ls_venues_in_neighborhood in ls_neighborhoods  for tp_venue in ls_venues_in_neighborhood]
    #    <--|--->    <------------------ 1st loop ------------------->  <---------------- 2nd loop ------------->
    #       This is the item from the innermost loop (2nd loop in our case) to keep
    top_venues = pd.DataFrame([tp_venue    for ls_venues_in_neighborhood in ls_neighborhoods  for tp_venue in ls_venues_in_neighborhood])
    top_venues.columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
                          'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    return(top_venues)
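
The list comprehension in `get_top_venues_in_locations` assumes every venue has at least one category; otherwise `v['venue']['categories'][0]` raises `IndexError`. The sketch below separates the parsing step so it can be tried without network access. The function name `parse_explore_items` and the sample payload are hypothetical; only the key layout (`response`, `groups`, `items`, `venue`) is taken from the code above:

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one 'explore' response into a list of venue tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None instead of failing when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Invented payload that mimics the structure accessed by the helper above
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.0, 'lng': -79.0},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Spot B', 'location': {'lat': 43.1, 'lng': -79.1},
               'categories': []}},
]}]}}

venue_rows = parse_explore_items(sample_payload, 'Demo Hood', 43.05, -79.05)
print(venue_rows)
```

An empty or unexpected response yields an empty list here rather than a `KeyError`, which may be preferable when looping over many neighborhoods.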

In [199]:
print('Getting top venues in each Toronto neighborhood...')
df_top_venues = get_top_venues_in_locations(names=df_neighborhoods['Neighborhood'], latitudes=df_neighborhoods['Latitude'], longitudes=df_neighborhoods['Longitude'])
print( df_top_venues.shape )
display( df_top_venues.head() )

Getting top venues in each Toronto neighborhood...
(1710, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
1,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
2,The Beaches,43.676357,-79.293031,Starbucks,43.678798,-79.298045,Coffee Shop
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West,Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


In [200]:
# Let's check how many venues were returned for each neighborhood
df_top_venues.groupby('Neighborhood').count().head()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide,King,Richmond",100,100,100,100,100,100
Berczy Park,55,55,55,55,55,55
"Brockton,Exhibition Place,Parkdale Village",20,20,20,20,20,20
Business Reply Mail Processing Centre 969 Eastern,17,17,17,17,17,17
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",13,13,13,13,13,13


In [201]:
# Let's find out how many unique categories can be curated from all the returned venues
print(f"There are {len(df_top_venues['Venue Category'].unique())} unique categories.")

There are 240 unique categories.


<br><br><br><br><br><br>

---

## Get most frequent categories for the top venues in each neighborhood:

We will create a flag-table with all the categories in the **df_flagged_categories_of_top_venues** table:

In [202]:
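# One-hot encoding of the venue categories
df_flagged_categories_of_top_venues = pd.get_dummies(df_top_venues[['Venue Category']], prefix="", prefix_sep="")
# 'Neighborhood' is also one of the venue categories (e.g. the venue 'Upper Beaches' above), so get_dummies
# creates a flag column with that name. Drop it so it does not clash with the neighborhood-name column added below.
df_flagged_categories_of_top_venues.drop(['Neighborhood'], axis=1, inplace=True)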
columns = list(df_flagged_categories_of_top_venues.columns)  # print( str(columns) )  # To print FULL list without truncation
# add neighborhood column back to dataframe
df_flagged_categories_of_top_venues['Neighborhood'] = df_top_venues['Neighborhood'] 
# move neighborhood column to the first column
columns = ['Neighborhood'] + columns
df_flagged_categories_of_top_venues = df_flagged_categories_of_top_venues[columns]

df_flagged_categories_of_top_venues.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West,Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<br>

Now we need to calculate how frequent each category is.  
We need to make use of groupby for this task.

In [203]:
# Group rows by neighborhood and take the mean of each flag column, i.e. the frequency of occurrence of each category
df_category_frequencies_of_top_venues_per_neighborhood = df_flagged_categories_of_top_venues.groupby('Neighborhood').mean().reset_index()
df_category_frequencies_of_top_venues_per_neighborhood.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,"Adelaide,King,Richmond",0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0
2,"Brockton,Exhibition Place,Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.0,0.0,0.076923,0.076923,0.076923,0.153846,0.153846,0.153846,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
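
To see why the mean of the one-hot columns is a frequency, consider a toy example with invented venues. Each flag column is 1 for venues of that category and 0 otherwise, so its mean within a neighborhood is the share of venues in that category. `pd.crosstab(..., normalize='index')` is shown as a cross-check; it is an alternative, not what this notebook uses:

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Pub', 'Park', 'Park'],
})

# One flag column per category, then the per-neighborhood mean of each flag
flags = pd.get_dummies(venues['Venue Category'], dtype=float)
flags.insert(0, 'Neighborhood', venues['Neighborhood'])
freq = flags.groupby('Neighborhood').mean()

# Cross-check: row-normalized contingency table
check = pd.crosstab(venues['Neighborhood'], venues['Venue Category'], normalize='index')
print(freq)
```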


<br>

Now that we have a table that tells us how frequent each category is in each neighborhood, let's examine it.  
Let's display the **5 most frequent categories** in each neighborhood:

In [120]:
# Let's print each neighborhood along with the top 5 most common categories
df = df_category_frequencies_of_top_venues_per_neighborhood
num_top_venues = 5  # In reality we mean "categories", not venues (or we can read it as "venue TYPE")
ihood = 0 - 1
for hood in df['Neighborhood']:
    ihood += 1
    if (ihood>=3): break
    print("----"+hood+"----")
    temp = df[df['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
print('... etc, etc')

----Adelaide,King,Richmond----
             venue  freq
0      Coffee Shop  0.06
1             Café  0.04
2  Thai Restaurant  0.04
3       Steakhouse  0.04
4              Bar  0.04


----Berczy Park----
            venue  freq
0     Coffee Shop  0.07
1    Cocktail Bar  0.05
2      Restaurant  0.04
3      Steakhouse  0.04
4  Farmers Market  0.04


----Brockton,Exhibition Place,Parkdale Village----
            venue  freq
0  Breakfast Spot  0.10
1            Café  0.10
2     Coffee Shop  0.10
3   Grocery Store  0.05
4   Burrito Place  0.05


... etc, etc


<br><br>

Ok, finally we need to create a table with **ONLY** the **10 most frequent categories** in each neighborhood:

In [204]:
def return_most_common_categories(row, num_top_categories):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_categories]


df1 = df_category_frequencies_of_top_venues_per_neighborhood
df2 = None    
num_top_categories = 10
indicators = ['st', 'nd', 'rd']

# create columns according to the number of top categories
columns = ['Neighborhood']
for ind in np.arange(num_top_categories):
    try:    columns.append('{}{} Most Common Category'.format(ind+1, indicators[ind]))
    except IndexError: columns.append('{}th Most Common Category'.format(ind+1))  # 4th and beyond

# create a new dataframe
df2 = pd.DataFrame(columns=columns)
df2['Neighborhood'] = df1['Neighborhood']

for ind in np.arange(df1.shape[0]):
    df2.iloc[ind, 1:] = return_most_common_categories(df1.iloc[ind, :], num_top_categories)

df_top_categories_of_top_venues_per_neighborhood = df2
df_top_categories_of_top_venues_per_neighborhood.head()

Unnamed: 0,Neighborhood,1st Most Common Category,2nd Most Common Category,3rd Most Common Category,4th Most Common Category,5th Most Common Category,6th Most Common Category,7th Most Common Category,8th Most Common Category,9th Most Common Category,10th Most Common Category
0,"Adelaide,King,Richmond",Coffee Shop,Café,Thai Restaurant,Steakhouse,Bar,Sushi Restaurant,Gym,American Restaurant,Bakery,Burger Joint
1,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Seafood Restaurant,Restaurant,Farmers Market,Pub,Café,Cheese Shop,Steakhouse
2,"Brockton,Exhibition Place,Parkdale Village",Breakfast Spot,Café,Coffee Shop,Convenience Store,Climbing Gym,Burrito Place,Stadium,Bar,Restaurant,Caribbean Restaurant
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Pizza Place,Auto Workshop,Comic Shop,Recording Studio,Restaurant,Burrito Place,Brewery,Skate Park,Smoke Shop
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Airport Terminal,Airport Service,Airport Lounge,Plane,Airport Gate,Sculpture Garden,Harbor / Marina,Airport Food Court,Airport,Boat or Ferry
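
The sorting step inside `return_most_common_categories` can also be expressed with `Series.nlargest`. A minimal sketch on an invented frequency row (the values are not from the Foursquare data):

```python
import pandas as pd

row = pd.Series({'Neighborhood': 'Demo', 'Café': 0.10, 'Coffee Shop': 0.30,
                 'Park': 0.05, 'Pub': 0.20})

# Drop the label, convert to float, and keep the 3 largest frequencies
top3 = row.drop('Neighborhood').astype(float).nlargest(3).index.tolist()
print(top3)
```

When several categories share the same frequency, the order among them may differ from the one produced by `sort_values`, so the two approaches are not guaranteed to give identical rankings on tied values.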


<br><br><br><br><br><br>

---

## Group the neighborhoods in clusters
Neighborhoods are grouped by how frequently each venue category occurs in them; the 10 most frequent categories are then used to describe each cluster.

We are going to use **K-Means** model for this task.  
Let's import the lib:

In [123]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

Ok, time to run K-Means. Our X matrix is the **df_category_frequencies_of_top_venues_per_neighborhood** but without the `Neighborhood` column since we don't want that column to be a feature.

In [206]:
# Run k-means clustering
X = df_category_frequencies_of_top_venues_per_neighborhood.drop(columns='Neighborhood')  # a positional axis argument is no longer accepted by recent pandas
mkme = KMeans(n_clusters=5, random_state=0).fit(X)
if ('Cluster Labels' in df_top_categories_of_top_venues_per_neighborhood.columns):
    df_top_categories_of_top_venues_per_neighborhood['Cluster Labels'] = mkme.labels_
else:
    df_top_categories_of_top_venues_per_neighborhood.insert(0, 'Cluster Labels', mkme.labels_)
df_top_categories_of_top_venues_per_neighborhood_with_coords = df_neighborhoods.join(df_top_categories_of_top_venues_per_neighborhood.set_index('Neighborhood'), on='Neighborhood')
df_top_categories_of_top_venues_per_neighborhood_with_coords.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Category,2nd Most Common Category,3rd Most Common Category,4th Most Common Category,5th Most Common Category,6th Most Common Category,7th Most Common Category,8th Most Common Category,9th Most Common Category,10th Most Common Category
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Health Food Store,Coffee Shop,Pub,Dim Sum Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,1,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Yoga Studio,Bookstore,Brewery,Bubble Tea Shop,Restaurant,Café
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,1,Sandwich Place,Pet Store,Brewery,Italian Restaurant,Food & Drink Shop,Steakhouse,Fish & Chips Shop,Fast Food Restaurant,Light Rail Station,Burger Joint
43,M4M,East Toronto,Studio District,43.659526,-79.340923,1,Café,Coffee Shop,Gastropub,Bakery,Italian Restaurant,American Restaurant,Yoga Studio,Bookstore,Brewery,Seafood Restaurant
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,4,Park,Swim School,Bus Line,Yoga Studio,Diner,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store
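
To make the clustering step concrete, here is a toy run on four invented neighborhoods with three category frequencies each. `n_init=10` is set explicitly because its default changed between scikit-learn versions; the data and variable names are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows: neighborhoods, columns: frequency of (Café, Park, Airport)
X_demo = np.array([
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.05, 0.05, 0.90],
    [0.10, 0.00, 0.90],
])

km_demo = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X_demo)
labels = km_demo.labels_
print(labels)
```

The label numbers themselves are arbitrary; what matters is which rows share a label.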


<br><br><br>

Let's display the predicted clusters on a map:

In [207]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [208]:
df = df_top_categories_of_top_venues_per_neighborhood_with_coords

# set color scheme for the clusters
x = np.arange(mkme.n_clusters)
ys = [i + x + (i*x)**2 for i in range(mkme.n_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

fig = Figure(width=1000, height=300)
map = folium.Map(location=[43.667814, -79.393083], zoom_start=11, width='100%',height='100%')
# 'borough' must be part of the loop variables because the popup text below uses it
for lat, lng, borough, neighborhood, cluster in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood'], df['Cluster Labels']):
    folium.CircleMarker(
        [lat, lng],
        radius=6,
        color='black',
        weight=1,
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7,
        popup=folium.Popup(f'{neighborhood}, {borough}', parse_html=True),
        parse_html=False).add_to(map)  
fig.add_child(map)
display(fig)
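
Only the length of `ys` is used above, so the palette depends just on the number of clusters. A shorter sketch that builds the same kind of palette, assuming `matplotlib` is installed and with `n_clusters_demo` standing in for `mkme.n_clusters`:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

n_clusters_demo = 5
# Sample the rainbow colormap at evenly spaced points and convert each RGBA value to a hex string
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, n_clusters_demo))]
print(palette)
```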

<br><br><br><br>

Finally, we can display the neighborhoods in each cluster to see which categories led K-Means to group them together:

In [176]:
def display_cluster_top_categories_of_top_venues_per_neighborhood(n_cluster=0):
    df = df_top_categories_of_top_venues_per_neighborhood_with_coords
    rows_boolean_mask = (df['Cluster Labels']==n_cluster)
    cols_index_mask   = [1]+list(range(5, df.shape[1]))  # Flat list of column positions (a nested list is not a valid indexer)
    cols_names_mask   = df.columns[cols_index_mask]
    # Remember the current columns indexes are: [0:'PostalCode', 1:'Borough', 2:'Neighborhood',
    #                                            3:'Latitude', 4:'Longitude', 5:'Cluster Labels',
    #                                            6:'1st Most Common Category', 7:'2nd Most Common Category'...
    # We want: [1:'Borough', 5:'Cluster Labels', 6:'1st Most Common Category', ...]
    df = df.loc[rows_boolean_mask, cols_names_mask]
    print(df.shape)
    display(df)


In [178]:
for n_cluster in range(0,mkme.n_clusters):
    print_html(f'<span style="font-size:2em;font-weight:bold">CLUSTER:<span style="margin-left:0.3em;font-size:1em;font-weight:normal">[{n_cluster}]</span></span>')
    display_cluster_top_categories_of_top_venues_per_neighborhood(n_cluster)
    print('\n')

(2, 12)


Unnamed: 0,Borough,Cluster Labels,1st Most Common Category,2nd Most Common Category,3rd Most Common Category,4th Most Common Category,5th Most Common Category,6th Most Common Category,7th Most Common Category,8th Most Common Category,9th Most Common Category,10th Most Common Category
37,East Toronto,0,Health Food Store,Coffee Shop,Pub,Dim Sum Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
49,Central Toronto,0,Coffee Shop,Pub,Pizza Place,American Restaurant,Light Rail Station,Medical Center,Sports Bar,Bagel Shop,Supermarket,Sushi Restaurant






(32, 12)


Unnamed: 0,Borough,Cluster Labels,1st Most Common Category,2nd Most Common Category,3rd Most Common Category,4th Most Common Category,5th Most Common Category,6th Most Common Category,7th Most Common Category,8th Most Common Category,9th Most Common Category,10th Most Common Category
41,East Toronto,1,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Yoga Studio,Bookstore,Brewery,Bubble Tea Shop,Restaurant,Café
42,East Toronto,1,Sandwich Place,Pet Store,Brewery,Italian Restaurant,Food & Drink Shop,Steakhouse,Fish & Chips Shop,Fast Food Restaurant,Light Rail Station,Burger Joint
43,East Toronto,1,Café,Coffee Shop,Gastropub,Bakery,Italian Restaurant,American Restaurant,Yoga Studio,Bookstore,Brewery,Seafood Restaurant
45,Central Toronto,1,Park,Clothing Store,Asian Restaurant,Sandwich Place,Dance Studio,Food & Drink Shop,Burger Joint,Hotel,Breakfast Spot,Pizza Place
46,Central Toronto,1,Sporting Goods Shop,Coffee Shop,Yoga Studio,Rental Car Location,Salon / Barbershop,Sandwich Place,Mexican Restaurant,Metro Station,Chinese Restaurant,Fast Food Restaurant
47,Central Toronto,1,Sandwich Place,Pizza Place,Dessert Shop,Sushi Restaurant,Coffee Shop,Italian Restaurant,Café,Restaurant,Indian Restaurant,Flower Shop
48,Central Toronto,1,Restaurant,Gym,Playground,Tennis Court,Donut Shop,Diner,Discount Store,Dog Run,Doner Restaurant,Eastern European Restaurant
51,Downtown Toronto,1,Coffee Shop,Restaurant,Café,Pizza Place,Italian Restaurant,Pub,Bakery,Pharmacy,Sandwich Place,Japanese Restaurant
52,Downtown Toronto,1,Japanese Restaurant,Coffee Shop,Sushi Restaurant,Gay Bar,Restaurant,Burger Joint,Pub,Bubble Tea Shop,Burrito Place,Café
53,Downtown Toronto,1,Coffee Shop,Café,Bakery,Pub,Park,Breakfast Spot,Mexican Restaurant,Theater,Italian Restaurant,Historic Site






(1, 12)


Unnamed: 0,Borough,Cluster Labels,1st Most Common Category,2nd Most Common Category,3rd Most Common Category,4th Most Common Category,5th Most Common Category,6th Most Common Category,7th Most Common Category,8th Most Common Category,9th Most Common Category,10th Most Common Category
63,Central Toronto,2,Garden,Yoga Studio,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant






(2, 12)


Unnamed: 0,Borough,Cluster Labels,1st Most Common Category,2nd Most Common Category,3rd Most Common Category,4th Most Common Category,5th Most Common Category,6th Most Common Category,7th Most Common Category,8th Most Common Category,9th Most Common Category,10th Most Common Category
50,Downtown Toronto,3,Park,Playground,Trail,Yoga Studio,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
64,Central Toronto,3,Park,Jewelry Store,Sushi Restaurant,Trail,Yoga Studio,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop,Electronics Store






(1, 12)


Unnamed: 0,Borough,Cluster Labels,1st Most Common Category,2nd Most Common Category,3rd Most Common Category,4th Most Common Category,5th Most Common Category,6th Most Common Category,7th Most Common Category,8th Most Common Category,9th Most Common Category,10th Most Common Category
44,Central Toronto,4,Park,Swim School,Bus Line,Yoga Studio,Diner,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store
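
The shapes printed above show that the clusters are very uneven: 2, 32, 1, 2 and 1 neighborhoods. With the real data this could be checked with `value_counts()` on the `Cluster Labels` column; the sketch below rebuilds a label column with those sizes purely for illustration:

```python
import pandas as pd

# Cluster sizes taken from the shapes printed above: 2, 32, 1, 2, 1
labels_demo = pd.Series([0]*2 + [1]*32 + [2]*1 + [3]*2 + [4]*1, name='Cluster Labels')

sizes = labels_demo.value_counts().sort_index()
share_largest = sizes.max() / sizes.sum()
print(sizes.to_dict(), round(share_largest, 2))
```

One cluster holding most of the neighborhoods may suggest that a different number of clusters, or differently scaled features, would be worth trying; that is a possible follow-up rather than a conclusion of this notebook.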




