# IBM Data Science Professional Certificate
## Part 9: Applied Data Science Capstone - Week 3

<a id="0"></a>
## Notebook Table of Contents:
* [Part 1: Scraping Wikipedia and Creating a Cleaned Dataframe](#1)
    * [Requirement 1: Create New Workbook.](#1.1)
    * [Requirement 2: Scrape the Data from Wikipedia.](#1.2)
    * [Requirement 3: Generate Clean Dataframe.](#1.3)
    * [Requirement 4: Submit to Github and Link.](#1.4)
* [Part 2: Geocoding the Dataframe.](#2)
* [Part 3: Visually Explore and Cluster Neighbourhoods.](#3)

### This is Part 1: Scraping Wikipedia and Creating a Cleaned Dataframe <a name="1"></a>
For this part of the assignment, I am exploring and clustering the neighborhoods in Toronto.
Each portion of the problem is broken out on its own.

#### Requirement 1: Create a new Notebook - Done. <a name="1.1"></a>

[Back to Table of Contents](#0)

#### Requirement 2: Scrape the data from Wikipedia <a name="1.2"></a>

In [1]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

import requests # library to handle requests
from bs4 import BeautifulSoup

print('Done importing.')

Done importing.


In [2]:
#assign the get request to a variable
website_url = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

In [3]:
#check the status of the request - should read 200 for success.
website_url.status_code

200

In [4]:
#Pull the html from the website as text
fullPage = BeautifulSoup(website_url.text,"lxml")
#print(fullPage.prettify())      
#uncomment print to see all the webpage html - takes up too much space, so commented out.

[Back to Table of Contents](#0)

#### Requirement 3 - Turn the scraped Wikipedia data into a pandas dataframe. <a name="1.3"></a>

In [5]:
#Isolate just the table.
wikiTable = fullPage.find("table",{"class":"wikitable sortable"})

In [6]:
#Isolate just the rows in the table.
tableRows = wikiTable.findAll('tr')

Sub-criteria 1: The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [7]:
#Loop through the rows of the table and add them to a dataframe
postCode = []
for tr in tableRows:
    td = tr.findAll('td')
    row = [cell.text for cell in td] # .text is important here so we don't get html tags.
    if row: # header rows have no <td> cells, so they are skipped
        postCode.append(row)
            
df = pd.DataFrame(postCode,columns=["Postal Code", "Borough", "Neighbourhood"])

#Clean up the messy carriage returns.
df = df.replace(r"\n","", regex=True)

#Display the dataframe header to check.
df.head()      # Still has "Not Assigned"

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Sub-criteria 2: Ignore cells with a Borough that is "Not Assigned"

In [8]:
#ignore cells with Borough = "Not assigned"
df = df[df.Borough != "Not assigned"]

#Display the dataframe
df.head()     # Still needs to be grouped

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Sub-criteria 3: Group neighbourhoods based on their PostalCode; comma separated.

In [9]:
#Grouping Neighborhoods by PostalCode and Borough - comma separated.
df = df.groupby(['Postal Code', 'Borough'])['Neighbourhood'].agg(lambda x: ', '.join(set(x))).reset_index()

#Display the dataframe
df.head()      # Still need to assign "Not Assigned" Neighborhoods to Borough name.

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"West Hill, Guildwood, Morningside"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [10]:
#Specific Check: M5A from example to make sure it matches.
df.loc[df["Postal Code"] == "M5A"]

Unnamed: 0,Postal Code,Borough,Neighbourhood
53,M5A,Downtown Toronto,"Harbourfront, Regent Park"


Sub-criteria 4: If Borough has name and Neighbourhood does not, use Borough name as Neighborhood name. (See M7A.)

In [11]:
#Check to see which rows need fixing.
df[df["Neighbourhood"].str.contains("Not assign")]

Unnamed: 0,Postal Code,Borough,Neighbourhood
85,M7A,Queen's Park,Not assigned


In [12]:
#replace unassigned Neighborhoods with Borough name
df["Neighbourhood"] = np.where(df["Neighbourhood"].str.contains("Not assign"), df["Borough"], df["Neighbourhood"])

#Check M7A from example to make sure it matches.
df.loc[df["Postal Code"] == "M7A"]

Unnamed: 0,Postal Code,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park


In [13]:
df.head()     #Final check on head

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"West Hill, Guildwood, Morningside"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [14]:
df.shape

(103, 3)
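
The three cleaning steps above (filter, group, fall back to the Borough name) can be tried end to end on a tiny frame. This is a minimal sketch on made-up rows shaped like the scraped table, not the actual Wikipedia data. It joins the neighbourhoods directly rather than through `set()`, which is an assumption on my part to keep the original row order.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same shape as the scraped table.
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postal Code", "Borough", "Neighbourhood"],
)

# Sub-criteria 2: drop rows whose Borough is not assigned.
clean = raw[raw["Borough"] != "Not assigned"]

# Sub-criteria 3: one row per postal code, neighbourhoods comma separated.
# ", ".join keeps the rows in their original order; set() would not guarantee it.
clean = (
    clean.groupby(["Postal Code", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

# Sub-criteria 4: use the Borough name when no neighbourhood is assigned.
clean["Neighbourhood"] = np.where(
    clean["Neighbourhood"] == "Not assigned",
    clean["Borough"],
    clean["Neighbourhood"],
)

print(clean)
```

On this toy input, M1A is dropped, the two M5A rows collapse into one, and M7A takes its Borough name.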

[Back to Table of Contents](#0)

#### Requirement 4: Submit a link to Notebook on Github. <a name="1.4"></a>
 ~~ Should have worked if you're reading this ~~

[Back to Table of Contents](#0)

### This is Part 2: Geocoding the Dataframe <a name="2"></a>
For this part of the assignment, I take the previously created dataframe and assign the appropriate Latitude / Longitude values to each PostalCode.

In [15]:
# Retrieve csv file with lat/lon data.
!wget -q -O 'Geospatial_data.csv' http://cocl.us/Geospatial_data
print('Data downloaded')

Data downloaded


In [16]:
# Load the data to a dataframe.
geo_data = pd.read_csv("Geospatial_data.csv")

geo_data.head() # Verify working as intended.

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [17]:
# Merge the dataframe from part 1 with the geospatial data.
mergedData = pd.merge(df, geo_data, on="Postal Code")

mergedData.head() # Verify working as intended.

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"West Hill, Guildwood, Morningside",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
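
The default inner merge silently drops any postal code that has no match in the coordinates file. A hedged sketch of how that could be checked follows: it uses a left merge with pandas' `validate` and `indicator` options on made-up frames, and the code `M9Z` is a hypothetical value invented to show a miss.

```python
import pandas as pd

# Made-up stand-ins for df and geo_data; "M9Z" is a hypothetical unmatched code.
neighbourhoods = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Example Borough"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# how="left" keeps every neighbourhood row, validate raises if a postal code
# is duplicated on either side, and indicator adds a "_merge" column.
checked = pd.merge(
    neighbourhoods,
    coords,
    on="Postal Code",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Postal codes that found no coordinates.
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

If `missing` is empty for the real data, the inner merge used above loses nothing.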


#### Submit a link to Notebook on Github.
 ~~ Should have worked if you're reading this ~~

[Back to Table of Contents](#0)

### This is Part 3: Visually Explore and Cluster Neighbourhoods <a name="3"></a>
For this part of the assignment, I visualize, explore and cluster the neighborhoods in Toronto, following the approach of the New York exercise.

In [18]:
#importing additional libraries for visualizations

import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt # plotting library
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Additional Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Additional Libraries imported.


In [19]:
# The code was removed by Watson Studio for sharing.

Your credentials:
CLIENT_ID: [redacted]
CLIENT_SECRET: [redacted]


In [20]:
# Define borough's latitude and longitude values.

borough_name = mergedData.loc[0, "Borough"] # Borough Name
borough_code = mergedData.loc[0, "Postal Code"] # Borough Postal Code

borough_lat = mergedData.loc[0, "Latitude"] # Borough Latitude Value
borough_lon = mergedData.loc[0, "Longitude"] # Borough Longitude Value

In [21]:
# create map of all Boroughs in our data using latitude and longitude values
map_mergedData = folium.Map(location=[borough_lat, borough_lon], zoom_start=11)

# add markers to map
for lat, lng, postcode, borough, neighbourhood in zip(mergedData["Latitude"], mergedData["Longitude"], mergedData["Postal Code"], mergedData["Borough"], mergedData["Neighbourhood"]):
    label="Borough: {} ({});   Neighbourhood(s): {}".format(borough, postcode, neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=10,
        popup=label,
        color='black', #black border for all markers
        fill=True,
        fill_color='seagreen', #seagreen fill
        fill_opacity=0.7,
        parse_html=True).add_to(map_mergedData)  
    
map_mergedData

In [22]:
# Narrow the scope of the data to just Boroughs containing "Toronto"
toronto_data = mergedData[mergedData["Borough"].str.contains("Toronto")].reset_index(drop=True)

# Check dataframe
toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"Riverdale, The Danforth West",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [23]:
# Define Toronto borough's latitude and longitude values.

t_borough_name = toronto_data.loc[0, "Borough"] # Borough Name
t_borough_neighbourhood = toronto_data.loc[0, "Neighbourhood"] # Borough Neighbourhood

t_borough_lat = toronto_data.loc[0, "Latitude"] # Borough Latitude Value
t_borough_lon = toronto_data.loc[0, "Longitude"] # Borough Longitude Value

In [24]:
# create map of Toronto Boroughs in our data using latitude and longitude values
map_toronto_data = folium.Map(location=[t_borough_lat, t_borough_lon], zoom_start=12)

# add markers to map
for lat, lng, postcode, borough, neighbourhood in zip(toronto_data["Latitude"], toronto_data["Longitude"], toronto_data["Postal Code"], toronto_data["Borough"], toronto_data["Neighbourhood"]):
    label="Borough: {} ({});   Neighbourhood(s): {}".format(borough, postcode, neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=10,
        popup=label,
        color='black', #black border for all markers
        fill=True,
        fill_color='orange', #orange fill
        fill_opacity=0.7,
        parse_html=True).add_to(map_toronto_data)  
    
map_toronto_data

In [25]:
# The code was removed by Watson Studio for sharing.

'https://api.foursquare.com/v2/venues/explore?&client_id=[redacted]&client_secret=[redacted]&v=20190201&ll=43.806686299999996,-79.19435340000001&radius=500&limit=100'

In [26]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5c560449dd57971927729c6f'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4bb6b9446edc76b0d771311c-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/fastfood_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d16e941735',
         'name': 'Fast Food Restaurant',
         'pluralName': 'Fast Food Restaurants',
         'primary': True,
         'shortName': 'Fast Food'}],
       'id': '4bb6b9446edc76b0d771311c',
       'location': {'cc': 'CA',
        'city': 'Toronto',
        'country': 'Canada',
        'crossStreet': 'Morningside & Sheppard',
        'distance': 387,
        'formattedAddress': ['Toronto ON', 'Canada'],
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.80744841934756,
          'ln

In [27]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened Foursquare rows keep the list under 'venue.categories'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [28]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Wendy's,Fast Food Restaurant,43.807448,-79.199056


In [29]:
print("{} venue(s) were returned by Foursquare.".format(nearby_venues.shape[0]))

1 venue(s) were returned by Foursquare.
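
To make the flattening step easier to follow, here is a small sketch on a hand-built payload. The first item reuses the Wendy's values shown above; the second item, with an empty category list, is invented. It uses the top-level `pd.json_normalize`, which assumes pandas 1.0 or newer rather than the older environment this notebook ran in.

```python
import pandas as pd

# Hand-built items shaped like results['response']['groups'][0]['items'].
# The second venue is hypothetical and has no categories.
items = [
    {
        "venue": {
            "name": "Wendy's",
            "categories": [{"name": "Fast Food Restaurant"}],
            "location": {"lat": 43.807448, "lng": -79.199056},
        }
    },
    {
        "venue": {
            "name": "Example Kiosk",
            "categories": [],
            "location": {"lat": 43.8, "lng": -79.2},
        }
    },
]

# Nested dicts become dotted column names; lists are left as-is.
flat = pd.json_normalize(items)

# Take the first category name, or None when the list is empty.
flat["category"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

print(flat[["venue.name", "category", "venue.location.lat", "venue.location.lng"]])
```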


### Explore Neighborhoods in Toronto

In [30]:
def getTorontoVenues(t_borough_name, t_borough_neighbourhood, t_borough_lat, t_borough_lon, radius=500):
    
    venues_list = [] # collects one list of venue tuples per neighbourhood
    for name, neighbourhood, lat, lng in zip(t_borough_name, t_borough_neighbourhood, t_borough_lat, t_borough_lon):
        print(name,"(",neighbourhood,")")
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]["groups"][0]["items"]
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            neighbourhood,
            lat, 
            lng, 
            v["venue"]["name"], 
            v["venue"]["location"]["lat"], 
            v["venue"]['location']["lng"],  
            v["venue"]["categories"][0]["name"]) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ["Borough",
                             "Neighbourhood",
                             "Borough Latitude",
                             "Borough Longitude",
                             "Venue",
                             "Venue Latitude",
                             "Venue Longitude",
                             "Venue Category"]
    
    return nearby_venues
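
The function above indexes `categories[0]` directly, which would fail for a venue that comes back with an empty category list. A possible guard is sketched below as a separate helper. `extract_venue_rows` is a hypothetical name, the Starbucks values are taken from the output further down, and the second item is invented.

```python
def extract_venue_rows(borough, neighbourhood, lat, lng, items):
    """Turn Foursquare-style items into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None instead of raising IndexError on an empty list.
        category = categories[0]["name"] if categories else None
        rows.append((
            borough,
            neighbourhood,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# One real-looking item and one hypothetical item without categories.
sample_items = [
    {"venue": {"name": "Starbucks",
               "location": {"lat": 43.678798, "lng": -79.298045},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Example Venue",
               "location": {"lat": 43.68, "lng": -79.29},
               "categories": []}},
]

rows = extract_venue_rows("East Toronto", "The Beaches", 43.676357, -79.293031, sample_items)
for row in rows:
    print(row)
```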

In [31]:
# Creating a new dataframe by running the function 'getTorontoVenues' on our toronto only data.
toronto_venues = getTorontoVenues(t_borough_name=toronto_data["Borough"],
                                  t_borough_neighbourhood=toronto_data["Neighbourhood"], 
                                  t_borough_lat=toronto_data["Latitude"],
                                  t_borough_lon=toronto_data["Longitude"]
                                  )

East Toronto ( The Beaches )
East Toronto ( Riverdale, The Danforth West )
East Toronto ( The Beaches West, India Bazaar )
East Toronto ( Studio District )
Central Toronto ( Lawrence Park )
Central Toronto ( Davisville North )
Central Toronto ( North Toronto West )
Central Toronto ( Davisville )
Central Toronto ( Moore Park, Summerhill East )
Central Toronto ( Summerhill West, South Hill, Forest Hill SE, Rathnelly, Deer Park )
Downtown Toronto ( Rosedale )
Downtown Toronto ( St. James Town, Cabbagetown )
Downtown Toronto ( Church and Wellesley )
Downtown Toronto ( Harbourfront, Regent Park )
Downtown Toronto ( Garden District, Ryerson )
Downtown Toronto ( St. James Town )
Downtown Toronto ( Berczy Park )
Downtown Toronto ( Central Bay Street )
Downtown Toronto ( Richmond, Adelaide, King )
Downtown Toronto ( Union Station, Harbourfront East, Toronto Islands )
Downtown Toronto ( Toronto Dominion Centre, Design Exchange )
Downtown Toronto ( Victoria Hotel, Commerce Court )
Central Toronto

In [32]:
# Get an idea of the shape and make sure everything looks correct in dataframe.
print(toronto_venues.shape)
toronto_venues.head()

(1698, 8)


Unnamed: 0,Borough,Neighbourhood,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,East Toronto,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
1,East Toronto,The Beaches,43.676357,-79.293031,Starbucks,43.678798,-79.298045,Coffee Shop
2,East Toronto,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
3,East Toronto,The Beaches,43.676357,-79.293031,Dip 'n Sip,43.678897,-79.297745,Coffee Shop
4,East Toronto,"Riverdale, The Danforth West",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


Now that we have a dataframe with all the venues in each Borough that we wish to analyze, we can start reviewing different aspects.

In [33]:
toronto_venues.groupby(["Borough", "Neighbourhood"]).count()

Borough,Neighbourhood,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Central Toronto,Davisville,35,35,35,35,35,35
Central Toronto,Davisville North,8,8,8,8,8,8
Central Toronto,"Forest Hill West, Forest Hill North",4,4,4,4,4,4
Central Toronto,Lawrence Park,4,4,4,4,4,4
Central Toronto,"Moore Park, Summerhill East",3,3,3,3,3,3
Central Toronto,"North Midtown, The Annex, Yorkville",24,24,24,24,24,24
Central Toronto,North Toronto West,22,22,22,22,22,22
Central Toronto,Roselawn,1,1,1,1,1,1
Central Toronto,"Summerhill West, South Hill, Forest Hill SE, Rathnelly, Deer Park",15,15,15,15,15,15
Downtown Toronto,Berczy Park,55,55,55,55,55,55


Now we check how many unique categories there are in these returned venues.

In [34]:
print("There are {} unique categories.".format(len(toronto_venues["Venue Category"].unique())))

There are 238 unique categories.


### Analyze Each Neighborhood

One-hot encoding turns each venue category into its own 0/1 column, so the categories can be fed to numerical tools such as k-means.

In [35]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[["Venue Category"]], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot["Neighbourhood"] = toronto_venues["Neighbourhood"] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# add borough column back to dataframe
toronto_onehot["Borough"] = toronto_venues["Borough"] 

# move borough column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Borough,Neighbourhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,East Toronto,The Beaches,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,East Toronto,The Beaches,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,East Toronto,The Beaches,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,East Toronto,The Beaches,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,East Toronto,"Riverdale, The Danforth West",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [36]:
# check the new dataframe size
toronto_onehot.shape

(1698, 240)

In [37]:
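# group rows by neighbourhood based on the mean frequency of each category.
# the text-valued Borough column is dropped first because it cannot be averaged.
toronto_grouped = toronto_onehot.drop("Borough", axis=1).groupby("Neighbourhood").mean().reset_index()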
toronto_grouped

Unnamed: 0,Neighbourhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
3,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012195,...,0.0,0.0,0.012195,0.0,0.012195,0.0,0.012195,0.0,0.0,0.012195
4,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Church and Wellesley,0.011364,0.011364,0.0,0.0,0.0,0.0,0.0,0.0,0.011364,...,0.0,0.0,0.0,0.011364,0.0,0.011364,0.0,0.011364,0.0,0.011364
6,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Dufferin, Dovercourt Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"First Canadian Place, Underground city",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0


In [38]:
# Check the grouped size of the dataframe
toronto_grouped.shape

(38, 239)
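
Why the mean of one-hot rows gives a frequency may be easier to see on a toy example. The sketch below uses invented neighbourhoods `A` and `B` with a handful of made-up venues. Each row holds a single 1, so the per-neighbourhood mean of a column is the share of venues in that category.

```python
import pandas as pd

# Made-up venues: four in neighbourhood A, two in neighbourhood B.
venues = pd.DataFrame(
    {
        "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
        "Venue Category": ["Café", "Café", "Park", "Pub", "Park", "Park"],
    }
)

# One 0/1 column per category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Averaging the 0/1 columns gives each category's share of the venues.
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

Here half of A's venues are cafés, so `freq.loc["A", "Café"]` is 0.5, and every row of `freq` sums to 1.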

In [39]:
# Print each Neighbourhood with the top 5 most common venues.
num_top_venues = 5

for hood in toronto_grouped["Neighbourhood"]:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped["Neighbourhood"] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
          venue  freq
0   Coffee Shop  0.07
1    Restaurant  0.05
2  Cocktail Bar  0.05
3          Café  0.04
4        Bakery  0.04


----Brockton, Exhibition Place, Parkdale Village----
                   venue  freq
0         Breakfast Spot  0.10
1            Coffee Shop  0.10
2                   Café  0.10
3  Performing Arts Venue  0.05
4                 Office  0.05


----Business Reply Mail Processing Centre 969 Eastern----
              venue  freq
0       Yoga Studio  0.06
1     Auto Workshop  0.06
2              Park  0.06
3        Comic Shop  0.06
4  Recording Studio  0.06


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.16
1                Café  0.06
2  Italian Restaurant  0.05
3        Burger Joint  0.04
4                 Bar  0.04


----Christie----
               venue  freq
0               Café  0.20
1      Grocery Store  0.20
2               Park  0.13
3         Baby Store  0.07
4  Convenience Store  0.07


----Church 

In [40]:
# Put information into new dataframe sorted for easy to digest info

#define function:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
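
Categories with equal frequencies can come out in either order when sorted, so the ranking between tied venues is somewhat arbitrary. One possible variant, sketched here under the assumption that a stable tie order is wanted, uses `Series.nlargest` with `keep="first"`. `top_categories` is a hypothetical helper, and the frequencies mirror the Christie values printed above.

```python
import pandas as pd


def top_categories(row, num_top_venues):
    """Return the most frequent category names; ties keep their column order."""
    # Skip the leading Neighbourhood entry and make sure the rest is numeric.
    frequencies = row.iloc[1:].astype(float)
    return frequencies.nlargest(num_top_venues, keep="first").index.tolist()


# A single grouped row, shaped like one row of toronto_grouped.
row = pd.Series(
    {
        "Neighbourhood": "Christie",
        "Café": 0.20,
        "Grocery Store": 0.20,
        "Park": 0.13,
        "Baby Store": 0.07,
        "Convenience Store": 0.07,
    }
)

top = top_categories(row, 3)
print(top)
```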

In [41]:
num_top_venues = 10

indicators = ["st", "nd", "rd"]

# create columns according to number of top venues
columns = ["Neighbourhood"]
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # positions beyond the 3rd use the "th" suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted["Neighbourhood"] = toronto_grouped["Neighbourhood"]

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant,Farmers Market,Seafood Restaurant,Bakery,Café,Italian Restaurant,Beer Bar,Steakhouse
1,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Café,Breakfast Spot,Bar,Burrito Place,Stadium,Italian Restaurant,Nightclub,Caribbean Restaurant,Office
2,Business Reply Mail Processing Centre 969 Eastern,Yoga Studio,Auto Workshop,Garden Center,Garden,Light Rail Station,Fast Food Restaurant,Farmers Market,Park,Comic Shop,Recording Studio
3,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Burger Joint,Bar,Sandwich Place,Bubble Tea Shop,Thai Restaurant,Chinese Restaurant,Salad Place
4,Christie,Café,Grocery Store,Park,Convenience Store,Coffee Shop,Baby Store,Restaurant,Diner,Italian Restaurant,Nightclub
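
The suffix lookup used for the column names works for the top 10, but it would label a 21st column as "21th". A small sketch of a more general helper follows. `ordinal` is a hypothetical name and not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th are the irregular cases.
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Build the same column names as above for the top 10 venues.
columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```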


### Cluster Neighborhoods
Now to run k-means to group the neighbourhoods into 5 clusters.

In [42]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop("Neighbourhood", axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
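
The first ten labels above are all 0, so it can be worth counting how many rows land in each cluster before interpreting them. The sketch below shows one way to do that on made-up frequency vectors. The data and the choice of two clusters are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency vectors: three rows dominated by the first category,
# two rows dominated by the last one.
X = np.array([
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.00],
    [0.85, 0.15, 0.00],
    [0.00, 0.10, 0.90],
    [0.00, 0.20, 0.80],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Number of rows assigned to each cluster label.
cluster_sizes = np.bincount(model.labels_, minlength=2)
print(cluster_sizes)
```

Applied to `kmeans.labels_`, a very lopsided count would suggest that most neighbourhoods look alike under this feature set.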

In [43]:
# New dataframe with cluster # and top 10 venues for each neighbourhood.

# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index("Neighbourhood"), on="Neighbourhood")

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,3,Coffee Shop,Pub,Neighborhood,Yoga Studio,Dog Run,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space
1,M4K,East Toronto,"Riverdale, The Danforth West",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Bookstore,Italian Restaurant,Cosmetics Shop,Brewery,Bubble Tea Shop,Restaurant,Caribbean Restaurant
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,0,Fish & Chips Shop,Intersection,Pet Store,Pizza Place,Pub,Movie Theater,Sandwich Place,Burrito Place,Burger Joint,Brewery
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,Italian Restaurant,Bakery,American Restaurant,Yoga Studio,Park,Seafood Restaurant,Sandwich Place,Cheese Shop
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Park,Dim Sum Restaurant,Swim School,Bus Line,Yoga Studio,Donut Shop,Fish & Chips Shop,Filipino Restaurant,Fast Food Restaurant,Farmers Market
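
A join like the one above leaves `NaN` in `Cluster Labels` for any neighbourhood that returned no venues, and that turns the column into floats. Whether this happens in the real data is not shown here. The sketch below, on invented rows, shows one way such rows could be dropped before the labels are used as list indices.

```python
import pandas as pd

# Made-up locations; neighbourhood "C" has no cluster label.
locations = pd.DataFrame(
    {
        "Neighbourhood": ["A", "B", "C"],
        "Latitude": [43.67, 43.68, 43.69],
        "Longitude": [-79.29, -79.35, -79.31],
    }
)
labels = pd.DataFrame({"Neighbourhood": ["A", "B"], "Cluster Labels": [0, 3]})

merged = locations.join(labels.set_index("Neighbourhood"), on="Neighbourhood")

# Keep only labelled rows and restore integer labels for indexing.
plottable = merged.dropna(subset=["Cluster Labels"]).copy()
plottable["Cluster Labels"] = plottable["Cluster Labels"].astype(int)

print(plottable)
```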


In [44]:
# Visualize the Clusters

# create map
map_clusters = folium.Map(location=[t_borough_lat, t_borough_lon], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examine Clusters

In [45]:
# Cluster 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,East Toronto,0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Bookstore,Italian Restaurant,Cosmetics Shop,Brewery,Bubble Tea Shop,Restaurant,Caribbean Restaurant
2,East Toronto,0,Fish & Chips Shop,Intersection,Pet Store,Pizza Place,Pub,Movie Theater,Sandwich Place,Burrito Place,Burger Joint,Brewery
3,East Toronto,0,Café,Coffee Shop,Italian Restaurant,Bakery,American Restaurant,Yoga Studio,Park,Seafood Restaurant,Sandwich Place,Cheese Shop
4,Central Toronto,0,Park,Dim Sum Restaurant,Swim School,Bus Line,Yoga Studio,Donut Shop,Fish & Chips Shop,Filipino Restaurant,Fast Food Restaurant,Farmers Market
5,Central Toronto,0,Dog Run,Food & Drink Shop,Burger Joint,Gym,Park,Breakfast Spot,Hotel,Sandwich Place,Farmers Market,Donut Shop
6,Central Toronto,0,Sporting Goods Shop,Coffee Shop,Clothing Store,Salon / Barbershop,Spa,Park,Mexican Restaurant,Miscellaneous Shop,Sandwich Place,Yoga Studio
7,Central Toronto,0,Pizza Place,Sandwich Place,Dessert Shop,Italian Restaurant,Café,Sushi Restaurant,Coffee Shop,Seafood Restaurant,Pharmacy,Brewery
9,Central Toronto,0,Pub,Coffee Shop,Pizza Place,American Restaurant,Restaurant,Bagel Shop,Fried Chicken Joint,Sports Bar,Supermarket,Sushi Restaurant
11,Downtown Toronto,0,Coffee Shop,Restaurant,Park,Chinese Restaurant,Pizza Place,Italian Restaurant,Café,Bakery,Pub,Japanese Restaurant
12,Downtown Toronto,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Burger Joint,Pub,Pizza Place,Men's Store,Mediterranean Restaurant


In [46]:
# Cluster 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Central Toronto,1,Garden,Yoga Studio,Food,Fish Market,Fish & Chips Shop,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space


In [47]:
# Cluster 3
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Downtown Toronto,2,Park,Playground,Trail,Yoga Studio,Ethiopian Restaurant,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
23,Central Toronto,2,Jewelry Store,Trail,Sushi Restaurant,Park,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Yoga Studio


In [48]:
# Cluster 4
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,3,Coffee Shop,Pub,Neighborhood,Yoga Studio,Dog Run,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space


In [49]:
# Cluster 5
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Central Toronto,4,Summer Camp,Trail,Restaurant,Yoga Studio,Dog Run,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space
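
The five cells above can also be summarized in a single pass. This is a sketch on made-up labels standing in for `toronto_merged`, assuming only the `Borough` and `Cluster Labels` columns are needed for the overview.

```python
import pandas as pd

# Made-up rows standing in for toronto_merged.
labelled = pd.DataFrame(
    {
        "Borough": [
            "East Toronto",
            "East Toronto",
            "Central Toronto",
            "Central Toronto",
            "Downtown Toronto",
        ],
        "Cluster Labels": [3, 0, 0, 1, 2],
    }
)

# Rows per cluster, ordered by label.
cluster_sizes = labelled["Cluster Labels"].value_counts().sort_index()
print(cluster_sizes)

# One loop over the clusters instead of one cell per label.
for label, group in labelled.groupby("Cluster Labels"):
    print("Cluster {}: {}".format(label + 1, sorted(group["Borough"].unique())))
```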


# Thank you for taking time to review my work. If you do not give full credit, please note why so I can improve.

[Back to Table of Contents](#0)