# IBM Data Science
## Coursera Capstone notebook Week 3
This notebook will be used to analyse location data in Toronto for the capstone project of the IBM Data Science course.

**All 3 parts are in this notebook - please scroll to the appropriate part**

### Part 1: Setting up the notebook
In this section, we install the necessary packages, scrape the data from <a href=https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M>Wikipedia</a> using Beautiful Soup, and then explore and clean the data.

In [1]:
"""install the necessary packages"""
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json 
import numpy as np
#!pip install beautifulsoup4 ## These are commented out as the packages are now installed
#!pip install lxml
#!conda install -c conda-forge geopy --yes
#!pip install requests
from bs4 import BeautifulSoup as bs
import requests
pd.set_option("display.precision", 3)

In [2]:
"""use beautiful soup to import the data"""
source_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text #obtains source code as text
soup = bs(source_html, 'lxml') #uses beautiful soup to parse the source code
# print(soup.prettify()) #prints the html with appropriate indents - this was used to identify which arguments to use to find the table etc.
wikitable = soup.tbody #accesses just the table
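
`soup.tbody` returns the first `<tbody>` in the page, so it only works while the postcode table is the first table in the HTML. A minimal sketch of a more explicit selection follows. It assumes the data table carries the CSS class `wikitable` (an assumption about the page markup, not something the notebook checks) and uses a small made-up HTML snippet in place of the real page.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the page: a navigation table followed by the data table.
html = """
<table class="navbox"><tbody><tr><td>navigation</td></tr></tbody></table>
<table class="wikitable sortable"><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1B</td><td>Scarborough</td><td>Rouge</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

first_tbody = soup.tbody                               # whichever table comes first
data_table = soup.find("table", class_="wikitable")    # the table selected by class

header = [th.get_text(strip=True) for th in data_table.find_all("th")]
first_row = [td.get_text(strip=True) for td in data_table.find_all("td")]
print(header, first_row)
```

In this snippet `first_tbody` belongs to the navigation table, which is the failure mode the class-based lookup avoids.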

In [3]:
"""Parses a html segment started with tag <table> followed 
    by multiple <tr> (table rows) and inner <td> (table data) tags. 
    It returns a list of rows with inner columns. 
    Accepts only one <th> (table header/data) in the first row.
    """
def tableDataText(table):   
    def rowgetDataText(tr, coltag='td'): # td (data) or th (header)       
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]  
    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append(rowgetDataText(tr, 'td') ) # data row       
    return rows

In [4]:
wikiclean = tableDataText(wikitable) #apply the method above to our table from wikipedia

nbhd = pd.DataFrame(wikiclean[1:], columns=wikiclean[0]) #convert to a dataframe

nbhd = nbhd[nbhd.Borough != 'Not assigned'].copy() #remove any rows whose borough is 'Not assigned' (copy avoids SettingWithCopyWarning)
nbhd['Neighbourhood'] = nbhd['Neighbourhood'].mask(nbhd['Neighbourhood'] == 'Not assigned', nbhd['Borough']) #replace 'Not assigned' with the borough name, row by row

nbhd2 = nbhd.groupby(['Postcode'])['Neighbourhood'].apply(", ".join) #groups neighbourhood with same postcode, add comma between neighbourhood names 

nbhd2 = nbhd2.rename('Neighbourhoods') #rename the series so it can be joined to the nbhd df
nbhd = nbhd.join(nbhd2,on='Postcode',how='inner') # joins the dfs using the post code as the index
nbhd = nbhd.drop(['Neighbourhood'],axis=1) #removes the original Neighbourhood column
nbhd = nbhd.drop_duplicates() #removes the duplicate entries

nbhd = nbhd.sort_values(by=['Postcode']) #sorts alphabetically
nbhd = nbhd.reset_index(drop=True) #resets index
nbhd.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhoods
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [5]:
nbhd.shape

(103, 3)
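
The cleaning steps above (drop unassigned boroughs, fall back to the borough name, merge neighbourhoods that share a postcode) can also be written as one short pipeline. The sketch below runs on a few made-up rows in the same shape as the scraped table; the frame names `raw`, `clean` and `toy` are placeholders, not names used elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped table.
raw = pd.DataFrame({
    "Postcode": ["M1A", "M1B", "M1B", "M7A"],
    "Borough": ["Not assigned", "Scarborough", "Scarborough", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Rouge", "Malvern", "Not assigned"],
})

# 1. Drop postcodes with no borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# 2. Where the neighbourhood is missing, fall back to the borough name.
missing = clean["Neighbourhood"] == "Not assigned"
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

# 3. One row per postcode, neighbourhood names joined with commas.
toy = (clean.groupby(["Postcode", "Borough"])["Neighbourhood"]
            .agg(", ".join)
            .reset_index()
            .rename(columns={"Neighbourhood": "Neighbourhoods"}))
print(toy)
```

Grouping on both `Postcode` and `Borough` keeps the borough column without a separate join and de-duplication step.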

### Part 2: obtaining latitude and longitude
In this part, we obtain the latitude and longitude for each postcode.

In [6]:
# !pip install geocoder # install the necessary package
import geocoder # import geocoder

In [7]:
latlong = pd.DataFrame(columns = ['Lat','Long']) #Create DF for latlong data
latlong

Unnamed: 0,Lat,Long


In [8]:
postal_code = nbhd.Postcode #get list of postcodes
coords = [] #collect one [lat, long] pair per postcode
for i in postal_code:
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(i))  #using arcgis as google rejected the requests
                                                            #Please note some lat/longs might differ from the rubric
    coords.append(g.latlng if g.latlng else [None, None]) #keep a placeholder if the lookup fails

latlong = pd.DataFrame(coords, columns=['Lat','Long']) #DataFrame.append was removed in pandas 2.0, so build the df in one go

nbhd = pd.concat([nbhd,latlong],axis=1) #combine the dfs

In [9]:
nbhd.head()

Unnamed: 0,Postcode,Borough,Neighbourhoods,Lat,Long
0,M1B,Scarborough,"Rouge, Malvern",43.812,-79.196
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.786,-79.159
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.766,-79.175
3,M1G,Scarborough,Woburn,43.768,-79.218
4,M1H,Scarborough,Cedarbrae,43.77,-79.239
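
Geocoding services occasionally return nothing for a query, so it can help to retry before giving up. The following is a minimal sketch of that idea, not the method used above. `geocode_with_retry` and its `lookup` argument are hypothetical names, and `lookup` stands for any callable that returns a `[lat, lng]` pair or `None`, for example `lambda q: geocoder.arcgis(q).latlng`.

```python
import time

def geocode_with_retry(query, lookup, attempts=3, pause=1.0):
    """Return (lat, lng) for `query`, or (None, None) if every attempt fails."""
    for attempt in range(attempts):
        latlng = lookup(query)
        if latlng:
            return latlng[0], latlng[1]
        if attempt < attempts - 1:
            time.sleep(pause)  # brief pause before trying again
    return None, None

# Stand-in lookup that fails once and then succeeds, so no network is needed.
calls = []
def flaky_lookup(query):
    calls.append(query)
    return None if len(calls) < 2 else [43.812, -79.196]

coords = geocode_with_retry("M1B, Toronto, Ontario", flaky_lookup, pause=0)
print(coords, "after", len(calls), "calls")
```

Passing the lookup in as an argument keeps the retry logic independent of the geocoding provider.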


### Part 3: clustering and analysis
_In this part we explore the data and cluster neighbourhoods_

In [10]:
"""Import the necessary libraries for analysis"""
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# !conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library
# !conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim ## convert an address into latitude and longitude values
print('Installed!')

Installed!


_Let's first look at a map of Toronto with the boroughs marked._

In [11]:
# Identify Toronto's overall coordinates
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))
# create map of Toronto 
map_1 = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(nbhd['Lat'], nbhd['Long'], nbhd['Borough'], nbhd['Neighbourhoods']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_1)  
    
map_1

The geographical coordinates of Toronto are 43.653963, -79.387207.


_We can see that the postcodes of Toronto are roughly evenly spread across the city, with a denser cluster in the centre of town._
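
One way to put a number on "centre of town" is the great-circle distance from each postcode to the city coordinates printed above. This sketch uses the standard haversine formula with a mean Earth radius of 6371 km; the function name `haversine_km` is a placeholder, and the two points are taken from the outputs earlier in the notebook.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

city_centre = (43.653963, -79.387207)  # Toronto, as printed by the cell above
m1b = (43.812, -79.196)                # postcode M1B from the Part 2 output

distance = haversine_km(*city_centre, *m1b)
print(round(distance, 1), "km")
```

Applied to every row, the same function would give a distance column that could be used to check how postcode density falls off away from the centre.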


_Let's look closer at the centre of town:_

In [12]:
centre_districts = ['Downtown Toronto','East Toronto','West Toronto','Central Toronto','York','East York'] #define the centre districts
#collect the rows for each centre district, keeping the order of the list above
#(DataFrame.append was removed in pandas 2.0, so the pieces are concatenated instead)
centre = pd.concat([nbhd[nbhd['Borough'] == district] for district in centre_districts], sort=False)

centre = centre.reset_index(drop=True)
centre.head()

Unnamed: 0,Postcode,Borough,Neighbourhoods,Lat,Long
0,M4W,Downtown Toronto,Rosedale,43.682,-79.378
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.668,-79.367
2,M4Y,Downtown Toronto,Church and Wellesley,43.667,-79.381
3,M5A,Downtown Toronto,Harbourfront,43.65,-79.359
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657,-79.378


In [13]:
# We're using Queen's Park as the centre of the map
address = "Queen's Park, Toronto, Ontario"
geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_2 = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(centre['Lat'], centre['Long'],centre['Borough'],centre['Neighbourhoods']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='pink',
        fill_opacity=0.7,
        parse_html=False).add_to(map_2)  
    
map_2

_Now let's start using Foursquare to examine the neighbourhoods._

In [14]:
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID, read from an environment variable (the variable name is an assumption)
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret, kept out of the notebook source
VERSION = '20180605' # Foursquare API version

_We will examine the top 100 venues within a 500 m radius of each postcode._

In [28]:
import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a pandas dataframe (the pandas.io.json import path is deprecated)
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # explore results nest the categories under 'venue.categories'
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [29]:
radius = 500
LIMIT = 100
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhoods', 
                  'Neighbourhoods Latitude', 
                  'Neighbourhoods Longitude', 
                  'Venue', 
                  'Venue Lat', 
                  'Venue Long', 
                  'Venue Category']
    
    return nearby_venues
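
The request URL in `getNearbyVenues` is assembled with a long format string. As an alternative sketch, the same query can be built from a dictionary with the standard library, which takes care of separators and escaping. `build_explore_url` is a hypothetical helper; the endpoint and parameter names are the ones already used above, and the credentials here are dummy values.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

url = build_explore_url("ID", "SECRET", "20180605", 43.682, -79.378)
print(url)
```

With `requests`, the same dictionary can be passed directly as `requests.get(base_url, params=params)`, which avoids building the string by hand at all.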

In [30]:
Toronto_venues = getNearbyVenues(names=centre['Neighbourhoods'],
                                   latitudes=centre['Lat'],
                                   longitudes=centre['Long']
                                  )

Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Queen's Park
The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Business Reply Mail Processing Centre 969 Eastern
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The Junction South
Parkdale, Roncesvalles
Runnymede, Swansea
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, So

In [53]:
print(Toronto_venues.shape) #see how many results we got
Toronto_venues.head(20)

(1881, 7)


Unnamed: 0,Neighbourhoods,Neighbourhoods Latitude,Neighbourhoods Longitude,Venue,Venue Lat,Venue Long,Venue Category
0,Rosedale,43.682,-79.378,Summerhill Market,43.686,-79.375,Grocery Store
1,Rosedale,43.682,-79.378,Rosedale Park,43.682,-79.379,Playground
2,Rosedale,43.682,-79.378,Whitney Park,43.682,-79.374,Park
3,Rosedale,43.682,-79.378,Scoops Convenience Boutique,43.686,-79.376,Candy Store
4,"Cabbagetown, St. James Town",43.668,-79.367,F'Amelia,43.668,-79.369,Italian Restaurant
5,"Cabbagetown, St. James Town",43.668,-79.367,Cranberries,43.668,-79.369,Diner
6,"Cabbagetown, St. James Town",43.668,-79.367,Butter Chicken Factory,43.667,-79.369,Indian Restaurant
7,"Cabbagetown, St. James Town",43.668,-79.367,Kingyo Toronto,43.666,-79.368,Japanese Restaurant
8,"Cabbagetown, St. James Town",43.668,-79.367,Murgatroid,43.667,-79.369,Restaurant
9,"Cabbagetown, St. James Town",43.668,-79.367,Merryberry Cafe + Bistro,43.667,-79.369,Café


In [32]:
Toronto_venues.groupby('Neighbourhoods').count() #see how many venues there are for each neighbourhood group

Unnamed: 0_level_0,Neighbourhoods Latitude,Neighbourhoods Longitude,Venue,Venue Lat,Venue Long,Venue Category
Neighbourhoods,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,61,61,61,61,61,61
"Brockton, Exhibition Place, Parkdale Village",68,68,68,68,68,68
Business Reply Mail Processing Centre 969 Eastern,100,100,100,100,100,100
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",71,71,71,71,71,71
"Cabbagetown, St. James Town",38,38,38,38,38,38
Caledonia-Fairbanks,10,10,10,10,10,10
Central Bay Street,99,99,99,99,99,99
"Chinatown, Grange Park, Kensington Market",72,72,72,72,72,72
Christie,11,11,11,11,11,11


_We can see that most neighbourhood groups did not reach the limit of 100 venues, so for those groups the results were not cut off by the limit and we can treat the dataset as reasonably representative._
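
To see exactly which groups did hit the cap, the venue rows can be counted per neighbourhood group and compared with the limit. This is a small sketch with a hypothetical helper, `groups_at_limit`; the sample list reuses two counts from the output above (Berczy Park with 61 venues, Adelaide, King, Richmond with 100).

```python
from collections import Counter

LIMIT = 100  # same cap as in the request

def groups_at_limit(neighbourhood_per_venue, limit=LIMIT):
    """Return the neighbourhood groups whose venue count reached the limit."""
    counts = Counter(neighbourhood_per_venue)
    return sorted(name for name, n in counts.items() if n >= limit)

# One entry per venue row, as in the 'Neighbourhoods' column of the venue table.
sample = ["Berczy Park"] * 61 + ["Adelaide, King, Richmond"] * 100
capped = groups_at_limit(sample)
print(capped)
```

On the real data the call would be `groups_at_limit(Toronto_venues['Neighbourhoods'])`; any group it returns may have more venues than the API handed back.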

_Now let's analyze the neighbourhoods to find the most common type of venue for each area_

In [36]:
# one hot encoding
TNT_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhoods column back to dataframe
TNT_onehot['Neighbourhoods'] = Toronto_venues['Neighbourhoods'] 

# move neighborhood column to the first column
fixed_columns = [TNT_onehot.columns[-1]] + list(TNT_onehot.columns[:-1])
TNT_onehot = TNT_onehot[fixed_columns]

# group rows by neighbourhood and take the mean of the frequency of occurrence of each category
TNT_grouped = TNT_onehot.groupby('Neighbourhoods').mean().reset_index()
TNT_grouped.head(5)

Unnamed: 0,Neighbourhoods,Adult Boutique,Afghan Restaurant,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Basketball Stadium,Beer Bar,Beer Store,Belgian Restaurant,Bike Shop,Bistro,Board Shop,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bridge,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Bus Line,Butcher,Café,Camera Store,Candy Store,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Auditorium,College Gym,College Quad,College Rec Center,Colombian Restaurant,Comfort Food Restaurant,Comic Shop,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Cuban Restaurant,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Garden,Garden Center,Gas Station,Gastropub,Gay Bar,General Entertainment,General Travel,Gift Shop,Gluten-free Restaurant,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Harbor / Marina,Health & Beauty Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Hockey Arena,Hookah Bar,Hostel,Hotel,Hotel Bar,Housing Development,IT Services,Ice Cream Shop,Indian Restaurant,Indoor Play Area,Intersection,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Karaoke Bar,Korean Restaurant,Lake,Latin American Restaurant,Light Rail Station,Lingerie Store,Liquor Store,Lounge,Market,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Movie Theater,Museum,Music Venue,Neighborhood,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Organic Grocery,Other Great Outdoors,Park,Performing Arts Venue,Peruvian Restaurant,Pet Store,Pharmacy,Pier,Pilates Studio,Pizza Place,Playground,Plaza,Poke Place,Portuguese Restaurant,Poutine Place,Pub,Ramen Restaurant,Record Shop,Recreation Center,Restaurant,Rock Climbing Spot,Roof Deck,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Smoke Shop,Snack Place,Soccer Field,Soup Place,Southern / Soul Food Restaurant,Souvlaki Shop,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Steakhouse,Strip Club,Supermarket,Sushi Restaurant,Swim School,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tea Room,Tech Startup,Tennis Court,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Tram Station,Tunnel,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.03,0.0,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.03,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.03,0.0,0.0,0.0,0.01,0.03,0.01,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.01,0.01,0.01,0.0,0.01,0.0,0.03,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.04,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.016,0.0,0.0,0.0,0.0,0.016,0.0,0.016,0.033,0.0,0.0,0.016,0.033,0.0,0.0,0.0,0.016,0.0,0.0,0.0,0.0,0.0,0.033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016,0.033,0.0,0.0,0.0,0.033,0.0,0.0,0.0,0.0,0.0,0.049,0.082,0.0,0.0,0.0,0.0,0.0,0.0,0.016,0.0,0.016,0.0,0.0,0.0,0.016,0.0,0.0,0.0,0.0,0.016,0.0,0.0,0.016,0.0,0.0,0.0,0.0,0.016,0.0,0.0,0.0,0.0,0.0,0.033,0.0,0.0,0.0,0.016,0.0,0.0,0.0,0.0,0.0,0.016,0.016,0.016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016,0.016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016,0.016,0.016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016,0.033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016,0.0,0.0,0.016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033,0.0,0.016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016,0.0,0.033,0.0,0.0,0.0,0.0,0.0,0.016,0.0,0.0,0.0,0.016,0.0,0.0,0.016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.029,0.015,0.0,0.0,0.0,0.0,0.0,0.0,0.044,0.0,0.029,0.0,0.015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015,0.0,0.0,0.0,0.0,0.015,0.0,0.0,0.0,0.059,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015,0.0,0.015,0.088,0.0,0.0,0.0,0.0,0.0,0.0,0.015,0.0,0.0,0.0,0.015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015,0.015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015,0.015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.059,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015,0.0,0.0,0.0,0.015,0.029,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029,0.0,0.0,0.0,0.015,0.0,0.0,0.0,0.0,0.029,0.015,0.0,0.0,0.029,0.015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015,0.015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015,0.0,0.0,0.0,0.015,0.0,0.0,0.0,0.0,0.015,0.015,0.0,0.0,0.0,0.059,0.0,0.0,0.0,0.0,0.0,0.029,0.0,0.015,0.0,0.0,0.0,0.0,0.0,0.015,0.0,0.0,0.015,0.0,0.0,0.0,0.0,0.0,0.015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015,0.0,0.0,0.0,0.015,0.0,0.0,0.0,0.0,0.0,0.015,0.029,0.0,0.0,0.015,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.02,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.09,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.03,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.03,0.01,0.01,0.0,0.02,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.03,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.04,0.0,0.0,0.03,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.0,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.028,0.0,0.042,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.113,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014,0.014,0.0,0.014,0.0,0.0,0.028,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014,0.0,0.028,0.0,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.014,0.014,0.0,0.0,0.0,0.0,0.0,0.028,0.014,0.07,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.014,0.014,0.0,0.014,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.028,0.0,0.014,0.0,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.028,0.014,0.0,0.014,0.028,0.0,0.0,0.0,0.0,0.0,0.028,0.0,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.014,0.028,0.0,0.0,0.014,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.014
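
Each value in the table above is the share of a neighbourhood group's venues that fall in one category, because the mean of a 0/1 column is a relative frequency. The sketch below checks this on a few made-up rows by comparing the one-hot-then-mean route with `pd.crosstab(..., normalize="index")`, which computes the same shares directly. The toy frame `venues` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical venue rows in the same shape as the venue table.
venues = pd.DataFrame({
    "Neighbourhoods": ["Rosedale", "Rosedale", "Rosedale", "Rosedale", "Christie", "Christie"],
    "Venue Category": ["Park", "Playground", "Park", "Grocery Store", "Café", "Park"],
})

# Route 1: one-hot encode, then average within each neighbourhood group.
onehot = pd.get_dummies(venues["Venue Category"]).astype(float)
onehot["Neighbourhoods"] = venues["Neighbourhoods"]
by_mean = onehot.groupby("Neighbourhoods").mean()

# Route 2: cross-tabulate and normalise each row to sum to 1.
by_crosstab = pd.crosstab(venues["Neighbourhoods"], venues["Venue Category"], normalize="index")

print(by_mean)
print(np.allclose(by_mean.values, by_crosstab.values))
```

Every row sums to 1, which is why very common categories such as coffee shops can dominate the feature vectors used for clustering later on.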


In [37]:
# define a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [45]:
# create a new DF for the top 10 venues in each Postcode
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhoods']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no special suffix beyond 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhoods'] = TNT_grouped['Neighbourhoods']

for ind in np.arange(TNT_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(TNT_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head(5)

Unnamed: 0,Neighbourhoods,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Hotel,Steakhouse,Restaurant,Breakfast Spot,American Restaurant,Gastropub,Bar,Asian Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant,Seafood Restaurant,Cheese Shop,Bakery,Farmers Market,Café,Lounge,Hotel
2,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Restaurant,Café,Furniture / Home Store,Bakery,Juice Bar,Bar,Hotel,Gym,Sandwich Place
3,Business Reply Mail Processing Centre 969 Eastern,Coffee Shop,Bar,Hotel,Steakhouse,Seafood Restaurant,Sushi Restaurant,Gym,Café,Thai Restaurant,Pub
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Coffee Shop,Italian Restaurant,Bar,Café,Intersection,Sandwich Place,Gym / Fitness Center,Electronics Store,Park,Restaurant
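
The column labels above rely on a three-item suffix list with a fallback to "th", which is correct for 1 to 10 but would produce "21th" if more than 20 columns were requested. A small helper that handles any positive integer is sketched here; `ordinal` and `columns_sketch` are hypothetical names.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns_sketch = ["Neighbourhoods"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns_sketch)
```

For the ten columns used here the result is the same as the labels already shown.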


_Now let's cluster the neighbourhoods using K-means clustering. We will use 6 clusters._

In [46]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 6

TNT_grouped_clustering = TNT_grouped.drop('Neighbourhoods', axis=1) # axis must be passed by keyword in pandas 2.0+

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(TNT_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

TNT_merged = centre

# merge the grouped data with the original centre data to add latitude/longitude for each Postcode
TNT_merged = TNT_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhoods'), on='Neighbourhoods')

TNT_merged.head() 

Unnamed: 0,Postcode,Borough,Neighbourhoods,Lat,Long,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.682,-79.378,0,Playground,Grocery Store,Park,Candy Store,Dumpling Restaurant,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.668,-79.367,5,Coffee Shop,Restaurant,Pizza Place,Café,Bakery,Italian Restaurant,Taiwanese Restaurant,Breakfast Spot,Indian Restaurant,Pub
2,M4Y,Downtown Toronto,Church and Wellesley,43.667,-79.381,5,Coffee Shop,Japanese Restaurant,Gay Bar,Restaurant,Sushi Restaurant,Gastropub,Men's Store,Pub,Fast Food Restaurant,Café
3,M5A,Downtown Toronto,Harbourfront,43.65,-79.359,5,Coffee Shop,Bakery,Café,Theater,Boat or Ferry,French Restaurant,Dessert Shop,Hotel,Ice Cream Shop,Historic Site
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657,-79.378,5,Coffee Shop,Clothing Store,Japanese Restaurant,Cosmetics Shop,Café,Tea Room,Fast Food Restaurant,Hotel,Diner,Italian Restaurant
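
The number of clusters is chosen by hand in this notebook. One common way to sanity-check that choice, which the notebook itself does not use, is to compare silhouette scores across several values of k. The sketch below does this on synthetic data with three well-separated blobs; `X`, `scores` and `best_k` are placeholder names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs, 20 points each.
rng = np.random.default_rng(0)
blob_centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in blob_centres])

# Fit K-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

To try this on the notebook's data, `X` would be replaced by `TNT_grouped_clustering`; with so many sparse category columns the scores may be low for every k, which is itself informative.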


_Now, let's visualize the clusters!_

In [47]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(TNT_merged['Lat'], TNT_merged['Long'], TNT_merged['Neighbourhoods'], TNT_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

_We can see that the majority of our neighbourhoods are in Cluster 5. Let's take a look at that cluster to see why._

In [48]:
TNT_merged.loc[TNT_merged['Cluster Labels'] == 5, TNT_merged.columns[[1] + list(range(5, TNT_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Downtown Toronto,5,Coffee Shop,Restaurant,Pizza Place,Café,Bakery,Italian Restaurant,Taiwanese Restaurant,Breakfast Spot,Indian Restaurant,Pub
2,Downtown Toronto,5,Coffee Shop,Japanese Restaurant,Gay Bar,Restaurant,Sushi Restaurant,Gastropub,Men's Store,Pub,Fast Food Restaurant,Café
3,Downtown Toronto,5,Coffee Shop,Bakery,Café,Theater,Boat or Ferry,French Restaurant,Dessert Shop,Hotel,Ice Cream Shop,Historic Site
4,Downtown Toronto,5,Coffee Shop,Clothing Store,Japanese Restaurant,Cosmetics Shop,Café,Tea Room,Fast Food Restaurant,Hotel,Diner,Italian Restaurant
5,Downtown Toronto,5,Coffee Shop,Café,Restaurant,Cocktail Bar,Bakery,Clothing Store,Beer Bar,Seafood Restaurant,Breakfast Spot,Cosmetics Shop
6,Downtown Toronto,5,Coffee Shop,Cocktail Bar,Restaurant,Seafood Restaurant,Cheese Shop,Bakery,Farmers Market,Café,Lounge,Hotel
7,Downtown Toronto,5,Coffee Shop,Clothing Store,Ice Cream Shop,Bakery,Bubble Tea Shop,Hotel,Gym / Fitness Center,Spa,Bookstore,Sandwich Place
8,Downtown Toronto,5,Coffee Shop,Café,Hotel,Steakhouse,Restaurant,Breakfast Spot,American Restaurant,Gastropub,Bar,Asian Restaurant
10,Downtown Toronto,5,Coffee Shop,Café,Hotel,Restaurant,Italian Restaurant,Gastropub,American Restaurant,Steakhouse,Seafood Restaurant,Deli / Bodega
11,Downtown Toronto,5,Coffee Shop,Hotel,Café,Restaurant,Japanese Restaurant,Gym,Gastropub,Deli / Bodega,Steakhouse,Italian Restaurant


_In all these neighbourhoods, coffee shops and cafes are very common. Let's see what the clusters look like if we exclude these categories from the search_

In [55]:
#Exclude coffee shops and cafes
TNT_nocoffee = Toronto_venues[Toronto_venues['Venue Category'] != 'Coffee Shop']
TNT_nocoffee = TNT_nocoffee[TNT_nocoffee['Venue Category'] != 'Café']
TNT_nocoffee.head(5)
# alt method: TNT_nocoffee.drop(TNT_nocoffee.loc[TNT_nocoffee['Venue Category']=='Café'].index, inplace=True) 

Unnamed: 0,Neighbourhoods,Neighbourhoods Latitude,Neighbourhoods Longitude,Venue,Venue Lat,Venue Long,Venue Category
0,Rosedale,43.682,-79.378,Summerhill Market,43.686,-79.375,Grocery Store
1,Rosedale,43.682,-79.378,Rosedale Park,43.682,-79.379,Playground
2,Rosedale,43.682,-79.378,Whitney Park,43.682,-79.374,Park
3,Rosedale,43.682,-79.378,Scoops Convenience Boutique,43.686,-79.376,Candy Store
4,"Cabbagetown, St. James Town",43.668,-79.367,F'Amelia,43.668,-79.369,Italian Restaurant
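
When several categories need to be excluded, a single `isin` filter is an alternative to chaining one comparison per category. The sketch below shows the pattern on a made-up frame; `excluded` and `kept` are placeholder names.

```python
import pandas as pd

# Hypothetical venue rows.
venues = pd.DataFrame({
    "Venue": ["A", "B", "C", "D"],
    "Venue Category": ["Coffee Shop", "Park", "Café", "Bakery"],
})

excluded = ["Coffee Shop", "Café"]
# Keep only the rows whose category is not in the exclusion list.
kept = venues[~venues["Venue Category"].isin(excluded)]
print(kept)
```

Extending the exclusion to more categories then only means adding names to the list.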


In [56]:
#now let's carry out the same analysis as before:
TNT_onehot = pd.get_dummies(TNT_nocoffee[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhoods column back to dataframe
TNT_onehot['Neighbourhoods'] = TNT_nocoffee['Neighbourhoods'] 

# move neighborhood column to the first column
fixed_columns = [TNT_onehot.columns[-1]] + list(TNT_onehot.columns[:-1])
TNT_onehot = TNT_onehot[fixed_columns]

# group rows by neighbourhood and take the mean of the frequency of occurrence of each category
TNT_grouped = TNT_onehot.groupby('Neighbourhoods').mean().reset_index()

# create a new DF for the top 10 venues in each Postcode (but no coffee!)
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhoods']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no special suffix beyond 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhoods'] = TNT_grouped['Neighbourhoods']

for ind in np.arange(TNT_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(TNT_grouped.iloc[ind, :], num_top_venues)

kclusters = 6

TNT_grouped_clustering = TNT_grouped.drop('Neighbourhoods', axis=1) # axis must be passed by keyword in pandas 2.0+

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(TNT_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

TNT_merged = centre

# merge the grouped data with the original centre data to add latitude/longitude for each Postcode
TNT_merged = TNT_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhoods'), on='Neighbourhoods')

TNT_merged.head() 

Unnamed: 0,Postcode,Borough,Neighbourhoods,Lat,Long,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.682,-79.378,5,Playground,Grocery Store,Park,Candy Store,Eastern European Restaurant,Flea Market,Fish Market,Fish & Chips Shop,Field,Fast Food Restaurant
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.668,-79.367,2,Restaurant,Bakery,Pizza Place,Italian Restaurant,Taiwanese Restaurant,Breakfast Spot,Indian Restaurant,Pub,Japanese Restaurant,Jewelry Store
2,M4Y,Downtown Toronto,Church and Wellesley,43.667,-79.381,2,Japanese Restaurant,Gay Bar,Restaurant,Sushi Restaurant,Fast Food Restaurant,Pub,Hotel,Dance Studio,Men's Store,Gastropub
3,M5A,Downtown Toronto,Harbourfront,43.65,-79.359,2,Bakery,Boat or Ferry,Theater,Dessert Shop,Breakfast Spot,Pub,Mexican Restaurant,Ice Cream Shop,Farmers Market,Shoe Store
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657,-79.378,2,Clothing Store,Cosmetics Shop,Japanese Restaurant,Furniture / Home Store,Theater,Fast Food Restaurant,Bookstore,Restaurant,Ramen Restaurant,Hotel


In [57]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(TNT_merged['Lat'], TNT_merged['Long'], TNT_merged['Neighbourhoods'], TNT_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

_There are some minor changes, but central Toronto still looks fairly homogeneous in terms of the types of venue on offer._
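
K-means assigns cluster numbers arbitrarily, so the same group can be called 5 in one run and 2 in the next. A measure that ignores the label names, such as the adjusted Rand index, gives a fairer comparison of the two runs. The sketch below applies it to the cluster labels of the five rows shown in the two `TNT_merged.head()` outputs above; it is an illustration on those five rows only, not a result for the full dataset.

```python
from sklearn.metrics import adjusted_rand_score

# Cluster labels of the first five postcodes in each run, copied from the outputs above.
with_coffee = [0, 5, 5, 5, 5]
without_coffee = [5, 2, 2, 2, 2]

# 1.0 means the two runs group the rows identically, whatever the label numbers are.
score = adjusted_rand_score(with_coffee, without_coffee)
print(score)
```

Running the same comparison on the full `Cluster Labels` columns of both runs would show how much the grouping really changed once coffee shops and cafés were removed.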