<a id='TableOfContents'></a>
## Peer Graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
##### This is the notebook for the peer graded assignment in the 3rd module of the course: Applied Data Science Capstone
****
This assignment combines three techniques: scraping and cleaning data from the web (Wikipedia), calling an API <i>(Foursquare)</i> to acquire location data, and clustering the results to better visualize them.<br>
###### Enjoy!

### Table of Contents
1. [Part One - Scraping and Cleaning](#PartOne)
2. [Part Two - Getting Location Data](#PartTwo)
3. [Part Three - Cluster the Neighborhoods](#PartThree)

<i>Note: use the hyperlinks above to jump to the corresponding part of the assignment</i>
***

<a id='zero'></a>
****
### Libraries
****

In [92]:
# Import the libraries used throughout the notebook

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
from geopy.geocoders import Nominatim
# Optional, only if geopy is missing from the environment:
#!conda install -c conda-forge geopy=1.19.0 --yes
# Install a pinned folium version (alternatively: !pip install folium)
!conda install -c conda-forge --no-deps folium=0.10.0 --yes
import folium
print('All good!')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.10.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    folium-0.10.0              |             py_1          59 KB  conda-forge

The following packages will be UPDATED:

    folium: 0.5.0-py_0 conda-forge --> 0.10.0-py_1 conda-forge


Downloading and Extracting Packages
folium-0.10.0        | 59 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
All good!


In [27]:
# Wikipedia URL for our Toronto neighborhoods data, pinned to an earlier revision that still has the expected table structure
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=942851379'
# Download the page with requests and parse the HTML with BeautifulSoup
req = requests.get(url)
soup = BeautifulSoup(req.content, 'lxml')
# The postal code table is the first table on the page
table = soup.find_all('table')[0]
# read_html returns a list of dataframes; we keep the first one
df = pd.read_html(str(table))
neighborhood = pd.DataFrame(df[0])

<a id='PartOne'></a>
****
### Part One - Scraping and Cleaning
****

In [28]:
# Let's check our data structure
neighborhood.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [30]:
# Drop rows whose Neighbourhood is "Not assigned"
# (assign the result back instead of using inplace on a column selection)
neighborhood['Neighbourhood'] = neighborhood['Neighbourhood'].replace('Not assigned', np.nan)
neighborhood.dropna(subset=['Neighbourhood'], inplace=True)
neighborhood.reset_index(drop=True, inplace=True)
# Check our current data
print('Our dataframe has {} rows in total and {} with value "Not Assigned" in the Neighbourhood column'.format(
    neighborhood.shape[0], len(neighborhood[neighborhood['Neighbourhood'] == 'Not assigned'])))
# Check whether any Borough is still "Not assigned"; such rows would need separate handling
print('The column "Borough" has {} rows with the value "Not Assigned"'.format(
    len(neighborhood[neighborhood['Borough'] == 'Not assigned'])))

Our dataframe has 210 rows in total and 0 with value "Not Assigned" in the Neighbourhood column
The column "Borough" has 0 rows with the value "Not Assigned"


In [34]:
# Group our data by Postcode and Borough
neighborhood = neighborhood.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
neighborhood.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
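
The drop-and-group steps above can be tried on a tiny frame before running them on the scraped table. This is only a sketch: the four rows in `raw` are made-up stand-ins that reuse the column names of the Wikipedia table, and filtering with a boolean mask is one of several ways to drop the "Not assigned" rows.

```python
import pandas as pd

# Hypothetical toy rows that mimic the columns of the scraped table
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M1B', 'M1B', 'M1G'],
    'Borough': ['Not assigned', 'Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Not assigned', 'Rouge', 'Malvern', 'Woburn'],
})

# Keep only rows that have a real neighbourhood name
clean = raw[raw['Neighbourhood'] != 'Not assigned']

# One row per (Postcode, Borough); neighbourhood names are joined with ', '
grouped = (clean.groupby(['Postcode', 'Borough'])['Neighbourhood']
                .apply(', '.join)
                .reset_index())
print(grouped)
```

Postal codes that appear on several rows collapse into a single row, which is why the row count drops after grouping.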


****
As the final step of the first part, let's check the shape of our dataframe.
****

In [38]:
print("The dataframe's shape is {}".format(neighborhood.shape))

The dataframe's shape is (103, 3)


<a id='PartTwo'></a>
****
### Part Two - Getting Location Data
****

<i>Note: after some experimenting, the Geocoder proved too unreliable to use, so I will continue with the provided .csv file</i>

In [40]:
# Getting the .csv file from the url provided in the assignment
geo_coord_url = 'https://cocl.us/Geospatial_data'
geo_coord_data = pd.read_csv(geo_coord_url)
# Check the file's structure
geo_coord_data.head(2)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497


In [44]:
# A bit of manipulation for easier data merge
geo_coord_data.columns = ['Postcode', 'Latitude', 'Longitude']
# Merge of the two tables into a new one and see our result
Toronto_geodata = pd.merge(neighborhood, geo_coord_data, how = 'left', on = 'Postcode')
Toronto_geodata.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [45]:
print("Our combined dataframe's shape is {}".format(Toronto_geodata.shape))

Our combined dataframe's shape is (103, 5)
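
Because the merge is a left join, a postal code without a match in the coordinates file would stay in the result with empty Latitude and Longitude. A quick way to check for that is the `indicator` option of `pd.merge`. The sketch below uses made-up frames (`left`, `coords`) and a fictional postal code `M9Z` purely for illustration.

```python
import pandas as pd

# Hypothetical data: 'M9Z' has no entry in the coordinates table
left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Hypothetical']})
coords = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# indicator=True adds a '_merge' column that records where each row came from
merged = pd.merge(left, coords, how='left', on='Postcode', indicator=True)

# Rows marked 'left_only' found no coordinates
missing = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```

An empty list here would mean every postal code received coordinates.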


<a id='PartThree'></a>
****
### Part Three - Cluster the Neighborhoods
****
<i>For this part I'll be using the guidelines provided in the course's lab about NY</i>

[Jump to the Final Map](#FinalMap)

In [55]:
# Toronto's coordinates using geopy
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent = 'toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [56]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(Toronto_geodata['Latitude'], Toronto_geodata['Longitude'], Toronto_geodata['Borough'], Toronto_geodata['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)  # use the loop variable, not the `neighborhood` dataframe
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [57]:
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of shared notebooks)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [58]:
# getNearbyVenues function from the lab

LIMIT=100 # limit of number of venues returned by Foursquare API
radius=500 # define radius

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
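
The request URL in the function above is assembled with one long format string. As a sketch, the same query can be built from a dictionary so that every parameter is named and URL-encoded automatically. The helper name `build_explore_url` and the `'ID'` / `'SECRET'` values are placeholders of mine; the endpoint and parameter names are the ones used in the function above.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the venues/explore URL from named query parameters."""
    base = 'https://api.foursquare.com/v2/venues/explore'
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude as a single value
        'radius': radius,
        'limit': limit,
    }
    return base + '?' + urlencode(params)

# Placeholder credentials, coordinates of the first postal code (M1B)
url = build_explore_url('ID', 'SECRET', '20180605', 43.806686, -79.194353)

# Decode the query string again to confirm what will be sent
query = parse_qs(urlparse(url).query)
print(query['ll'], query['radius'], query['limit'])
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the string by hand.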

In [61]:
# Venues for each Neighborhood
Toronto_venues = getNearbyVenues(names = Toronto_geodata['Neighbourhood'],
                                   latitudes = Toronto_geodata['Latitude'],
                                   longitudes = Toronto_geodata['Longitude'])

Rouge, Malvern
Highland Creek, Rouge Hill, Port Union
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
Clairlea, Golden Mile, Oakridge
Cliffcrest, Cliffside, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Clarks Corners, Sullivan, Tam O'Shanter
Agincourt North, L'Amoreaux East, Milliken, Steeles East
L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens, Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West, 

In [63]:
print("The shape of the dataframe is {}.".format(Toronto_venues.shape))
Toronto_venues.head()

The shape of the dataframe is (2197, 7).


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Scarborough Historical Society,43.788755,-79.162438,History Museum
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Marina Spa,43.766,-79.191,Spa


In [73]:
# Data cleaning and grouping
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_onehot['Neighbourhood']=Toronto_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]

# Group our data
Toronto_grouped = Toronto_onehot.groupby('Neighbourhood').mean().reset_index()
print("The shape of the dataframe is {}".format(Toronto_grouped.shape))

The shape of the dataframe is (98, 271)
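
To see why the mean of the one-hot columns is a useful feature, here is a small sketch on made-up data. The neighbourhood names `A` and `B` and their venues are hypothetical, and `dtype=float` is my addition so that the averaging does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical venues: four in neighbourhood A, two in neighbourhood B
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Bakery', 'Bar', 'Bar', 'Bar'],
})

# One column per category, 1.0 where the venue belongs to that category
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of a 0/1 column is the share of venues in that category
freq = onehot.groupby('Neighbourhood').mean().reset_index()
row_sums = freq.drop(columns='Neighbourhood').sum(axis=1)
print(freq)
```

Each row sums to 1, so neighbourhoods with many venues and neighbourhoods with few venues end up on the same scale before clustering.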


In [75]:
# Let's try out the top 5 venues per Neighborhood!
num_top_venues = 5

for hood in Toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = Toronto_grouped[Toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
         venue  freq
0  Coffee Shop  0.07
1   Restaurant  0.06
2         Café  0.05
3          Bar  0.03
4       Bakery  0.03


----Agincourt----
                       venue  freq
0             Breakfast Spot   0.2
1                     Lounge   0.2
2             Clothing Store   0.2
3               Skating Rink   0.2
4  Latin American Restaurant   0.2


----Agincourt North, L'Amoreaux East, Milliken, Steeles East----
                       venue  freq
0                 Playground   0.5
1                       Park   0.5
2          Accessories Store   0.0
3  Middle Eastern Restaurant   0.0
4                      Motel   0.0


----Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown----
                 venue  freq
0        Grocery Store  0.22
1          Pizza Place  0.11
2         Liquor Store  0.11
3       Sandwich Place  0.11
4  Fried Chicken Joint  0.11


----Alderwood, Long Branch----
       

In [76]:
# insert to dataframe
# define our function
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

# Create a new dataframe with the top 10 venues

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix left, fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Toronto_grouped['Neighbourhood']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Restaurant,Café,Bar,Bakery,Clothing Store,Office,Pizza Place,Lounge,Gym
1,Agincourt,Clothing Store,Lounge,Breakfast Spot,Skating Rink,Latin American Restaurant,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Playground,Yoga Studio,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Pharmacy,Fried Chicken Joint,Pizza Place,Sandwich Place,Liquor Store,Beer Store,Fast Food Restaurant,German Restaurant,Department Store
4,"Alderwood, Long Branch",Pizza Place,Pharmacy,Gym,Skating Rink,Coffee Shop,Pub,Sandwich Place,Dog Run,Diner,Discount Store
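
The column names above rely on a try/except around a three-item suffix list, which works for the top 10. As an alternative sketch, a small helper can compute the suffix for any rank. The name `ordinal` is my own and is not part of the lab code.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11th, 12th and 13th are exceptions to the st/nd/rd pattern
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```

For ten venues this produces the same headers as the loop above; it only behaves differently for ranks beyond 20, such as 21st.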


### Time for Clustering!

In [78]:
# set number of clusters
kclusters = 5

Toronto_grouped_clustering = Toronto_grouped.drop(columns='Neighbourhood')
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 2, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
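
Nine of the ten labels printed above are the same, so it is worth counting how many neighbourhoods land in each cluster. The sketch below counts only the ten labels shown in the output; in the notebook, `kmeans.labels_` could presumably be passed to `Counter` in the same way.

```python
from collections import Counter

# The first ten cluster labels, copied from the output above
first_labels = [1, 1, 2, 1, 1, 1, 1, 1, 1, 1]

# Count how often each cluster label occurs
counts = Counter(first_labels)
for label, size in sorted(counts.items()):
    print('Cluster {}: {} neighbourhoods'.format(label, size))
```

A very uneven split is a hint that one cluster absorbs most neighbourhoods, which is useful to keep in mind when reading the final map.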

In [98]:
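# add clustering labels (skipped if the cell is re-run and the column already exists)
if 'Cluster Labels' not in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = Toronto_geodata

# join the sorted venues onto Toronto_geodata so each neighbourhood keeps its latitude/longitude
Toronto_merged = Toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

# Drop rows without a cluster label (neighbourhoods that are missing from the grouped venue data)
Toronto_merged.dropna(subset=['Cluster Labels'], inplace=True)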

Toronto_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1.0,Fast Food Restaurant,Diner,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,1.0,Bar,History Museum,Yoga Studio,Dumpling Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,1.0,Rental Car Location,Mexican Restaurant,Breakfast Spot,Spa,Bank,Intersection,Electronics Store,Medical Center,Moving Target,Dog Run
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.0,Coffee Shop,Korean Restaurant,Yoga Studio,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1.0,Caribbean Restaurant,Bakery,Thai Restaurant,Fried Chicken Joint,Athletics & Sports,Gas Station,Bank,Hakka Restaurant,Electronics Store,Eastern European Restaurant


<a id='FinalMap'></a>
## Final Map Including Clusters with Top 10 Venues

In [99]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Neighbourhood'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
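
One detail of the marker colours is easy to miss: the colour is looked up with `int(cluster)-1`, so cluster 0 uses index -1, which in Python means the last entry of the list. The sketch below shows the resulting mapping with a hypothetical five-colour `palette` standing in for the `rainbow` list.

```python
# Hypothetical palette standing in for the `rainbow` list (5 clusters)
palette = ['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']

# Same lookup as in the map cell: index = cluster - 1
assigned = [palette[cluster - 1] for cluster in range(len(palette))]

for cluster, colour in enumerate(assigned):
    print('Cluster {} -> {}'.format(cluster, colour))
```

Every cluster still gets its own colour, only shifted by one position, so the map stays readable either way.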