## Data Science Capstone Project
There are three chapters in this project:

<b> (1) Web Scraping the List of Postal Codes of Canada</b>

<b> (2) Adding Geolocation to Postcodes</b>

<b> (3) Explore the Neighborhoods in Toronto</b>


### (3) Explore the Neighborhoods in Toronto

First, read the DataFrame *postcode_geo_df*, the output of the previous chapter, from file.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import requests

postcode_geo_df = pd.read_csv('postcode_geo_df.csv')
postcode_geo_df.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
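
The Unnamed: 0 column in the output above appears when a CSV that was saved with its row index is read back without declaring that index. A minimal sketch of the difference, assuming the file was written with its index; the inline CSV text is a two-row stand-in for the real file.

```python
import io
import pandas as pd

# Inline stand-in for postcode_geo_df.csv (two rows copied from the output above).
csv_text = (
    ',Postcode,Borough,Neighbourhood,Latitude,Longitude\n'
    '0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353\n'
    '3,M1G,Scarborough,Woburn,43.770992,-79.216917\n'
)

# Default read: the unnamed first column becomes a data column called 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

Either form works for the rest of the notebook, since later cells select columns by name.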


### a) Get the Foursquare data

Load libraries for k-means and mapping.

In [2]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

#import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

Define the Foursquare API credentials used to explore the neighborhoods in Toronto. The credentials are kept in a hidden cell.

In [3]:
# @hidden_cell
CLIENT_ID = 'CUCRNJJVE4MO1K2AQNKL12PGWSI4EAQ3VC3K4CTJSJX510LO' # your Foursquare ID
CLIENT_SECRET = 'HJD1MR5RCJPQK2LFKVPXL4YF3LS1KLRTCEBEDY5MTLRLKSXZ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
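
One possible way to keep credentials out of the notebook text is to read them from environment variables. The sketch below is an assumption about how this could be done, not the notebook's own method; the function name and the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are hypothetical.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read Foursquare credentials from a mapping such as os.environ.

    The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are
    hypothetical; use whatever names your environment defines.
    """
    try:
        client_id = env['FOURSQUARE_CLIENT_ID']
        client_secret = env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None
    return client_id, client_secret

# Usage with a stand-in mapping so the example runs without real credentials.
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials(demo_env)
print(CLIENT_ID)
```

Calling the function with no argument would read from the real process environment.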

Set the parameters for exploration: LIMIT is the maximum number of venues requested per neighbourhood.

In [4]:
LIMIT = 100

Define function to loop over neighbourhoods.

In [5]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):

    print('')
    print('--- Start neighbourhood list. ---')
    print('')

    venues_list=[]
    num = 0
    for name, lat, lng in zip(names, latitudes, longitudes):
        num = num+1
        print(num,name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    print('')
    print('--- End neighbourhood list. ---')
    print('')
    return(nearby_venues)
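
The function above builds the URL by string formatting and indexes straight into the response. The following sketch separates those two steps and tolerates a response without groups or a venue without categories. Whether the API actually returns such responses is an assumption; the helper names build_explore_url and extract_venues and the sample payload are hypothetical. The payload only mirrors the nesting that getNearbyVenues indexes into.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL with urlencode instead of manual string formatting."""
    query = urlencode({
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    })
    return 'https://api.foursquare.com/v2/venues/explore?' + query

def extract_venues(payload, name, lat, lng):
    """Turn one response payload into rows; tolerate missing groups or categories."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hypothetical payload with the same nesting that getNearbyVenues indexes into.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.77, 'lng': -79.21},
               'categories': [{'name': 'Thai Restaurant'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 43.78, 'lng': -79.22},
               'categories': []}},
]}]}}

url = build_explore_url('demo-id', 'demo-secret', '20180605', 43.77, -79.21)
rows = extract_venues(sample, 'Woburn', 43.770992, -79.216917)
print(url)
print(rows)
```

Keeping the parsing in its own function makes it possible to test it offline with saved payloads.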

Run the function above on each neighbourhood to create a new dataframe called *toronto_venues*, or read the data from file if the requests.get(url) call causes errors.

In [6]:
#toronto_venues = getNearbyVenues(names=postcode_geo_df['Neighbourhood'],
#                                    latitudes=postcode_geo_df['Latitude'],
#                                    longitudes=postcode_geo_df['Longitude']
#                                    )

toronto_venues = pd.read_csv('toronto_venues.csv')
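
Instead of commenting lines in and out, the choice between the API call and the saved file could be made in code. This is a sketch under assumptions: load_or_fetch is a hypothetical helper, and fake_fetch stands in for getNearbyVenues so that the example runs offline.

```python
import os
import tempfile
import pandas as pd

def load_or_fetch(path, fetch):
    """Return the cached CSV at path if present; otherwise call fetch() and cache it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    frame = fetch()
    frame.to_csv(path, index=False)
    return frame

# Stand-in for getNearbyVenues so the example runs offline.
calls = []
def fake_fetch():
    calls.append(1)
    return pd.DataFrame({'Neighbourhood': ['Woburn'], 'Venue': ['Cafe A']})

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'toronto_venues.csv')
    first = load_or_fetch(cache, fake_fetch)   # file missing: fetches and writes
    second = load_or_fetch(cache, fake_fetch)  # file present: reads from disk

print(len(calls))
```

In the notebook, fetch would be a small wrapper that calls getNearbyVenues with the three columns of postcode_geo_df.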

Optional: export the data to file and print summary info about the venues.

In [7]:
#toronto_venues.to_csv('toronto_venues.csv', index=False, header=True)
print(toronto_venues.shape)

(2268, 7)


Restrict the list to restaurants only.

In [8]:
is_restaurant = toronto_venues['Venue Category'].str.contains('Restaurant')
toronto_venues = toronto_venues[is_restaurant]
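
The filter above relies on every venue having a category string. A minimal sketch on a toy frame, assuming some categories could be missing, shows how the na and regex arguments of str.contains keep the mask strictly boolean and the match literal. The toy data is made up for illustration.

```python
import pandas as pd

venues = pd.DataFrame({
    'Venue': ['A', 'B', 'C', 'D'],
    'Venue Category': ['Thai Restaurant', 'Park', None, 'Restaurant'],
})

# na=False maps missing categories to False so the mask stays boolean;
# regex=False treats 'Restaurant' as a literal substring.
is_restaurant = venues['Venue Category'].str.contains('Restaurant', na=False, regex=False)
restaurants = venues[is_restaurant]
print(restaurants['Venue'].tolist())
```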

Check how many restaurants were returned for each neighbourhood.

In [9]:
toronto_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide,King,Richmond",27,27,27,27,27,27
Agincourt,1,1,1,1,1,1
"Albion Gardens,Beaumond Heights,Humbergate,Jamestown,Mount Olive,Silverstone,South Steeles,Thistletown",2,2,2,2,2,2
"Bathurst Manor,Downsview North,Wilson Heights",4,4,4,4,4,4
Bayview Village,2,2,2,2,2,2
"Bedford Park,Lawrence Manor East",10,10,10,10,10,10
Berczy Park,10,10,10,10,10,10
"Brockton,Exhibition Place,Parkdale Village",4,4,4,4,4,4
Business Reply Mail Processing Centre 969 Eastern,2,2,2,2,2,2
"Cabbagetown,St. James Town",11,11,11,11,11,11


Find out how many unique categories the returned venues fall into.

In [10]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 53 unique categories.


### b) Further Analysis

In [11]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

Examine the new dataframe size.

In [12]:
toronto_onehot.shape

(535, 54)

Group the rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [13]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()

toronto_grouped.head()

Unnamed: 0,Neighbourhood,Afghan Restaurant,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,Comfort Food Restaurant,Cuban Restaurant,Dim Sum Restaurant,Doner Restaurant,Dumpling Restaurant,Eastern European Restaurant,Empanada Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,Filipino Restaurant,French Restaurant,German Restaurant,Gluten-free Restaurant,Greek Restaurant,Hakka Restaurant,Hotpot Restaurant,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Jewish Restaurant,Korean Restaurant,Latin American Restaurant,Malay Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Molecular Gastronomy Restaurant,New American Restaurant,Portuguese Restaurant,Ramen Restaurant,Restaurant,Seafood Restaurant,Southern / Soul Food Restaurant,Sushi Restaurant,Taiwanese Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,"Adelaide,King,Richmond",0.0,0.111111,0.111111,0.0,0.037037,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.037037,0.0,0.037037,0.037037,0.0,0.0,0.037037,0.0,0.037037,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.111111,0.037037,0.0,0.074074,0.0,0.0,0.148148,0.0,0.0,0.037037,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Bathurst Manor,Downsview North,Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
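
To see why the grouped values are fractions that sum to 1 per neighbourhood, the one-hot and mean steps can be traced on a toy frame. This is an illustrative sketch with made-up rows, not data from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'Neighbourhood': ['X', 'X', 'X', 'X', 'Y'],
    'Venue Category': ['Thai Restaurant', 'Thai Restaurant', 'Sushi Restaurant',
                       'Restaurant', 'Sushi Restaurant'],
})

# One indicator column per category, then the neighbourhood label as first column.
onehot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='')
onehot.insert(0, 'Neighbourhood', toy['Neighbourhood'])

# The mean of 0/1 indicators is the share of venues in each category.
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

Neighbourhood X has four venues, two of them Thai, so its Thai Restaurant share is 0.5.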


Confirm the new size

In [14]:
toronto_grouped.shape

(68, 54)

If interested, un-comment the code below to print each neighborhood along with its top 10 most common venues.

In [15]:
#num_top_venues = 10
#
#for hood in toronto_grouped['Neighbourhood']:
#    print("----"+hood+"----")
#    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
#    temp.columns = ['venue','freq']
#    temp = temp.iloc[1:]
#    temp['freq'] = temp['freq'].astype(float)
#    temp = temp.round({'freq': 2})
#    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
#    print('\n')

<b>Put results into a *pandas* dataframe</b>

First, write a function to sort the venues in descending order.

In [16]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
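
When a neighbourhood has fewer distinct categories than num_top_venues, the function above fills the remaining slots with categories whose frequency is 0. A possible variant, shown here as a sketch rather than the notebook's method, skips zero-frequency categories and pads with None. The name most_common_nonzero is hypothetical; the example row uses the Bayview Village frequencies from the grouped table, reduced to four categories.

```python
import pandas as pd

def most_common_nonzero(row, num_top_venues):
    """Like return_most_common_venues, but skips zero-frequency categories."""
    freqs = row.iloc[1:].astype(float)
    # kind='stable' asks for a stable sort so tied categories are ordered deterministically.
    ranked = freqs[freqs > 0].sort_values(ascending=False, kind='stable')
    top = list(ranked.index[:num_top_venues])
    # Pad with None so every row still yields num_top_venues entries.
    return top + [None] * (num_top_venues - len(top))

row = pd.Series({
    'Neighbourhood': 'Bayview Village',
    'Chinese Restaurant': 0.5,
    'Greek Restaurant': 0.0,
    'Japanese Restaurant': 0.5,
    'Vietnamese Restaurant': 0.0,
})
result = most_common_nonzero(row, 3)
print(result)
```

The first two entries are the two categories that actually occur, and the third slot stays empty instead of naming a category with no venues.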

Then create the new dataframe and display the top 5 venues for each neighborhood.

In [17]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

print(neighbourhoods_venues_sorted.shape)
neighbourhoods_venues_sorted.head()

(68, 6)


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Adelaide,King,Richmond",Thai Restaurant,Restaurant,American Restaurant,Asian Restaurant,Sushi Restaurant
1,Agincourt,Chinese Restaurant,Vietnamese Restaurant,Dumpling Restaurant,Greek Restaurant,Gluten-free Restaurant
2,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",Japanese Restaurant,Fast Food Restaurant,Vietnamese Restaurant,Dumpling Restaurant,Greek Restaurant
3,"Bathurst Manor,Downsview North,Wilson Heights",Fast Food Restaurant,Sushi Restaurant,Restaurant,Middle Eastern Restaurant,Vietnamese Restaurant
4,Bayview Village,Japanese Restaurant,Chinese Restaurant,Vietnamese Restaurant,Dumpling Restaurant,Greek Restaurant
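
The column labels above come from a three-item suffix list with a fallback to "th", which is correct up to 20 but would produce "21th" beyond that. A small sketch of a general suffix helper follows; the function name ordinal is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 5
columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```

For num_top_venues = 5 this yields the same labels as the cell above.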


### c) Cluster Neighborhoods

Run k-means to cluster the neighbourhoods into 6 clusters. In this analysis, the data divided into 6 clusters even when a higher number, e.g. 10 clusters, was selected as the starting value.

In [18]:
# set number of clusters
kclusters = 6

toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print(kmeans.labels_)

[4 2 3 4 2 4 4 4 3 4 3 4 4 4 4 4 4 3 4 4 4 4 4 5 4 1 2 3 4 4 4 0 0 0 4 4 4
 4 3 4 4 3 2 2 0 4 4 2 1 4 3 4 3 4 4 4 4 4 4 4 4 1 2 4 2 4 4 3]
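
The printed labels are easier to interpret as cluster sizes. The sketch below counts how many neighbourhoods fall into each cluster, using the label values copied from the output above rather than a fresh k-means run.

```python
import numpy as np

# Labels copied from the printed kmeans.labels_ output above.
labels_text = """
4 2 3 4 2 4 4 4 3 4 3 4 4 4 4 4 4 3 4 4 4 4 4 5 4 1 2 3 4 4 4 0 0 0 4 4 4
4 3 4 4 3 2 2 0 4 4 2 1 4 3 4 3 4 4 4 4 4 4 4 4 1 2 4 2 4 4 3
"""
labels = np.array([int(token) for token in labels_text.split()])

# Number of neighbourhoods assigned to each cluster label.
cluster_sizes = np.bincount(labels)
for cluster, size in enumerate(cluster_sizes):
    print('Cluster {}: {} neighbourhoods'.format(cluster, size))
```

With a live model, the same count is available by passing kmeans.labels_ to np.bincount. For the labels shown, cluster 4 holds 42 of the 68 neighbourhoods.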


Create a new dataframe that includes the cluster label as well as the top 5 venues for each neighborhood. Then, in this order, insert the geo data, drop rows with NaN, and set the cluster labels to type 'int'.

In [19]:
# add clustering labels
# use drop if rerun at any time without deleting old output.
#neighbourhoods_venues_sorted.drop('Cluster Labels', axis=1, inplace=True )
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [33]:
toronto_merged = postcode_geo_df
# join neighbourhoods_venues_sorted onto postcode_geo_df so each neighbourhood keeps its latitude/longitude
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
toronto_merged.dropna(axis=0, inplace=True)
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype('int')

toronto_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353,3,Fast Food Restaurant,Vietnamese Restaurant,Hotpot Restaurant,Greek Restaurant,Gluten-free Restaurant
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,0,Mexican Restaurant,Vietnamese Restaurant,Dumpling Restaurant,Greek Restaurant,Gluten-free Restaurant
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4,Korean Restaurant,Vietnamese Restaurant,Dumpling Restaurant,Greek Restaurant,Gluten-free Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,4,Hakka Restaurant,Thai Restaurant,Caribbean Restaurant,Dumpling Restaurant,Greek Restaurant
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577,3,Fast Food Restaurant,Vietnamese Restaurant,Hotpot Restaurant,Greek Restaurant,Gluten-free Restaurant
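
Some row indices are missing from the output above because neighbourhoods without any restaurant have no cluster label, so the join leaves NaN there and dropna removes the row. A toy sketch of that sequence follows; the two small frames are stand-ins built for illustration, reusing neighbourhood names and label values from the tables in this chapter.

```python
import pandas as pd

geo = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1G'],
    'Neighbourhood': ['Rouge,Malvern', 'Highland Creek,Rouge Hill,Port Union', 'Woburn'],
})
# Stand-in cluster table: no entry for the M1C neighbourhood.
clusters = pd.DataFrame({
    'Neighbourhood': ['Rouge,Malvern', 'Woburn'],
    'Cluster Labels': [3, 4],
})

# Left join on the Neighbourhood column; unmatched rows get NaN.
merged = geo.join(clusters.set_index('Neighbourhood'), on='Neighbourhood')
missing_before = int(merged['Cluster Labels'].isna().sum())

# Dropping NaN rows removes M1C; the labels can then be cast back to int.
merged = merged.dropna(axis=0)
merged['Cluster Labels'] = merged['Cluster Labels'].astype('int')
print(missing_before)
print(merged)
```

The NaN introduced by the join is also why the labels have to be cast to int afterwards: a column containing NaN is stored as float.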


### d) Visualize the resulting clusters

Map it. Observation: only members of the largest cluster crowd the downtown area, while the neighborhoods in the periphery are occupied by a diverse set of 5 much smaller clusters plus some members of the largest cluster.

In [39]:
#Toronto geo coordinates
latitude = 43.715383
longitude = -79.405678

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=1).add_to(map_clusters)
       
map_clusters
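
In the color scheme above, ys is used only for its length, which equals kclusters. A reduced sketch of the same idea, under that reading of the code, samples the colormap directly and converts each color to a hex string.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 6

# Sample kclusters evenly spaced points of the rainbow colormap (RGBA rows).
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))

# Convert each RGBA row to a hex string usable as a folium marker color.
rainbow = [colors.rgb2hex(rgba) for rgba in colors_array]
print(rainbow)
```

Because the cluster labels run from 0 to kclusters - 1, rainbow[cluster] indexes this list directly, as in the marker loop above.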

In [None]:
# Fin.