# IBM Applied Data Science Capstone Project
# Which Toronto neighborhood is best for the client to buy a home in?

This notebook is my Capstone Project for the IBM Applied Data Science Professional Certificate.

#### The hypothetical situation is:

You are a real estate agent in Toronto and have a client who accepted an awesome new data science position. You have shown the client 6 houses and they have narrowed their choices down to 3 wonderful homes in 3 different neighborhoods in Toronto. 

The clients are not concerned with commute time. However, the 3 most important factors to them are that a 1) park, 2) coffee shop, and 3) gym are close by in the neighborhood.

#### The 3 houses are in the neighborhoods of 1) Berczy Park, 2) Queen's Park, and 3) Rosedale.

The goal of this notebook and capstone project is to determine which neighborhood best fits the client's criteria and thus which house they should buy.

In [1]:
#import the dependencies that I'll need
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#for data visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available in pandas 1.0+)

# map rendering library
import folium 

# for kmeans
from sklearn.cluster import KMeans 

I'll first read in the postal code, borough, and neighborhood data.

In [2]:
#Read in the data and check how many rows and columns
wiki = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)[0]
print ('The wiki dataframe has', wiki.shape[0], 'rows and', wiki.shape[1], 'columns')

The wiki dataframe has 288 rows and 3 columns


In [3]:
#See the first few rows of the wiki dataframe
wiki.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


I'll now drop (ignore) boroughs that are 'Not assigned'. I also rename the 'Neighbourhood' column to 'Neighborhood'. I then check the shape again.

In [4]:
# Rename 'Neighbourhood' to 'Neighborhood', drop rows whose borough is 'Not assigned', then check the shape again
wiki.rename(columns={'Neighbourhood':'Neighborhood'}, inplace=True)
wiki = wiki[wiki.Borough != 'Not assigned']
print ('There are now', wiki.shape[0], 'rows and', wiki.shape[1], 'columns in the wiki dataframe.')

There are now 211 rows and 3 columns in the wiki dataframe.


In [5]:
#Check the first few rows again...
wiki.head()

Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


#### I now combine rows with multiple neighborhoods for a single postcode
#### Note: I decided to make a separate dataframe for 'borough'. This is so I don't get duplicate values in the 'borough' feature when I combine multiple neighborhoods in the 'neighborhood' feature. I then put them back together into the same dataframe.

In [6]:
# First, I make a separate 'borough' dataframe

borough = wiki[['Postcode', 'Borough']]
borough.head()

Unnamed: 0,Postcode,Borough
2,M3A,North York
3,M4A,North York
4,M5A,Downtown Toronto
5,M5A,Downtown Toronto
6,M6A,North York


In [7]:
# I then look at the shape (it has the same 211 rows as the wiki dataframe, so postcodes with several neighborhoods appear more than once)
borough.shape

(211, 2)

In [8]:
# I then remove duplicate values for the same postcode. I then look at the shape again.
borough = borough.drop_duplicates(['Postcode'])
borough.shape

(103, 2)

In [9]:
# I now make a 'neighborhood' dataframe containing only the postcode and neighborhood (no borough). I then check the first few rows of the dataframe.
neighborhood = wiki[['Postcode', 'Neighborhood']]
neighborhood.head()

Unnamed: 0,Postcode,Neighborhood
2,M3A,Parkwoods
3,M4A,Victoria Village
4,M5A,Harbourfront
5,M5A,Regent Park
6,M6A,Lawrence Heights


In [10]:
# Finally, I combine the multiple neighborhoods with the single postcodes.

neighborhood = neighborhood.groupby(['Postcode'], sort = False).agg(lambda x : ','.join(x))

In [11]:
# Check the first few rows to confirm it worked
neighborhood.head()

Unnamed: 0_level_0,Neighborhood
Postcode,Unnamed: 1_level_1
M3A,Parkwoods
M4A,Victoria Village
M5A,"Harbourfront,Regent Park"
M6A,"Lawrence Heights,Lawrence Manor"
M7A,Not assigned
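
As a possible alternative to splitting the table and merging it back later, the combining step can be sketched with a single `groupby`. This is only an illustration on a small hand-typed sample: the `wiki_sample` frame is my own stand-in (its rows are copied from the outputs above, not read from Wikipedia). The `'first'` aggregation keeps one borough per postcode while the neighborhoods are joined with commas.

```python
import pandas as pd

# Hand-typed sample with the same columns as the wiki dataframe (illustrative only)
wiki_sample = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M5A', 'M5A', 'M6A', 'M6A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'Downtown Toronto', 'North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Harbourfront',
                     'Regent Park', 'Lawrence Heights', 'Lawrence Manor'],
})

# One row per postcode: keep the first borough, join the neighborhood names
combined = (
    wiki_sample
    .groupby('Postcode', sort=False)
    .agg({'Borough': 'first', 'Neighborhood': lambda names: ','.join(names)})
    .reset_index()
)
print(combined)
```

Because the borough is the same for every row of a postcode, taking the first value loses nothing, and no later merge with a separate borough dataframe is needed.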


#### I now fill any 'Not assigned' Neighborhoods with the name of the Borough...

In [12]:
# First, I check to see how many there are...

neighborhood[neighborhood.Neighborhood == 'Not assigned']

Unnamed: 0_level_0,Neighborhood
Postcode,Unnamed: 1_level_1
M7A,Not assigned


In [13]:
# Now I need to check my 'borough' dataframe for the borough name of postcode M7A...
borough[borough.Postcode == 'M7A']

Unnamed: 0,Postcode,Borough
8,M7A,Queen's Park


#### I see there is only 1 row with a 'Not assigned' neighborhood value, so I will just change it individually.

In [14]:
neighborhood = neighborhood.replace({'Neighborhood': r'Not assigned'}, {'Neighborhood': "Queen's Park"}, regex=True)
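
If there were more than one 'Not assigned' neighborhood, changing each one by hand would not scale. Here is a minimal sketch of filling them from the borough column instead, assuming the borough and neighborhood sit in the same dataframe. The `sample` frame is hand-typed for illustration and is not the notebook's data.

```python
import pandas as pd

# Hand-typed sample: one normal row and one 'Not assigned' neighborhood (illustrative only)
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront,Regent Park', 'Not assigned'],
})

# Boolean mask of the rows that need a neighborhood name
unassigned = sample['Neighborhood'] == 'Not assigned'

# Copy the borough name into the neighborhood column for those rows only
sample.loc[unassigned, 'Neighborhood'] = sample.loc[unassigned, 'Borough']
print(sample)
```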

#### ... and now I'll check that I've corrected it...

In [15]:
neighborhood.head()

Unnamed: 0_level_0,Neighborhood
Postcode,Unnamed: 1_level_1
M3A,Parkwoods
M4A,Victoria Village
M5A,"Harbourfront,Regent Park"
M6A,"Lawrence Heights,Lawrence Manor"
M7A,Queen's Park


In [16]:
neighborhood[neighborhood.Neighborhood == 'Not assigned']

Unnamed: 0_level_0,Neighborhood
Postcode,Unnamed: 1_level_1


In [17]:
print ('My dataframe has', neighborhood.shape[0], 'rows')

My dataframe has 103 rows


#### Next, I'm loading in the latitude and longitude data, looking at the 1st few rows of the dataset, and renaming the column "Postal Code" to "Postcode" to match the neighborhood and borough dataframes.

In [18]:
geo = pd.read_csv('http://cocl.us/Geospatial_data')

In [19]:
geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [20]:
geo.rename(columns={'Postal Code':'Postcode'}, inplace=True)
geo.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### I'm now merging the neighborhood and geo dataframes to get the lat and lon into the same dataframe as the postcode and neighborhoods.

In [21]:
neighborhood = pd.merge(neighborhood,geo[['Postcode','Latitude', 'Longitude']],on='Postcode', how='left')
neighborhood.head()

Unnamed: 0,Postcode,Neighborhood,Latitude,Longitude
0,M3A,Parkwoods,43.753259,-79.329656
1,M4A,Victoria Village,43.725882,-79.315572
2,M5A,"Harbourfront,Regent Park",43.65426,-79.360636
3,M6A,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,43.662301,-79.389494
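
A left merge silently leaves `NaN` coordinates for any postcode that is missing from the geo data. As a hedged sketch of how that could be checked, the example below uses the `validate` and `indicator` options of `pd.merge` on two small hand-typed frames (`hoods` and `coords` are illustrative stand-ins, with values copied from the outputs above).

```python
import pandas as pd

# Illustrative stand-ins for the neighborhood and geo dataframes
hoods = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M7A'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', "Queen's Park"],
})
coords = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M7A'],
    'Latitude': [43.753259, 43.725882, 43.662301],
    'Longitude': [-79.329656, -79.315572, -79.389494],
})

# validate raises if a postcode is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from
merged = pd.merge(hoods, coords, on='Postcode', how='left',
                  validate='one_to_one', indicator=True)

# Rows found only on the left side have no coordinates
missing = merged[merged['_merge'] == 'left_only']
print(len(missing), 'postcodes without coordinates')
```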


In [22]:
# Checking the shape...
neighborhood.shape

(103, 4)

### I now need to add back in the borough dataframe as well, so the neighborhood dataframe will now have all the appropriate features included.

In [23]:
# merge neighborhood and borough df's, sort by Latitude, and look at dataframe. 
neighborhood = pd.merge(neighborhood,borough[['Postcode','Borough']],on='Postcode', how='left')
neighborhood = neighborhood.sort_values(['Latitude'])
neighborhood.head()

Unnamed: 0,Postcode,Neighborhood,Latitude,Longitude,Borough
93,M8W,"Alderwood,Long Branch",43.602414,-79.543484,Etobicoke
88,M8V,"Humber Bay Shores,Mimico South,New Toronto",43.605647,-79.501321,Etobicoke
102,M8Z,"Kingsway Park South West,Mimico NW,The Queensw...",43.628841,-79.520999,Etobicoke
87,M5V,"CN Tower,Bathurst Quay,Island airport,Harbourf...",43.628947,-79.39442,Downtown Toronto
101,M8Y,"Humber Bay,King's Mill Park,Kingsway Park Sout...",43.636258,-79.498509,Etobicoke


In [24]:
# And its shape...
neighborhood.shape

(103, 5)

### I first need to work only with the neighborhoods containing one of the 3 houses we are interested in: 1) Berczy Park, 2) Queen's Park, and 3) Rosedale. I then analyze them for venues.

In [25]:
# Pull out data for the 3 neighborhoods of interest (notice that this also gives me the latitudes and longitudes that I will need below when I make the Foursquare calls)
toronto_data = neighborhood[neighborhood['Neighborhood'].str.contains('Berczy Park|Queen\'s Park|Rosedale')]
toronto_data

Unnamed: 0,Postcode,Neighborhood,Latitude,Longitude,Borough
20,M5E,Berczy Park,43.644771,-79.373306,Downtown Toronto
4,M7A,Queen's Park,43.662301,-79.389494,Queen's Park
91,M4W,Rosedale,43.679563,-79.377529,Downtown Toronto


#### Next, I use the geopy library to get the latitude and longitude values of Toronto.

In [26]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


#### I now create a map of Toronto with the neighborhoods superimposed on top.

In [27]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
# the loop variables are named so they don't overwrite the borough and neighborhood dataframes
for lat, lng, borough_name, hood_name in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{}, {}'.format(hood_name, borough_name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=13,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [28]:
#### I first define my Foursquare credentials and version (which I've hidden/deleted)

#### I then list the 3 neighborhoods I'm analyzing

In [29]:
import os

# Read the Foursquare credentials from environment variables so they stay out of the notebook.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are my own choice.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version


In [30]:
toronto_data.loc[:, 'Neighborhood']

20     Berczy Park
4     Queen's Park
91        Rosedale
Name: Neighborhood, dtype: object

## 1. First, I'll analyze Berczy Park for parks, coffee shops, and gyms

In [31]:
# First, Berczy Park
neighborhood_latitude = toronto_data.loc[20, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[20, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_data.loc[20, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Berczy Park are 43.644770799999996, -79.3733064.


#### Now, I'll get the top 100 venues that are in Berczy Park within a radius of 500 meters.

#### First, I create the GET request URL.

In [32]:
# build the explore URL: up to 100 venues within 500 meters of the neighborhood coordinates
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius,
    LIMIT)
url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180605&ll=43.644770799999996,-79.3733064&radius=500&limit=100'

#### I now send the GET request 

In [33]:
results = requests.get(url).json()
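
The same URL is built three times in this notebook with one long format string. As a rough sketch of an alternative, the query string could be assembled from a dictionary with the standard library's `urlencode`. The helper name `build_explore_url` and the placeholder credentials `'MY_ID'` and `'MY_SECRET'` are hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

# Endpoint used in this notebook
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Return the explore URL for one location (hypothetical helper, not part of the notebook)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma between latitude and longitude unescaped
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')

# Placeholder credentials; nothing is sent over the network here
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 43.6447708, -79.3733064)
print(example_url)
```

Keeping the parameters in a dictionary makes it harder to swap two positional arguments by accident, and the helper can be reused for each neighborhood.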

#### All the information I need is in the *items* key. Before I proceed, I'll borrow the **get_category_type** function from the Foursquare tutorial.

In [34]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### Now I'll clean the JSON and structure it into a *pandas* dataframe. I'll then look at the first few venues returned by Foursquare.

In [35]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,LCBO,Liquor Store,43.642944,-79.37244
1,The Keg Steakhouse + Bar,Steakhouse,43.646676,-79.374822
2,Sony Centre for the Performing Arts,Concert Hall,43.646292,-79.376022
3,Hockey Hall Of Fame (Hockey Hall of Fame),Museum,43.646974,-79.377323
4,Biff's Bistro,French Restaurant,43.647085,-79.376342


In [36]:
berczy = nearby_venues[nearby_venues['categories'].str.contains('Park|Coffee Shop|Gym')]
berczy

Unnamed: 0,name,categories,lat,lng
14,Berczy Park,Park,43.648048,-79.375172
27,Starbucks,Coffee Shop,43.644424,-79.369294
37,Starbucks,Coffee Shop,43.646957,-79.378265
39,Everyday Gourmet (Teas & Coffees),Coffee Shop,43.648757,-79.371645
44,Starbucks,Coffee Shop,43.648738,-79.372519
55,Starbucks,Coffee Shop,43.644525,-79.36856
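
One caveat with `str.contains('Park|Coffee Shop|Gym')` is that it matches substrings, so a category such as 'Parking' would also be kept, and a venue with no category (`None`) can break the boolean filter. The sketch below compares the substring filter with an exact match using `isin`. The `sample_venues` frame and its venue names are made up purely for illustration.

```python
import pandas as pd

# Made-up venues covering the edge cases (illustrative only)
sample_venues = pd.DataFrame({
    'name': ['Venue A', 'Venue B', 'Venue C', 'Venue D', 'Venue E', 'Venue F'],
    'categories': ['Park', 'Coffee Shop', 'Gym', 'Parking', None, 'Gym / Fitness Center'],
})

wanted = ['Park', 'Coffee Shop', 'Gym']

# Exact match: keeps only rows whose category is exactly one of the wanted values
exact = sample_venues[sample_venues['categories'].isin(wanted)]

# Substring match: na=False treats missing categories as "no match"
substring = sample_venues[
    sample_venues['categories'].str.contains('Park|Coffee Shop|Gym', na=False)
]

print(list(exact['categories']))
print(list(substring['categories']))
```

Which of the two is right depends on the question: the substring version would count 'Gym / Fitness Center' as a gym, which may be what the client wants, but it would also count 'Parking' as a park.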


## 2. Second, I'll analyze Queen's Park for parks, coffee shops, and gyms

In [37]:
# Second, Queen's Park
neighborhood_latitude = toronto_data.loc[4, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[4, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_data.loc[4, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Queen's Park are 43.6623015, -79.3894938.


#### First, I create the GET request URL.

In [38]:
# build the explore URL: up to 100 venues within 500 meters of the neighborhood coordinates
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius,
    LIMIT)
url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180605&ll=43.6623015,-79.3894938&radius=500&limit=100'

#### I now send the GET request 

In [39]:
results = requests.get(url).json()

#### All the information I need is in the *items* key. I again use the **get_category_type** function from the Foursquare tutorial (the same definition as in the Berczy Park section).

In [40]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### Now I'll clean the JSON and structure it into a *pandas* dataframe. I'll then look at the first few venues returned by Foursquare.

In [41]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Queen's Park,Park,43.663946,-79.39218
1,Mercatto,Italian Restaurant,43.660391,-79.387664
2,Nando's Flame-Grilled Chicken,Portuguese Restaurant,43.661617,-79.386095
3,Coffee Island,Coffee Shop,43.664271,-79.386972
4,YMCA,Gym,43.662753,-79.384849


In [42]:
queens = nearby_venues[nearby_venues['categories'].str.contains('Park|Coffee Shop|Gym')]
queens

Unnamed: 0,name,categories,lat,lng
0,Queen's Park,Park,43.663946,-79.39218
3,Coffee Island,Coffee Shop,43.664271,-79.386972
4,YMCA,Gym,43.662753,-79.384849
5,Coffee Public,Coffee Shop,43.660763,-79.386184
6,Starbucks (MaRS),Coffee Shop,43.659456,-79.390411
18,Starbucks,Coffee Shop,43.658557,-79.390196
28,Hart House Gym,Gym,43.664172,-79.394888
30,Starbucks,Coffee Shop,43.660412,-79.392692
31,Tim Hortons,Coffee Shop,43.661038,-79.393797
33,Tim Hortons,Coffee Shop,43.658599,-79.388498


## 3. Third, I'll analyze Rosedale for parks, coffee shops, and gyms

In [43]:
# Third, Rosedale
neighborhood_latitude = toronto_data.loc[91, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[91, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_data.loc[91, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rosedale are 43.6795626, -79.37752940000001.


#### First, I create the GET request URL.

In [44]:
# build the explore URL: up to 100 venues within 500 meters of the neighborhood coordinates
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius,
    LIMIT)
url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180605&ll=43.6795626,-79.37752940000001&radius=500&limit=100'

#### I now send the GET request 

In [45]:
results = requests.get(url).json()

#### All the information I need is in the *items* key. I again use the **get_category_type** function from the Foursquare tutorial (the same definition as in the Berczy Park section).

In [46]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### Now I'll clean the JSON and structure it into a *pandas* dataframe. I'll then look at the first few venues returned by Foursquare.

In [47]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Rosedale Park,Playground,43.682328,-79.378934
1,Whitney Park,Park,43.682036,-79.373788
2,Alex Murray Parkette,Park,43.6783,-79.382773
3,Milkman's Lane,Trail,43.676352,-79.373842


In [48]:
rosedale = nearby_venues[nearby_venues['categories'].str.contains('Park|Coffee Shop|Gym')]
rosedale

Unnamed: 0,name,categories,lat,lng
1,Whitney Park,Park,43.682036,-79.373788
2,Alex Murray Parkette,Park,43.6783,-79.382773


### Exploring Neighborhoods in Toronto

#### I'll first create a function to repeat the same process for each neighborhood of interest

In [49]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
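
One thing to watch in the function above is `v['venue']['categories'][0]`, which would raise an `IndexError` for a venue that comes back with an empty category list. Below is a small sketch of a more defensive way to flatten the items. The helper name `venue_rows` and the `sample_items` list are hypothetical; the items only mimic the nesting that the code above reads, and the second one is invented to show the empty case.

```python
def venue_rows(name, lat, lng, items):
    """Flatten explore items into tuples (hypothetical helper for illustration)."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Use None when the venue has no category instead of failing on [0]
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Two items shaped like the response used above; the second one is invented
sample_items = [
    {'venue': {'name': 'Whitney Park',
               'location': {'lat': 43.682036, 'lng': -79.373788},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.68, 'lng': -79.377},
               'categories': []}},
]

rows = venue_rows('Rosedale', 43.679563, -79.377529, sample_items)
for row in rows:
    print(row)
```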

#### Now I'll write the code to run the above function on each neighborhood and create a new dataframe called *toronto_venues*.

In [50]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Berczy Park
Queen's Park
Rosedale


In [51]:
print(toronto_venues.shape)
toronto_venues.head()

(99, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Berczy Park,43.644771,-79.373306,LCBO,43.642944,-79.37244,Liquor Store
1,Berczy Park,43.644771,-79.373306,The Keg Steakhouse + Bar,43.646676,-79.374822,Steakhouse
2,Berczy Park,43.644771,-79.373306,Sony Centre for the Performing Arts,43.646292,-79.376022,Concert Hall
3,Berczy Park,43.644771,-79.373306,Hockey Hall Of Fame (Hockey Hall of Fame),43.646974,-79.377323,Museum
4,Berczy Park,43.644771,-79.373306,Biff's Bistro,43.647085,-79.376342,French Restaurant


In [52]:
WantList = toronto_venues[toronto_venues['Venue Category'].str.contains('Park|Coffee Shop|Gym')]
WantList

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
14,Berczy Park,43.644771,-79.373306,Berczy Park,43.648048,-79.375172,Park
27,Berczy Park,43.644771,-79.373306,Starbucks,43.644424,-79.369294,Coffee Shop
37,Berczy Park,43.644771,-79.373306,Starbucks,43.646957,-79.378265,Coffee Shop
39,Berczy Park,43.644771,-79.373306,Everyday Gourmet (Teas & Coffees),43.648757,-79.371645,Coffee Shop
44,Berczy Park,43.644771,-79.373306,Starbucks,43.648738,-79.372519,Coffee Shop
55,Berczy Park,43.644771,-79.373306,Starbucks,43.644525,-79.36856,Coffee Shop
56,Queen's Park,43.662301,-79.389494,Queen's Park,43.663946,-79.39218,Park
59,Queen's Park,43.662301,-79.389494,Coffee Island,43.664271,-79.386972,Coffee Shop
60,Queen's Park,43.662301,-79.389494,YMCA,43.662753,-79.384849,Gym
61,Queen's Park,43.662301,-79.389494,Coffee Public,43.660763,-79.386184,Coffee Shop


In [53]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
for lat, lng, hood_name, category in zip(WantList['Venue Latitude'], WantList['Venue Longitude'], WantList['Neighborhood'], WantList['Venue Category']):
    label = '{}, {}'.format(hood_name, category)
    label = folium.Popup(label, parse_html=True)  
    folium.CircleMarker(
        [lat, lng],
        radius=10,
        popup=label,
        color= 'red',
        fill=True,
        fill_color='#3187cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Analyzing each neighborhood...

In [54]:
# one hot encoding
wantlist_onehot = pd.get_dummies(WantList['Venue Category'], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
wantlist_onehot['Neighborhood'] = WantList['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [wantlist_onehot.columns[-1]] + list(wantlist_onehot.columns[:-1])
wantlist_onehot = wantlist_onehot[fixed_columns]

wantlist_onehot


Unnamed: 0,Neighborhood,Coffee Shop,Gym,Park
14,Berczy Park,0,0,1
27,Berczy Park,1,0,0
37,Berczy Park,1,0,0
39,Berczy Park,1,0,0
44,Berczy Park,1,0,0
55,Berczy Park,1,0,0
56,Queen's Park,0,0,1
59,Queen's Park,1,0,0
60,Queen's Park,0,1,0
61,Queen's Park,1,0,0


In [55]:
wantlist_onehot.shape

(20, 4)

#### Now, I'll group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [56]:
wantlist_grouped = wantlist_onehot.groupby('Neighborhood').mean().reset_index()
wantlist_grouped

Unnamed: 0,Neighborhood,Coffee Shop,Gym,Park
0,Berczy Park,0.833333,0.0,0.166667
1,Queen's Park,0.75,0.166667,0.083333
2,Rosedale,0.0,0.0,1.0
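
The means above are shares of the wanted venues, so they don't directly say how many of each kind a neighborhood has. As a sketch of a more direct check, `pd.crosstab` gives raw counts, and a simple test shows which neighborhoods have at least one of each category. The `shown` frame is typed in from the ten WantList rows displayed earlier, so it is only a partial sample and not the full data.

```python
import pandas as pd

# The ten WantList rows displayed earlier (partial sample, typed in by hand)
shown = pd.DataFrame({
    'Neighborhood': ['Berczy Park'] * 6 + ["Queen's Park"] * 4,
    'Venue Category': ['Park', 'Coffee Shop', 'Coffee Shop', 'Coffee Shop',
                       'Coffee Shop', 'Coffee Shop',
                       'Park', 'Coffee Shop', 'Gym', 'Coffee Shop'],
})

# Count venues per neighborhood and category; reindex so a category that is
# missing everywhere still shows up as a column of zeros
counts = (
    pd.crosstab(shown['Neighborhood'], shown['Venue Category'])
    .reindex(columns=['Coffee Shop', 'Gym', 'Park'], fill_value=0)
)

# True where the neighborhood has at least one venue in every category
has_all_three = (counts > 0).all(axis=1)

print(counts)
print(has_all_three)
```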


#### I can also print each neighborhood along with the top 3 most common venues for each neighborhood like this...

In [57]:
num_top_venues = 3

for hood in wantlist_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = wantlist_grouped[wantlist_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
         venue  freq
0  Coffee Shop  0.83
1         Park  0.17
2          Gym  0.00


----Queen's Park----
         venue  freq
0  Coffee Shop  0.75
1          Gym  0.17
2         Park  0.08


----Rosedale----
         venue  freq
0         Park   1.0
1  Coffee Shop   0.0
2          Gym   0.0




#### Now, I'll put that into a *pandas* dataframe

#### First, I'll write a function to sort the venues in descending order.

In [58]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#### Now I'll create the new dataframe and display the top 3 venues for each neighborhood.

In [59]:
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = wantlist_grouped['Neighborhood']

for ind in np.arange(wantlist_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(wantlist_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Berczy Park,Coffee Shop,Park,Gym
1,Queen's Park,Coffee Shop,Gym,Park
2,Rosedale,Park,Gym,Coffee Shop


## 4. Cluster Neighborhoods

#### I'll now run *k*-means to cluster the neighborhoods into 3 clusters.

In [60]:
# set number of clusters
kclusters = 3

wantlist_grouped_clustering = wantlist_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(wantlist_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 1], dtype=int32)
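
With only 3 neighborhoods and 3 clusters, each neighborhood ends up in a cluster of its own, so the labels carry no extra information. The short sketch below reproduces this with the three frequency rows from the grouped table (typed in by hand as `profiles`); `n_init=10` is set explicitly as an assumption, to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Coffee Shop, Gym, Park frequencies for the three neighborhoods (typed in by hand)
profiles = np.array([
    [0.833333, 0.0, 0.166667],   # Berczy Park
    [0.75, 0.166667, 0.083333],  # Queen's Park
    [0.0, 0.0, 1.0],             # Rosedale
])

# Three distinct points and three clusters: every point becomes its own cluster
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit(profiles).labels_
print(len(set(labels.tolist())), 'distinct clusters for', len(profiles), 'neighborhoods')
```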

#### I'll now create a new dataframe that includes the cluster as well as the top 3 venues for each neighborhood.

In [61]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

wantlist_merged = toronto_data

# join neighborhoods_venues_sorted onto toronto_data to add the cluster label and top venues for each neighborhood
wantlist_merged = wantlist_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

wantlist_merged # check the last columns!

Unnamed: 0,Postcode,Neighborhood,Latitude,Longitude,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
20,M5E,Berczy Park,43.644771,-79.373306,Downtown Toronto,0,Coffee Shop,Park,Gym
4,M7A,Queen's Park,43.662301,-79.389494,Queen's Park,2,Coffee Shop,Gym,Park
91,M4W,Rosedale,43.679563,-79.377529,Downtown Toronto,1,Park,Gym,Coffee Shop


#### Finally, I'll visualize the resulting clusters

In [62]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(wantlist_merged['Latitude'], wantlist_merged['Longitude'], wantlist_merged['Neighborhood'], wantlist_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examine Clusters

#### Now, I can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, I can then assign a name to each cluster. 

#### Cluster 1 (Urban; Business)

In [63]:
wantlist_merged.loc[wantlist_merged['Cluster Labels'] == 0, wantlist_merged.columns[[1] + list(range(5, wantlist_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
20,Berczy Park,0,Coffee Shop,Park,Gym


#### Cluster 2 (Suburban; residential)

In [64]:
wantlist_merged.loc[wantlist_merged['Cluster Labels'] == 1, wantlist_merged.columns[[1] + list(range(5, wantlist_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
91,Rosedale,1,Park,Gym,Coffee Shop


#### Cluster 3 ('rural'; suburban; residential)

In [65]:
wantlist_merged.loc[wantlist_merged['Cluster Labels'] == 2, wantlist_merged.columns[[1] + list(range(5, wantlist_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
4,Queen's Park,2,Coffee Shop,Gym,Park


In [66]:
wantlist_grouped

Unnamed: 0,Neighborhood,Coffee Shop,Gym,Park
0,Berczy Park,0.833333,0.0,0.166667
1,Queen's Park,0.75,0.166667,0.083333
2,Rosedale,0.0,0.0,1.0


## Conclusions

#### 1. Rosedale only has parks. It does not have coffee shops or gyms. So we will eliminate Rosedale from our choices since it doesn't meet the client's requirements.

#### 2. Berczy Park has lots of coffee shops and 1 park, but does not have a gym. So we will also eliminate Berczy Park since it doesn't meet the client's requirements.

#### 3. Queen's Park, however, has lots of coffee shops, some gyms, and a park. It definitely meets the client's requirements!

#### 4. It turned out that the k-means clustering wasn't really needed to answer the question (with 3 neighborhoods and 3 clusters, each neighborhood simply gets its own cluster), but I showed it anyway.

## ANSWER: I suggest that the clients buy the house in Queen's Park, since it has all the things they want in a neighborhood!! New data science job... here they come!!