# Applied Data Science Capstone
## The Battle of the Neighborhoods

### Scenario:
Say you live on the west side of Toronto, Canada. You love your neighborhood, mainly because of its amenities and venues: gourmet fast food joints, pharmacies, parks, grad schools, and so on. Now say you receive a job offer with great career prospects from a company on the other side of the city. Given the distance from your current home, you would have to move if you accepted the offer. Wouldn't it be great if you could find neighborhoods on the other side of the city that are just like your current one or, failing that, similar neighborhoods that are at least closer to your new job?

### Objective
Given a city such as Toronto, you will segment it into neighborhoods using the geographical coordinates of each neighborhood's center. Then, using a combination of location data and machine learning, you will group the neighborhoods into clusters.

In [1]:
import pandas as pd
import numpy as np

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Part 1: Scraping Wikipedia

In [3]:
import pandas as pd
import requests # library to handle requests
from bs4 import BeautifulSoup # library to scrape using css selectors
from geopy.geocoders import Nominatim

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_text = requests.get(url).text
wiki_soup = BeautifulSoup(markup=wiki_text, features='html.parser')

In [5]:
# The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
column_names = ['PostalCode', 'Borough', 'Neighborhood'] # define the dataframe columns
neighborhoods = pd.DataFrame(columns=column_names) # instantiate the empty dataframe
neighborhoods

Unnamed: 0,PostalCode,Borough,Neighborhood


In [6]:
table_rows = wiki_soup.select('.wikitable > tbody > tr') # Select the rows of the table
rows = [] # Collect cleaned rows in a list; DataFrame.append was removed in pandas 2.0
for table_row in table_rows:
    row = table_row.select('td') # Select the cells of a row
    if len(row) < 3: # Skip the header row (no 'td' cells) and malformed rows
        continue
    postal_code = row[0].text.strip()  # Extract the postal code
    borough = row[1].text.strip()      # Extract the borough name
    neighborhood = row[2].text.strip() # Extract the neighborhood name
    if borough == 'Not assigned': # Ignore rows whose borough is Not assigned
        continue
    # If a row has a borough but a Not assigned neighborhood, the neighborhood takes the borough name.
    if neighborhood == 'Not assigned':
        neighborhood = borough
    rows.append({'PostalCode': postal_code,
                 'Borough': borough,
                 'Neighborhood': neighborhood})
neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [7]:
neighborhoods.shape

(103, 3)
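
The two cleaning rules used in the scraping loop (skip rows whose borough is `Not assigned`; let an unassigned neighborhood take the borough name) can be checked offline on a small table. This is only a sketch: the rows below are made up for illustration and are not taken from the Wikipedia page.

```python
import pandas as pd

# Made-up rows that mimic the raw table cells (illustrative only).
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: drop rows whose borough is not assigned.
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# Rule 2: an unassigned neighborhood takes the borough name.
unassigned = clean['Neighborhood'] == 'Not assigned'
clean.loc[unassigned, 'Neighborhood'] = clean.loc[unassigned, 'Borough']

clean = clean.reset_index(drop=True)
print(clean)
```

Working on whole columns like this avoids growing a dataframe row by row, which is slow for larger tables.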

## Part 2: Getting lat-long values for each neighborhood

In [8]:
from config import google_key
from geopy.geocoders import GoogleV3
geolocator = GoogleV3(api_key=google_key,user_agent="toronto_explorer")

In [9]:
postal_code_list = neighborhoods['PostalCode'].to_list()

In [10]:
latlong_hood = pd.DataFrame(columns=['PostalCode', 'Latitude', 'Longitude'])
latlong_hood

Unnamed: 0,PostalCode,Latitude,Longitude


In [11]:
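records = [] # DataFrame.append was removed in pandas 2.0, so collect rows in a list
for postal_code in postal_code_list:
    address = f'{postal_code}, Toronto, Ontario'
    location = geolocator.geocode(address)
    if location is None: # geocode returns None when the address is not found
        continue
    records.append({'PostalCode': postal_code,
                    'Latitude': location.latitude,
                    'Longitude': location.longitude})
latlong_hood = pd.DataFrame(records, columns=['PostalCode', 'Latitude', 'Longitude'])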
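
A geocoder can return `None` when it finds no match, and a remote service can fail transiently. One possible way to guard the loop is a small wrapper that retries on errors and records the misses. The helper name `safe_geocode`, the `FakeLocation` class, and the stub geocoder are hypothetical; they exist only so the sketch runs without an API key.

```python
import time

def safe_geocode(geocode, address, retries=3, delay=0.0):
    """Call geocode(address); retry on exceptions, return None on a miss.

    `geocode` is any callable, e.g. `geolocator.geocode` (assumption).
    """
    for attempt in range(retries):
        try:
            return geocode(address)
        except Exception:
            # Wait a little longer after each failed attempt.
            time.sleep(delay * (attempt + 1))
    return None

# Stub standing in for a real geocoder (illustrative only).
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude

known = {'M3A, Toronto, Ontario': FakeLocation(43.753259, -79.329656)}

def fake_geocode(address):
    return known.get(address)

records, missing = [], []
for postal_code in ['M3A', 'M0Z']:
    location = safe_geocode(fake_geocode, f'{postal_code}, Toronto, Ontario')
    if location is None:
        missing.append(postal_code)  # remember what could not be located
        continue
    records.append({'PostalCode': postal_code,
                    'Latitude': location.latitude,
                    'Longitude': location.longitude})
print(records, missing)
```

Keeping the list of missing postal codes makes it easy to retry them later or look them up by hand.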

In [12]:
neighborhoods = neighborhoods.join(latlong_hood.set_index('PostalCode'), on='PostalCode')

In [13]:
neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


## Part 3: Clustering the neighborhoods

In [14]:
from config import foursquare_clientid, foursquare_clientsecret, foursquare_version


In [28]:
LIMIT=100
def getNearbyVenues(postalcodes, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for postalcode, lat, lng in zip(postalcodes, latitudes, longitudes):
        print(postalcode)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            foursquare_clientid, 
            foursquare_clientsecret, 
            foursquare_version, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            postalcode, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [29]:
toronto_venues = getNearbyVenues(postalcodes=neighborhoods['PostalCode'],
                                latitudes=neighborhoods['Latitude'], 
                                longitudes=neighborhoods['Longitude'])

M3A
M4A
M5A
M6A
M7A
M9A
M1B
M3B
M4B
M5B
M6B
M9B
M1C
M3C
M4C
M5C
M6C
M9C
M1E
M4E
M5E
M6E
M1G
M4G
M5G
M6G
M1H
M2H
M3H
M4H
M5H
M6H
M1J
M2J
M3J
M4J
M5J
M6J
M1K
M2K
M3K
M4K
M5K
M6K
M1L
M2L
M3L
M4L
M5L
M6L
M9L
M1M
M2M
M3M
M4M
M5M
M6M
M9M
M1N
M2N
M3N
M4N
M5N
M6N
M9N
M1P
M2P
M4P
M5P
M6P
M9P
M1R
M2R
M4R
M5R
M6R
M7R
M9R
M1S
M4S
M5S
M6S
M1T
M4T
M5T
M1V
M4V
M5V
M8V
M9V
M1W
M4W
M5W
M8W
M9W
M1X
M4X
M5X
M8X
M4Y
M7Y
M8Y
M8Z
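
The list comprehension inside `getNearbyVenues` relies on the nesting `response`, then `groups[0]`, then `items`, then `venue`, as seen in the code above. The sketch below flattens a hand-written dictionary with that shape and also guards against a venue with an empty `categories` list, which would otherwise raise an `IndexError`. The payload and the helper name `flatten_items` are made up for illustration; this is not real API output.

```python
def flatten_items(postalcode, lat, lng, items):
    """Turn explore-style items into flat tuples, one per venue."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue carries no category.
        category = categories[0]['name'] if categories else None
        rows.append((postalcode, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hand-written payload mimicking the nesting used above (not real data).
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Park',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Example Stall',
               'location': {'lat': 43.76, 'lng': -79.32},
               'categories': []}},
]}]}}

items = payload['response']['groups'][0]['items']
rows = flatten_items('M3A', 43.753259, -79.329656, items)
for row in rows:
    print(row)
```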


### Some neighborhoods have no venues

In [64]:
neighborhoods_w_venues = toronto_venues['PostalCode'].nunique() # count postal codes, not latitudes
total_neighborhoods = len(neighborhoods)
print(f"The total number of Toronto's neighborhoods is {total_neighborhoods} but only {neighborhoods_w_venues} have venues")
print('These are the neighborhoods without venues:')
hood_no_venue = neighborhoods[~neighborhoods['PostalCode'].isin(toronto_venues['PostalCode'])]
hood_with_venues = neighborhoods[neighborhoods['PostalCode'].isin(toronto_venues['PostalCode'])]
hood_no_venue

The total number of Toronto's neighborhoods is 103 but only 98 have venues
These are the neighborhoods without venues:


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
11,M9B,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",43.650943,-79.554724
45,M2L,North York,"York Mills, Silver Hills",43.75749,-79.374714
52,M2M,North York,"Willowdale, Newtonbrook",43.789053,-79.408493
95,M1X,Scarborough,Upper Rouge,43.836125,-79.205636


In [32]:
# one-hot encode the venue categories
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# drop the venue category that is itself named 'Neighborhood'
toronto_onehot.drop('Neighborhood', axis=1, inplace=True)
# add the postal code back as the first column
toronto_onehot.insert(loc=0, column='PostalCode', value=toronto_venues['PostalCode'])
toronto_onehot.shape

(2128, 273)

In [33]:
neighborhoods_venues = toronto_onehot.groupby('PostalCode').mean().reset_index()
neighborhoods_venues.shape

(98, 273)
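
One-hot encoding followed by a grouped mean yields one row per postal code, where each column holds the share of that postal code's venues that fall in a given category. A toy example with made-up venues shows the arithmetic; `dtype=float` is passed here only to keep the indicator columns numeric across pandas versions.

```python
import pandas as pd

# Made-up venues: three in M3A, one in M4A (illustrative only).
venues = pd.DataFrame({
    'PostalCode': ['M3A', 'M3A', 'M3A', 'M4A'],
    'Venue Category': ['Park', 'Park', 'Bakery', 'Bakery'],
})

# One indicator column per category.
onehot = pd.get_dummies(venues[['Venue Category']],
                        prefix='', prefix_sep='', dtype=float)
onehot.insert(loc=0, column='PostalCode', value=venues['PostalCode'])

# Mean of the indicators = relative frequency of each category.
freq = onehot.groupby('PostalCode').mean().reset_index()
print(freq)
```

Here M3A ends up with two thirds `Park` and one third `Bakery`, while M4A is entirely `Bakery`. Because each row sums to 1, postal codes with many venues and postal codes with few are compared on the same scale.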

In [34]:
from sklearn.cluster import KMeans

In [35]:
# set number of clusters
kclusters = 4

neighborhoods_venues_freq = neighborhoods_venues.drop(columns='PostalCode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(neighborhoods_venues_freq)

# check that we have the same amount of labels as neighborhoods with venues
len(kmeans.labels_)

98
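
To make the clustering step less of a black box, here is a minimal NumPy sketch of the textbook Lloyd iteration that k-means is built on: assign each point to its nearest centroid, then move each centroid to the mean of its points. It is not scikit-learn's implementation, which adds refinements such as a smarter initialization. The function name `lloyd_kmeans` and the toy frequency vectors are made up for illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd iteration: assign points, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (a copy of them).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            # Keep the old centroid if a cluster ends up empty.
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # Final assignment against the last centroids.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centroids

# Toy "venue frequency" rows: three park-heavy, three cafe-heavy.
X = np.array([[0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
              [0.10, 0.90], [0.20, 0.80], [0.15, 0.85]])
labels, centroids = lloyd_kmeans(X, k=2)
print(labels)
print(centroids)
```

The label numbers themselves are arbitrary; what matters is which rows share a label.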

In [36]:
neighborhood_cluster = pd.DataFrame(data=kmeans.labels_, index=neighborhoods_venues['PostalCode'], columns=['Cluster'])

In [37]:
cluster_centroids = neighborhoods_venues.join(neighborhood_cluster, on='PostalCode')
cluster_centroids = cluster_centroids.drop(columns='PostalCode')
cluster_centroids = cluster_centroids.groupby('Cluster').mean()

In [41]:
compare_top = 5
compare = pd.DataFrame(index=range(compare_top))
for index in cluster_centroids.index:
    c = cluster_centroids.loc[index]
    c = c.reset_index()
    c.columns = [f'venue{index}', f'freq{index}']
    c = c.sort_values(f'freq{index}', ascending=False).head(compare_top).reset_index(drop=True)
    compare = compare.join(c)
compare

Unnamed: 0,venue0,freq0,venue1,freq1,venue2,freq2,venue3,freq3
0,Park,0.052957,Baseball Field,0.75,Park,0.484848,Coffee Shop,0.10194
1,Grocery Store,0.051902,Construction & Landscaping,0.25,Playground,0.068182,Pizza Place,0.053422
2,Bakery,0.031992,Accessories Store,0.0,Pool,0.05303,Café,0.042835
3,Bar,0.028298,Middle Eastern Restaurant,0.0,Trail,0.05303,Restaurant,0.031732
4,Skating Rink,0.027104,Monument / Landmark,0.0,Convenience Store,0.045455,Sandwich Place,0.029916


### Clusters
We name these four clusters as follows:
* Cluster 0: suburbs (parks, grocery stores, etc.)
* Cluster 1: big open fields
* Cluster 2: parks and recreation
* Cluster 3: coffee shops and restaurants

In [50]:
import folium

In [58]:
# one (icon, color) pair per cluster; the icon names come from Font Awesome
cluster_icon = [('home', 'blue'), ('wrench', 'black'), ('tree', 'green'), ('shopping-bag', 'red')]

In [66]:
neighborhoods_clustered = hood_with_venues.join(neighborhood_cluster, on='PostalCode')

In [67]:
# Create the map centered on Toronto
toronto = geolocator.geocode('Toronto, Ontario')
map_toronto_clusters = folium.Map(location=[toronto.latitude, toronto.longitude], zoom_start=11, tiles='Stamen Toner')

# Add one marker per neighborhood, styled by its cluster
for lat, lon, poi, cluster in zip(neighborhoods_clustered['Latitude'], neighborhoods_clustered['Longitude'], neighborhoods_clustered['Neighborhood'], neighborhoods_clustered['Cluster']):
    label_str = f'{poi}\nCluster:{cluster}'
    label = folium.Popup(label_str, parse_html=True)
    icon = folium.Icon(color=cluster_icon[cluster][1],icon=cluster_icon[cluster][0], prefix='fa')
    folium.Marker(
        location=[lat, lon],
        popup=label,
        tooltip=f'{poi.split(",")[0]} Cluster {cluster}',
        icon=icon).add_to(map_toronto_clusters)

map_toronto_clusters

### Result
The clustering results in a map like this:
![](toronto_clustered.jpg)