# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by Jason

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

In this project, we will use machine learning tools to cluster Toronto and New York neighborhoods in order to recommend the neighborhoods which are the best choices for migrants based on surrounded essential facilities such as school, hospital, and stores etc.

We will be using two datasets for this project: The first dataset toront.csv consists of Toronto’s boroughs, Neighbourhoods and their respective postcodes. The second dataset NewYork.csv consists of NewYork’s city name, districts and subdistrict. Both datasets were scraped from Wikipedia page.

Foursquare API provides access to massive datasets of location data and venues information including address, images, tips, ratings and comments. In this project, we will use Foursquare API and Geopy data to locate nearby venues within 500 meters of each neighbourhood in Toronto and New York.

## Data <a name="data"></a>

### Data Collection
Toronto ad New York neighborhoods' data will be scraped from Wikipedia page and converted into pandas dataframe. 

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import requests

from geopy.geocoders import Nominatim
from pandas.io.json import json_normalize
import json

from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

! pip install folium==0.5.0
import folium

print("Libraries imported")

  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
Libraries imported


In [2]:
# Toronto 
!wget -q -O 'toronto_data.csv' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv
print('Data downloaded!')

df_toronto = pd.read_csv('toronto_data.csv')
df_toronto.head()

# Part 1 Data
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['Postal Code'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)
df_part1=pd.DataFrame(table_contents)
df_part1['Borough']=df_part1['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

# Merge
Toronto_neighborhoods_df = pd.merge(df_part1,
                 df_toronto[['Postal Code','Latitude', 'Longitude']],
                 on='Postal Code')
Toronto_neighborhoods_df.head()

Data downloaded!


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


In [3]:
# Same for New York
# Download data from external source
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
newyork_neighborhood = newyork_data['features']
# Transform data into a DataFrame
column_names = ['Postal Code','Borough', 'Neighborhood', 'Latitude', 'Longitude'] 
newyork_df = pd.DataFrame(columns=column_names)
for data in newyork_neighborhood:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    newyork_df = newyork_df.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
newyork_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,,Bronx,Wakefield,40.894705,-73.847201
1,,Bronx,Co-op City,40.874294,-73.829939
2,,Bronx,Eastchester,40.887556,-73.827806
3,,Bronx,Fieldston,40.895437,-73.905643
4,,Bronx,Riverdale,40.890834,-73.912585


## Methodology

We now have the borough, neighborhood, latitude and longitude data files ready for Toronto and New York

Let's then merge the two city's data into one data frame

In [4]:
df_toronto_newyork =  pd.concat([Toronto_neighborhoods_df, newyork_df])
print('There are total {} neighbourhoods in Toronto and New York.'.format(df_toronto_newyork.shape[0]))
df_toronto_newyork.head()

There are total 409 neighbourhoods in Toronto and New York.


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


Data Visuliztion

In [15]:
# Define FourSquare User Credentials
CLIENT_ID = 'IYFBVOOC4FUPEP1IKI4ZIKTDVQHM10BBOUHPHD0KFSW3WJBQ' # your Foursquare ID
CLIENT_SECRET = 'HABZNP1ZU5SF0NLFQX15GHIP2XL1PGPK3P2HPWCSNATU1YE5' # your Foursquare Secret
ACCESS_TOKEN = 'Y5ZE2W2GA43XRCWLIZER0EOEEZGSHMGK3OE2IMPGCHUJC5AT' # your FourSquare Access Token
VERSION = '20180604'
LIMIT = 30
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: IYFBVOOC4FUPEP1IKI4ZIKTDVQHM10BBOUHPHD0KFSW3WJBQ
CLIENT_SECRET:HABZNP1ZU5SF0NLFQX15GHIP2XL1PGPK3P2HPWCSNATU1YE5


In [16]:
# Define a funciton that creates a map of different neighbourhoods in the city.
def getMap (cityname,countryname,dataframe):
    address = cityname + ',' + countryname

    geolocator = Nominatim(user_agent="foursquare_agent")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print('The geograpical coordinate of ' + cityname + ' are {}, {}.'.format(latitude, longitude))
    
    # create map using latitude and longitude values.
    city_map = folium.Map(location=[latitude, longitude], zoom_start=10)

    # set color scheme for the Borough
    borough_name = dataframe['Borough'].unique().tolist()
    colnum = dataframe['Borough'].unique().size
    x = np.arange(colnum)
    ys = [i+x+(i*x)**2 for i in range(colnum)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map
    for lat, lon, Neighborhood, borough in zip(dataframe['Latitude'], dataframe['Longitude'], dataframe['Neighborhood'], dataframe['Borough']):
        cluster = borough_name.index(borough)
        label = '{}, {}'.format(Neighborhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster],
            fill=True,
            fill_color=rainbow[cluster],
            fill_opacity=0.7).add_to(city_map)  
    return city_map

In [17]:
# Get Toronto's Map
getMap('Toronto','Canada',Toronto_neighborhoods_df)

The geograpical coordinate of Toronto are 43.6534817, -79.3839347.


In [18]:
# Get New York map
getMap('New York', 'United States', newyork_df)

The geograpical coordinate of New York are 40.7127281, -74.0060152.


Use Foursqaure API to get Venues

In [19]:
def getNearbyVenues(cities, boroughs, neighbourhoods, latitudes, longitudes, radius=500):
    
    # Define limit of venues to get
    LIMIT = 100
    
    venues_list=[]
    for city, borough, neighbourhood, lat, lng in zip(cities, boroughs, neighbourhoods, latitudes, longitudes):
        print(neighbourhood)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(city, 
            borough, 
            neighborhood, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City',
                  'Borough',          
                  'Neighborhood', 
                  'Latitude', 
                  'Longitude', 
                  'Venue',  
                  'Venue Category']
    
    return(nearby_venues)

Call the getNearbyVenues function on each Neighborhood and create a new dataframe called toronto_venues

In [20]:
toronto_venues = getNearbyVenues(Toronto_neighborhoods_df['Postal Code'], Toronto_neighborhoods_df['Borough'], 
                                 Toronto_neighborhoods_df['Neighborhood'], Toronto_neighborhoods_df['Latitude'],
                                 Toronto_neighborhoods_df['Longitude'])

Parkwoods


KeyError: 'groups'

Check the size of DataFrame

In [None]:
print(toronto_venues.shape)
toronto_venues.head()

Check how many venues were returned for each Neiborhood.

In [None]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))
toronto_venues.groupby('Neighborhood').count()

Let's do the same for New York city

In [None]:
newyork_venues = getNearbyVenues(newyork_df['Postal Code'], newyork_df['Borough'], 
                                 newyork_df['Neighborhood'], newyork_df['Latitude'],
                                 newyork_df['Longitude']
                                )

In [None]:
print(newyork_venues.shape)
newyork_venues.head()

Check how many venues are returned in each neighborhood

In [None]:
print('There are {} uniques categories.'.format(len(newyork_venues['Venue Category'].unique())))
newyork_venues.groupby('Neighborhood').count()

For the combined DataFrame

In [None]:
toronto_newyork_venues =  pd.concat([toronto_venues,newyork_venues])
print(toronto_newyork_venues.shape)

toronto_newyork_venues.head()

In [None]:
# Make all restaurants category into one
toronto_newyork_venues.loc[toronto_newyork_venues['Venue Category'].str.contains('Restaurant', case=False), 'Venue Category'] = 'Restaurant'
print(toronto_newyork_venues.shape)
toronto_newyork_venues.head(10)

Perform onehot encoding

In [None]:
toronto_newyork_onehot = pd.get_dummies(toronto_newyork_venues[['Venue Category']], prefix="", prefix_sep="")

# add Neighbourhood column back to dataframe
toronto_newyork_onehot['Neighborhood'] = toronto_newyork_venues['Neighborhood'] 

# move Neighbourhood column to the first column
fixed_columns = [toronto_newyork_onehot.columns[-1]] + list(toronto_newyork_onehot.columns[:-1])
print(toronto_newyork_onehot.shape)
toronto_newyork_onehot.head()

Group categories

In [None]:
toronto_newyork_grouped = toronto_newyork_onehot.groupby('Neighborhood').mean().reset_index()
print(toronto_newyork_grouped.shape)
toronto_newyork_grouped.head()

Print each Neighbourhood along with the top 5 most common venues.

In [None]:
num_top_venues = 5

for hood in toronto_newyork_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_newyork_grouped[toronto_newyork_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

Sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create the new dataframe and display the top 10 venues for each Neighbourhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
toronto_newyork_venues_sorted = pd.DataFrame(columns=columns)
toronto_newyork_venues_sorted['Neighborhood'] = toronto_newyork_grouped['Neighborhood']

for ind in np.arange(toronto_newyork_grouped.shape[0]):
    toronto_newyork_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_newyork_grouped.iloc[ind, :], num_top_venues)

print(toronto_newyork_venues_sorted.shape)    
toronto_newyork_venues_sorted.head(10)


### Use Machine Learning algorithms

We will apply K-Means model to segment and cluster all the neighborhoods in Toronto and Shanghai.
First, we use Elbow Method to determine the value of K.

In [None]:
import sklearn
import matplotlib.pyplot as plt

Sum_of_squared_distances = []
K = range(1, 15)
toronto_newyork_grouped_clustering = toronto_newyork_grouped.drop('Neighborhood', 1)

for kvalues in K:
    km = KMeans(n_clusters=kvalues, init='k-means++', n_init=10, max_iter=300, tol=0.0001,  random_state=10).fit(toronto_newyork_grouped_clustering)
    Sum_of_squared_distances.append(km.inertia_)
    
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of Squared Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

The elbow point of the line chart is determined as the right K for clustering. Here K equals to 4.

In [None]:
# set number of clusters to 4
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_newyork_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

Create a new dataframe that includes the cluster as well as the top 10 venues for each Neighbourhood in Toronto and New York.

In [None]:
# add clustering labels
toronto_newyork_venues_sorted.insert(0, 'Cluster_Labels', kmeans.labels_)

toronto_newyork_merged = df_toronto_newyork

# merge sorted venues with df_toronto_sh to add latitude/longitude for each Neighbourhood
toronto_newyork_merged = toronto_newyork_merged.join(toronto_newyork_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

#Drop the rows that have no data.
toronto_newyork_merged=toronto_newyork_merged.dropna()
toronto_newyork_merged.shape

print(toronto_newyork_merged.shape)
toronto_newyork_merged.head(10)

Define a function to visualize the resulting clusters.

In [None]:
def displayClusters(city, country, dataframe):

    address = city + ',' + country

    geolocator = Nominatim(user_agent="Foursquare_agent")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    cluster_map = folium.Map(location=[latitude, longitude], zoom_start=10)

    # set color scheme for the Cluster_Labels
    x = np.arange(kclusters)
    ys = [i+x+(i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map
    for lat, lon, neighborhood, borough, cluster_labels in zip(dataframe['Latitude'], 
                                                                dataframe['Longitude'], 
                                                                dataframe['Neighborhood'], 
                                                                dataframe['Borough'],
                                                                dataframe['Cluster_Labels']):
        cluster = int(cluster_labels)
        label = '{}, {},Cluster {}'.format(neighborhood, borough, cluster)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster],
            fill=True,
            fill_color=rainbow[cluster],
            fill_opacity=0.7).add_to(cluster_map)
    return cluster_map


## Results

Let's check each neighborhood in each cluster

In [None]:
for cluster_num in range(4):
    num_of_nbh = toronto_newyork_merged[toronto_newyork_merged['Cluster_Labels'] == cluster_num].shape[0]
    print('The number of neighborhoods in cluster {} is {}'.format(cluster_num+1, num_of_nbh))

Mostly suburb areas where have access to hotels, airports and football stadium.

In [None]:
cluster1 = toronto_sh_merged.loc[toronto_sh_merged['Cluster_Labels'] == 0]
print('There are {} neighborhoods in cluster 1'.format(cluster1.shape[0]))
cluster1

In [21]:
print('Toronto:')
displayClusters('Toronto', 'Canada', cluster1[cluster1['City'] == 'Toronto'])

Toronto:


NameError: name 'displayClusters' is not defined

In [None]:
print('New York:')
displayClusters('New York', 'United States', cluster1[cluster1['City'] == 'New York'])

Basically are residential areas with parks, grocery stores, pharmacy and restaurants.

In [None]:
cluster2 = toronto_newyork_merged.loc[toronto_newyork_merged['Cluster_Labels'] == 1]
print('There are {} neighborhoods in cluster 2'.format(cluster2.shape[0]))
cluster2

In [None]:
print('Toronto:')
displayClusters('Toronto', 'Canada', cluster1[cluster1['City'] == 'Toronto'])

In [None]:
print('New York:')
displayClusters('New York', 'United States', cluster1[cluster1['City'] == 'New York'])

Including neighbourhoods with restaurants and distribution centers.

In [None]:
cluster3 = toronto_newyork_merged.loc[toronto_newyork_merged['Cluster_Labels'] == 2]
print('There are {} neighborhoods in cluster 3'.format(cluster3.shape[0]))
cluster3

In [None]:
print('Toronto:')
displayClusters('Toronto', 'Canada', cluster1[cluster1['City'] == 'Toronto'])

In [None]:
print('New York:')
displayClusters('New York', 'United States', cluster1[cluster1['City'] == 'New York'])

Mostly downtown areas where surrounded by lots of restaurants, cafeteria, bars, convenience stores and different kinds of shops.

In [None]:
cluster4 = toronto_newyork_merged.loc[toronto_newyork_merged['Cluster_Labels'] == 3]
print('There are {} neighborhoods in cluster 4'.format(cluster4.shape[0]))
cluster4

In [None]:
print('Toronto:')
displayClusters('Toronto', 'Canada', cluster1[cluster1['City'] == 'Toronto'])

In [None]:
print('New York:')
displayClusters('New York', 'United States', cluster1[cluster1['City'] == 'New York'])

## Discussion
From the results, we can conclude that for those who prefer to settle down in a residential area where surrounded by parks, grocery stores, pharmacy and restaurants, cluster 2 would be the best choice. While for those who prefer to live in a more crowded area where have access to a variety of venues, cluster 4 would be the best choice.
However, from the results, we notice that the majority of the neighbourhoods in Toronto lie into cluster 2 and 4. This is due to the limitations this research hold. To result in a better clustering, we will need further data such as more detailed venues information in New York. 