# Capstone Final Project-Battle of the Neighborhoods-wk1

By Lian Wang

July 2020

## Table of contents

* [Introduction](#Introduction)
* [Data](#Data)
* [Methodology](#Methodology)
* [Results](#Results)
* [Discussions](#Discussions)
* [Conclusions](#Conclusions)



## Introduction

In this age of convenient and efficient transportation, travelling around the world has become easier and more common. With globalization, people also relocate between cities and countries more often.

When people move from one city to another, most simply pick a neighborhood near work or school first, then move to a different neighborhood once they have settled down and know the surrounding areas better. Moving several times isn't unusual in this scenario. This is not an optimal process, because we are limited by the information we have access to, often word of mouth or a visit to a few nearby areas. Moving also brings anxiety through fear of the unknown. It would be helpful to compare the new city with the city we currently live in, so that we can identify areas in the new city that might suit us and understand the differences in advance. This could reduce the number of relocations, which are a big hassle for those moving with a family, and ease the mental burden of relocating.

However, research on a new city often yields only city-level information such as population, history, economic conditions, and climate. This gives an overall picture of the city, which suits tourists but does not help in selecting an area to live in. For that purpose, the neighborhood-level details matter most: the kinds of shops, entertainment facilities, athletic centers, and schools nearby.

Fortunately, location data platforms like **Foursquare** now provide detailed information on all kinds of venues around any geographical location of interest. In this project, we combine the rich location data provided by **Foursquare** with machine learning to compare neighborhoods in two (or more) cities and fill this gap in comparative information at the neighborhood level. We hope this additional dimension of information (which could be packaged as a tool if this work is turned into an app) will help make relocation an easier and better experience. Neighborhood-level comparisons among cities could also be valuable for people searching for the next city to live in. For this project, we focus on comparing the neighborhoods of New York City and Toronto as an illustrative example. The objective is to summarize how different or similar the two cities are based on their neighborhoods, and to offer recommendations for relocating between the two cities.



### Data

To use the **Foursquare** platform to gather neighborhood venue information, we need data that lists the neighborhoods in each city together with the latitude and longitude coordinates of each neighborhood.





We will first install packages and load the necessary libraries. The code in the cell below was run twice. The first run included the installation of the geopy and folium packages, which took a long time and generated a lot of distracting output. The second run had the two installation lines commented out to hide that output.

In [63]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # comment out this line once installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
try:
    from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0)
except ImportError:
    from pandas.io.json import json_normalize # fallback for older pandas versions

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # comment out this line once installed
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### 1. Load and extract neighborhood data for New York City

For New York City, this data has already been compiled into one file for this IBM course, available at https://cocl.us/new_york_dataset. The original source of the data is https://geo.nyu.edu/catalog/nyu_2451_34572.

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

with open('newyork_data.json') as json_data:
    ny_data = json.load(json_data)
    
nyNBHs_data = ny_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one record per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in nyNBHs_data:
    neighborhood_latlon = data['geometry']['coordinates'] # GeoJSON order: [longitude, latitude]
    rows.append({'Borough': data['properties']['borough'],
                 'Neighborhood': data['properties']['name'],
                 'Latitude': neighborhood_latlon[1],
                 'Longitude': neighborhood_latlon[0]})

nyNBHs = pd.DataFrame(rows, columns=column_names)
    
nyNBHs['City'] = 'New York City'

nyNBHs.head() # if from saved combined NBHs dataframe, could use the subset nyNBHs = NBHs[NBHs['City']=='New York City']

Data downloaded!


| | Borough | Neighborhood | Latitude | Longitude | City |
|---|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 | New York City |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 | New York City |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 | New York City |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 | New York City |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 | New York City |


nyNBHs is the data frame that holds the New York City neighborhood data, and the output above shows its first 5 rows. The data set contains 306 neighborhoods for New York City.

In [8]:
print('There are {} neighborhoods in New York City'.format(nyNBHs.shape[0]))

There are 306 neighborhoods in New York City


We can use the Nominatim geocoder from the geopy library to look up the latitude and longitude of New York City, then visualize the city and its neighborhoods on a map.

In [3]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="city_explorer")
location = geolocator.geocode(address)
latitudeNY = location.latitude
longitudeNY = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitudeNY, longitudeNY)) # 40.7127281, -74.0060152; hard-code these to avoid installing geopy

The geographical coordinates of New York City are 40.7127281, -74.0060152.


In [5]:
# create map of New York using latitude and longitude values
map_NY= folium.Map(location=[latitudeNY, longitudeNY], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(nyNBHs['Latitude'], nyNBHs['Longitude'], nyNBHs['Borough'], nyNBHs['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc', 
        fill_opacity=0.7,
        parse_html=False).add_to(map_NY)  
    
map_NY

**Note:** GitHub cannot render folium maps. To see the map, go to https://nbviewer.jupyter.org/ and open this notebook using the link I provided.

### 2. Load and create neighborhood data for Toronto

For Toronto, the list of neighborhoods and their postal codes is scraped from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. We could use the geocoder Python package to retrieve coordinates from the postal codes, but it is unreliable: a while loop that keeps requesting until every postal code returns a result can stall for an unreasonably long time. Instead, we use the csv file with the coordinates of each Toronto postal code that is provided for this IBM course at http://cocl.us/Geospatial_data.
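
The stalling problem described above comes from retrying without a limit. As a minimal sketch of a bounded alternative (not what this notebook uses), the helper below retries a lookup a fixed number of times and then gives up. The names `geocode_with_retry` and `flaky_lookup` are hypothetical, and `flaky_lookup` is a stand-in for a real geocoding call.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) up to max_attempts times; return None if all attempts fail.

    `lookup` is any callable that returns a (latitude, longitude) tuple,
    or None when the service gives no answer.
    """
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None  # give up instead of looping forever


# Hypothetical stand-in for a geocoding service that fails twice, then succeeds.
calls = {"count": 0}


def flaky_lookup(postal_code):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return (43.753259, -79.329656)


coords = geocode_with_retry(flaky_lookup, "M3A")
missing = geocode_with_retry(lambda q: None, "M4A", max_attempts=2)
print(coords, missing)
```

Postal codes that still return `None` can then be filled from another source, such as the course csv file.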

In [9]:
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M') 
#dfs[0].head()  #dfs[0] is the table we need

# assign dfs[0] to a new dataframe
df_tor = dfs[0]

# drop rows with Borough=='Not assigned'
df_tor = df_tor[df_tor.Borough != 'Not assigned'].reset_index(drop=True)
print('The shape of the data',df_tor.shape)
df_tor.head()

#(df_tor.Neighborhood=='Not assigned').value_counts() # 103 False, no "Not assigned" Neighborhood after dropping "Not assigned" Boroughs

print('There are {} rows of data.'.format(df_tor.shape[0]))


The shape of the data (103, 3)
There are 103 rows of data.


In [14]:
torCoordFromFile = pd.read_csv('http://cocl.us/Geospatial_data') # postal code with corresponding geographical data
#torCoordFromFile.head()

torNBHs = pd.merge(df_tor, torCoordFromFile, on='Postal Code') # join on the shared postal code column
torNBHs['City'] = 'Toronto'
torNBHs.rename(columns={'Neighbourhood':"Neighborhood"},inplace=True)

torNBHs.head() 


| | Postal Code | Borough | Neighborhood | Latitude | Longitude | City |
|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | Toronto |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | Toronto |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 | Toronto |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 | Toronto |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 | Toronto |


torNBHs is the data frame that holds the Toronto neighborhood data, and the output above shows its first 5 rows. The data set contains 103 neighborhoods (postal codes) for Toronto.

Next, we visualize Toronto and its neighborhoods on a map.
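
A default merge silently drops postal codes that have no match in the coordinates file. As a hedged sketch of how one could check for that, the example below uses small made-up frames (the `M9Z` row is hypothetical) with a left merge and pandas' `indicator` option.

```python
import pandas as pd

# Toy stand-ins for the scraped table and the coordinates file (illustrative only).
postal = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left merge keeps every postal code; the _merge column records where each row was found.
checked = postal.merge(coords, on="Postal Code", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every postal code received coordinates.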

In [11]:
address = 'Toronto, Ontario'

location = geolocator.geocode(address)
latitudeTor = location.latitude
longitudeTor = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitudeTor, longitudeTor)) # 43.6534817, -79.3839347; hard-code these to avoid installing geopy

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [15]:
# create map of Toronto using latitude and longitude values
map_Tor= folium.Map(location=[latitudeTor, longitudeTor], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(torNBHs['Latitude'], torNBHs['Longitude'], torNBHs['Borough'], torNBHs['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Tor)  
    
map_Tor

**Note:** GitHub cannot render folium maps. To see the map, go to https://nbviewer.jupyter.org/ and open this notebook using the link I provided.

### 3. Using Foursquare API to retrieve the venues data for neighborhoods

Once we have the neighborhood data with coordinates for both cities, we use the **Foursquare API** to retrieve information on the venues within a certain radius (say, 500 or 1000 meters) of each neighborhood. The service and activity venues in a neighborhood characterize it and reflect the convenience and lifestyle of the people living there. Hence, the venue categories and their associated venue counts are meaningful features for grouping neighborhoods into clusters. Because our purpose is to compare the two cities, we compile a combined data set for clustering analysis based on neighborhood venue features, and then examine how the clusters are distributed between the two cities.
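
To make "venue categories and their counts as features" concrete, here is a minimal sketch on made-up rows. It assumes a venue table with `City`, `Neighborhood`, and `Venue Category` columns; one-hot encoding followed by a group sum is one common way to build such features, not necessarily the exact method applied later in this project.

```python
import pandas as pd

# Made-up venue rows in the same layout as the Foursquare results.
venues = pd.DataFrame({
    "City": ["New York City"] * 3 + ["Toronto"] * 2,
    "Neighborhood": ["Wakefield", "Wakefield", "Wakefield", "Parkwoods", "Parkwoods"],
    "Venue Category": ["Pharmacy", "Pharmacy", "Dessert Shop", "Park", "Pharmacy"],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
onehot.insert(0, "City", venues["City"])

# Venue counts per category for each neighborhood.
counts = onehot.groupby(["City", "Neighborhood"]).sum()

# Share of each category, so neighborhoods with many and few venues are comparable.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(2))
```

Each row of `shares` is a feature vector that a clustering algorithm such as k-means could take as input.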


In [16]:
NBHs = pd.concat([nyNBHs, torNBHs], ignore_index=True, join='inner') #combine the neighborhood data from the two cities into one
NBHs['City'].value_counts()

New York City    306
Toronto          103
Name: City, dtype: int64

The combined data frame contains 306 neighborhoods for New York City and 103 for Toronto, matching the counts in the separate data sets.
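
Note that `join='inner'` keeps only the columns the two frames share, so a column that exists for one city only (such as Toronto's postal code) is dropped from the combined frame. The sketch below illustrates this on two single-row, made-up frames.

```python
import pandas as pd

ny = pd.DataFrame({"Borough": ["Bronx"],
                   "Neighborhood": ["Wakefield"],
                   "City": ["New York City"]})
tor = pd.DataFrame({"Postal Code": ["M3A"],
                    "Borough": ["North York"],
                    "Neighborhood": ["Parkwoods"],
                    "City": ["Toronto"]})

# join='inner' keeps the intersection of the columns.
inner = pd.concat([ny, tor], ignore_index=True, join="inner")

# join='outer' keeps the union and fills the gaps with NaN.
outer = pd.concat([ny, tor], ignore_index=True, join="outer")

print(list(inner.columns))
print(outer["Postal Code"].isna().sum())
```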

The hidden cell below contains the credentials for accessing the **Foursquare API**.

In [17]:
# The code was removed by Watson Studio for sharing.

Next, we borrow the _getNearbyVenues_ function from the course lab to request data from the API and extract relevant info.

In [44]:
def getNearbyVenues(cities, names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for city, name, lat, lng in zip(cities, names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            city,
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City',
                  'Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
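
The function above indexes straight into the response, so a failed request or a venue without categories would raise an exception partway through the loop. As a hedged sketch of more defensive parsing, the hypothetical helper `extract_venues` below tolerates both cases. The response shape is assumed from the indexing used above, and `sample` is hand-written rather than real API output.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response dict."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []  # e.g. an error response without any venue groups
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Some venues may carry no category; record None instead of failing.
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Hand-written sample in the assumed response shape.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Walgreens",
               "location": {"lat": 40.896528, "lng": -73.8447},
               "categories": [{"name": "Pharmacy"}]}},
    {"venue": {"name": "Unlabeled Spot",
               "location": {"lat": 40.0, "lng": -73.0},
               "categories": []}},
]}]}}

parsed = extract_venues(sample)
empty = extract_venues({"meta": {"code": 429}})
print(parsed, empty)
```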



We retrieve up to 100 venues within 500 meters of the coordinates that define each neighborhood, for all neighborhoods in the two cities. The requests return 12,186 venues in 456 unique categories.

In [45]:
LIMIT = 100 # return top 100 venues 
radius = 500 # within 500 meters of a location

# get the venues data for all the neighborhoods in combined NBHs data
Venues = getNearbyVenues(cities=NBHs['City'],
                         names=NBHs['Neighborhood'],
                         latitudes=NBHs['Latitude'],
                         longitudes=NBHs['Longitude'],
                         radius=radius,
                         LIMIT=LIMIT)
# checking the cleaned up data extracted from Foursquare
print(Venues.shape)
Venues.head()


(12186, 8)


| | City | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | New York City | Wakefield | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop |
| 1 | New York City | Wakefield | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop |
| 2 | New York City | Wakefield | 40.894705 | -73.847201 | Walgreens | 40.896528 | -73.8447 | Pharmacy |
| 3 | New York City | Wakefield | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy |
| 4 | New York City | Wakefield | 40.894705 | -73.847201 | Shell | 40.894187 | -73.845862 | Gas Station |


In [46]:
# learn a bit more about the Venues data

print('There are {} unique categories.'.format(len(Venues['Venue Category'].unique())))
#Venues.groupby('Neighborhood').count()


There are 456 unique categories.


In [53]:
Venues.groupby(['City','Neighborhood']).count().sort_values(by=['Venue Category']).tail()

| City | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| New York City | Brooklyn Heights | 100 | 100 | 100 | 100 | 100 | 100 |
| New York City | Civic Center | 100 | 100 | 100 | 100 | 100 | 100 |
| Toronto | Commerce Court, Victoria Hotel | 100 | 100 | 100 | 100 | 100 | 100 |
| New York City | Chelsea | 105 | 105 | 105 | 105 | 105 | 105 |
| New York City | Murray Hill | 147 | 147 | 147 | 147 | 147 | 147 |


In [64]:
Venues.groupby(['City','Neighborhood']).count().sort_values(by=['Venue Category']).head()

| City | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| Toronto | Humberlea, Emery | 1 | 1 | 1 | 1 | 1 | 1 |
| New York City | Howland Hook | 1 | 1 | 1 | 1 | 1 | 1 |
| Toronto | Willowdale, Newtonbrook | 1 | 1 | 1 | 1 | 1 | 1 |
| New York City | Somerville | 1 | 1 | 1 | 1 | 1 | 1 |
| Toronto | Humber Summit | 1 | 1 | 1 | 1 | 1 | 1 |


When checking the number of venues returned for individual neighborhoods, we see that some neighborhoods have only one venue within the 500-meter radius, while others reach the maximum of 100 venues per request. We should keep in mind that those neighborhoods may have more than 100 venues. We also notice that two neighborhoods ("Chelsea" and "Murray Hill") show more than 100 venues, which a single request cannot return. This suggests that more than one neighborhood is named "Chelsea" or "Murray Hill". So, we went back to check how many neighborhood names appear more than once in the neighborhood data.

In [59]:
# this confirms that 6 neighborhood names appear more than once in the NBHs data frame
NBHs.groupby(['City','Neighborhood']).count().sort_values(by=['Borough']).tail(10)

| City | Neighborhood | Borough | Latitude | Longitude |
|---|---|---|---|---|
| New York City | Kew Gardens | 1 | 1 | 1 |
| New York City | High Bridge | 1 | 1 | 1 |
| New York City | Jamaica Estates | 1 | 1 | 1 |
| New York City | Jamaica Hills | 1 | 1 | 1 |
| New York City | Chelsea | 2 | 2 | 2 |
| Toronto | Don Mills | 2 | 2 | 2 |
| New York City | Sunnyside | 2 | 2 | 2 |
| New York City | Murray Hill | 2 | 2 | 2 |
| New York City | Bay Terrace | 2 | 2 | 2 |
| Toronto | Downsview | 4 | 4 | 4 |


In [61]:
# once the borough is taken into account, only two names (Don Mills and Downsview) still appear more than once
NBHs.groupby(['City','Borough','Neighborhood']).count().sort_values(by=['Latitude']).tail(10)

| City | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| New York City | Manhattan | Gramercy | 1 | 1 |
| New York City | Manhattan | Flatiron | 1 | 1 |
| New York City | Manhattan | Financial District | 1 | 1 |
| New York City | Manhattan | East Village | 1 | 1 |
| New York City | Manhattan | East Harlem | 1 | 1 |
| New York City | Manhattan | Clinton | 1 | 1 |
| New York City | Manhattan | Lenox Hill | 1 | 1 |
| Toronto | York | Weston | 1 | 1 |
| Toronto | North York | Don Mills | 2 | 2 |
| Toronto | North York | Downsview | 4 | 4 |


In [62]:
# these two neighborhoods are split across 2 and 4 postal codes, so every row is unique once the coordinates are included
NBHs.groupby(['City','Borough','Neighborhood','Longitude']).count().sort_values(by=['Latitude']).tail(10)

| City | Borough | Neighborhood | Longitude | Latitude |
|---|---|---|---|---|
| New York City | Manhattan | Hudson Yards | -74.000111 | 1 |
| New York City | Manhattan | Hamilton Heights | -73.949688 | 1 |
| New York City | Manhattan | Greenwich Village | -73.999914 | 1 |
| New York City | Manhattan | Gramercy | -73.981376 | 1 |
| New York City | Manhattan | Flatiron | -73.990947 | 1 |
| New York City | Manhattan | Financial District | -74.010665 | 1 |
| New York City | Manhattan | East Village | -73.982226 | 1 |
| New York City | Manhattan | East Harlem | -73.944182 | 1 |
| New York City | Manhattan | Lincoln Square | -73.985338 | 1 |
| Toronto | York | Weston | -79.518188 | 1 |


Because our **Foursquare API** requests are based on each (latitude, longitude) pair in the NBHs data frame, later analysis will include latitude and longitude as grouping factors when rolling data up to the neighborhood level. We keep in mind that Don Mills and Downsview are represented by 2 and 4 sub-locations, respectively, in our data.
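
The effect of adding the coordinates to the grouping keys can be sketched on made-up rows. The two coordinate pairs below are illustrative values for two sub-locations that share one neighborhood name, not figures taken from the data above.

```python
import pandas as pd

# Five made-up venues around two sub-locations that share the name "Downsview".
venues = pd.DataFrame({
    "City": ["Toronto"] * 5,
    "Neighborhood": ["Downsview"] * 5,
    "Neighborhood Latitude": [43.7374, 43.7374, 43.7390, 43.7390, 43.7390],
    "Neighborhood Longitude": [-79.4645, -79.4645, -79.5069, -79.5069, -79.5069],
    "Venue": ["A", "B", "C", "D", "E"],
})

# Grouping by name alone merges the two sub-locations into one row.
by_name = venues.groupby(["City", "Neighborhood"])["Venue"].count()

# Adding the coordinates keeps each requested location separate.
keys = ["City", "Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude"]
by_location = venues.groupby(keys)["Venue"].count()

print(by_name.tolist(), by_location.tolist())
```

Grouping by name alone reproduces the inflated counts seen for "Chelsea" and "Murray Hill", while the coordinate keys give one row per API request.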

In [21]:
Venues.to_csv('VenuesFromFoursquare.csv') # save the data so could skip the requesting part if rerun the code
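
On a rerun, the saved file can stand in for the API requests. The sketch below shows one way to do that, assuming a hypothetical helper `load_or_fetch` and a stand-in `fake_fetch` in place of the real Foursquare call. Writing with `index=False` avoids an extra unnamed index column when the file is read back.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(path, fetch):
    """Read the cached csv at `path` if it exists; otherwise call fetch() and cache the result."""
    if os.path.exists(path):
        return pd.read_csv(path)
    data = fetch()
    data.to_csv(path, index=False)  # no index column, so the reload has the same columns
    return data


# Stand-in for the expensive API step; records how often it is called.
fetch_calls = []


def fake_fetch():
    fetch_calls.append(1)
    return pd.DataFrame({"Venue": ["Shell"], "Venue Category": ["Gas Station"]})


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "venues_cache.csv")
    first = load_or_fetch(cache_path, fake_fetch)   # fetches and writes the cache
    second = load_or_fetch(cache_path, fake_fetch)  # reads the cache, no fetch

print(len(fetch_calls), list(second.columns))
```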

### Methodology

### Results

### Discussions

### Conclusions