# Capstone Project - The Battle of Neighborhoods (Week 2)

# Introduction/Business Problem section:  

This project was manly to assist people intending to or already exploring the best of the two major cities between New York City and Toronto, based on facilities available in their neighborhoods, using location data to make better and informed decisions on selecting the best neighborhoods. This is informed by the fact that such people are moving to and from different locations and would like to make decisions on whther to settle in Toronto or New York and have the need to explore and research for great locations to settle their families, based on factors including best schools locations, hospitals, malls, amongst other amenities.

The aim of this project is hence to develop an analysis of main determining features features for migrations to one of the either major cities of Toronto, Ontario (Canada) or New York City (USA), creating a better awareness of the neighborhood amenities for the end user, before moving.

Achieving this purpose necessitates the comparison of the neighborhoods of the two major cities and determine how similar or dissimilar they are to each other.

What would be compared in the two major cities include but not limited to the neighborhood types (which of the two is well defined and uniform.

Toronto remains a popular migration destination in Canada, located in the province of Ontario in Canada. Having attracted different groups from different walks of life, Toronto is home to diversity and multicultural nature of population make up. The main Neighborhood of interest in Toronto is Scaraborough. 

New York City on the other hand also represents a diverse and multicultural make up and would be easily assumed to be similar to Toronto, but dissimilarities exist within the two major cities hence the need for this comparison. The main neighborhood of interest in New York is Manhattan. 

# Data Description:

The crux of this project is based on the analysis of the boroughs and neighborhoods in both Toronto and New York Cities, to help properly segment the neighborhoods and explore them. We therefore essentially needed a dataset that contains such boroughs and neighborhoods that exist in each city as well as the the latitude and logitude coordinates of each neighborhood. 

The Foursquare API provides the prime data gathering source as it has a database of millions of places, especially their places API which provides the ability to perform location search, location sharing and details about a business.

The New York City has a total of 5 boroughs and 306 neighborhoods. In order to segement the neighborhoods and explore them, we will essentially need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough as well as the the latitude and logitude coordinates of each neighborhood. Luckily, this dataset exists for free on the web. Feel free to try to find this dataset on your own, but here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

Unlike in the case of New York, the neighborhood data for Scarborough, Toronto is not readily available on the internet, which presents an interesting fact about the field of data science that each project can be challenging in its unique way hence the need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.The main data source available for this project is the Wikipedia page from the link: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

# Use of the Foursquare API:

To achieve the ends of this project, there's the need to gain indepth data about the different neighborhoods of New York and Toronto cities with the choice of data source being the Foursquare's API location data.

This is informed by the fact that Foursquare's API has an indepth data about locations which include, but not limited to pictures, name of venues, menus(where necessary) and locations.The output data obtained from the Foursquare API had venue information with specified distances of the longitude and latitude of the postcodes. The information obtained per venue as follows:

    Neighborhoods themselves;
    Neighborhood Latitudes;
    Neighborhood Longitudes;
    Venues and their names;
    Venue Latitudes;
    Venue Longitudes;
    Venue Categories.

The geograpical coordinate of New York City are 40.7127281, -74.0060152 while the geograpical coordinate of Toronto city are 43.6534817, -79.3839347. 

# Python Libraries & Models to be used:

To achieve the ends of this project, the following Python Modules would be used:
 1. Beautiful Soup and Requests: For web scrapping, automation & handling HTTP requests;
 2. Pandas: For creating and manipulating dataframes;
 3. Matplotlib: ForCharts and data Plotting;
 4. Folium: For visualiing the neighborhoods cluster distribution of using interactive leaflet map;
 5. Scikit Learn: For importing k-means clustering;
 6. Geocoder: For retrieving Location Data;
 7. XML: To separate data from presentation and XML stores data in plain text format;
 8. JSON: For Handling JSON files

In [1]:
import numpy as np 

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json 

from geopy.geocoders import Nominatim 

import requests
from bs4 import BeautifulSoup
from pandas.io.json import json_normalize 

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium 

print('Libraries imported.')

Libraries imported.


# K-Means Clustering Approach:

There exisists different possible clustering models for clustering but for purposes of this project, we  intend to  present the model that is considered the one of the simplest model among them which is the K-Means Clustering Approach. Despite its simplicity, this approach is vastly used for clustering in many data science applications, especially useful if you need to quickly discover insights from unlabeled data

# Methodology:

## 1. We'll import the boroughs and neighborhood list of Toronto from Wikipedia and convert it to data frame using pandas package in python. 

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
get = requests.get(url).text
bSoup = BeautifulSoup(get, 'lxml')

In [3]:
#Find table within the Scrap output
table = bSoup.find('table')

In [4]:
columnNames = ['Postalcode','Borough','Neighborhood']
df = pd.DataFrame(columns = columnNames)

In [5]:
# Iterate through the table to find all the Postcodes, Boroughs, Neighborhoods
for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==3:
        df.loc[len(df)] = row_data 

#Show the first 5 rows to ensure correct findings
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [6]:
#Next step is to remove the rows where Borough is "Not Assigned", then show first 10 rows

iNames = df[ df['Borough'] =='Not assigned'].index
df.drop(iNames , inplace=True)

df.head(10)

Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [7]:
##If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

df.loc[df['Neighborhood'] =='Not assigned' , 'Neighborhood'] = df['Borough']
df.head(10)

Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


## 2. Another data set comprised of location data of neighborhood and boroughs will be imported in .csv format and then converted to data frame. 

In [8]:
##Get Geographical Coordinates of each neighborhoods
df_geo = pd.read_csv('https://cocl.us/Geospatial_data')
df_geo.columns=['Postalcode','Latitude','Longitude']
df_geo.head()

Unnamed: 0,Postalcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
##Now to combine two rowsvlisted twice into one row with the neighborhoods separated with a comma
result = df.groupby(['Postalcode','Borough'], sort=False).agg( ', '.join)
new_df = result.reset_index()
new_df.head(10)

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


## 3. After Cleaning the data set, two tables will be merged to get the final Toronto neighborhood data set.

In [10]:
df_Toronto = pd.merge(new_df,df_geo[['Postalcode','Latitude', 'Longitude']],on='Postalcode')

## 4. The Geo location data of New York will be imported in .json format. 

In [11]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [12]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

## 5. Then the New York Neighborhoods, Boroughs and their corresponding latitude and longitude will be filtered out and converted to a data frame.

In [13]:
neighborhoods_data = newyork_data['features']

In [14]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [15]:
# Loop through the data and fill the dataframe one row at a time.
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [16]:
#Make sure that the dataset has all 5 boroughs and 306 neighborhoods.
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


## 6. Having done this, with the output data of the neighborhood location data for each city, using Foursquare API, all venues data will be imported into two data frame for each neighborhood in Toronto and New York.

In [17]:
# defining radius and limit of venues to get
radius=500
LIMIT=100

In [18]:
CLIENT_ID = "EBFPSQTL2LUC2YY5BCCQX5CGNOP0FRHZAIHU2CVVT2UA3AY2"
CLIENT_SECRET = "ZPOHR0ANRP10FQJ0PZ2TFEFH1YAJM3RAMJL5QBQ1XSYRRXLC"
VERSION = '20180605'

def getNearbyVenues(names, latitudes, longitudes, radius, LIMIT):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

 # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

    return(nearby_venues)

In [19]:
toronto_venues = getNearbyVenues(names = df_Toronto['Neighborhood'], latitudes = df_Toronto['Latitude'],
                                 longitudes = df_Toronto['Longitude'], 
                                 radius=500, LIMIT=100
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [20]:
toronto_venues.head(10)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Parkwoods,43.753259,-79.329656,Corrosion Service Company Limited,43.752432,-79.334661,Construction & Landscaping
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
5,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
6,Victoria Village,43.725882,-79.315572,The Frig,43.727051,-79.317418,French Restaurant
7,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
8,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
9,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot


In [21]:
toronto_venues.shape

(2129, 7)

In [22]:
#Now for New York
New_York_Venues = getNearbyVenues(names = neighborhoods['Neighborhood'], latitudes = neighborhoods['Latitude'],
                                 longitudes = neighborhoods['Longitude'], 
                                 radius=500, LIMIT=100
                                  )

Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend


KeyError: 'groups'

In [None]:
New_York_Venues.head(10)

In [None]:
New_York_Venues.shape

## 7. Using the Sci-kit learn module of Python, We'll then introduce the K-Means Clustering model, taking clusters of 3 for both cities, and the labelled neighborhood data will be plotted in a map using folium package.

In [24]:
where = "Toronto, ON"

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(where)
tr_latitude = location.latitude
tr_longitude = location.longitude
print('The geograpical coordinate of Toronto city are {}, {}.'.format(tr_latitude, tr_longitude))

The geograpical coordinate of Toronto city are 43.6534817, -79.3839347.


In [31]:
#Create Toronto map based on the above latitude and longitude values
Toronto_map = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

for lat,lng,borough,neighbourhood in zip(df_Toronto['Latitude'],df_Toronto['Longitude'],df_Toronto['Borough'],df_Toronto['Neighborhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
    [lat,lng],
    radius=5,
    popup=label,
    color='red',
    fill=True,
    fill_color='#3186cc',
    fill_opacity=0.7,
    parse_html=False).add_to(Toronto_map)
       
Toronto_map

In [33]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 40.7127281, -74.0060152.


In [35]:
# create map of New York using latitude and longitude values
NewYork_Map = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(NewYork_Map)  
    
NewYork_Map

# Using KMeans clustering for the clsutering of the neighbourhoods

### Toronto

In [39]:
k=5
toronto_clustering = df_Toronto.drop(['Postalcode','Borough','Neighborhood'],1)
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_clustering)
kmeans.labels_
df_Toronto.insert(0, 'Cluster Labels', kmeans.labels_)

ValueError: cannot insert Cluster Labels, already exists

In [40]:
df_Toronto.head()

Unnamed: 0,Cluster Labels,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,4,M3A,North York,Parkwoods,43.753259,-79.329656
1,4,M4A,North York,Victoria Village,43.725882,-79.315572
2,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,0,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,2,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [47]:
# create Toronto cluster map
toronto_map_clusters = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(df_Toronto['Latitude'], df_Toronto['Longitude'], df_Toronto['Neighborhood'], df_Toronto['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(toronto_map_clusters)
       
toronto_map_clusters

### New York

In [44]:
k=5
newyork_clustering = neighborhoods.drop(['Borough','Neighborhood'],1)
kmeans = KMeans(n_clusters = k,random_state=0).fit(newyork_clustering)
kmeans.labels_
neighborhoods.insert(0, 'Cluster Labels', kmeans.labels_)

In [45]:
neighborhoods.head()

Unnamed: 0,Cluster Labels,Borough,Neighborhood,Latitude,Longitude
0,4,Bronx,Wakefield,40.894705,-73.847201
1,4,Bronx,Co-op City,40.874294,-73.829939
2,4,Bronx,Eastchester,40.887556,-73.827806
3,4,Bronx,Fieldston,40.895437,-73.905643
4,4,Bronx,Riverdale,40.890834,-73.912585


In [49]:
# create New York cluster map
newyork_map_clusters = folium.Map(location=[40.7127281, -74.0060152],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhood'], df_Toronto['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(newyork_map_clusters)
       
newyork_map_clusters

# Results:

It can be seen that Toronto has five big clusters (abput 25% of the neighborhoods) and a smaller onewith no insignificant clusters compared to them. These are based on Boroughs, Neighborhoods and Postal codes.


For New York, there is one big (83%) and two mid size clusters. Other two clusters are insignificant compared to them. 

# Conclusion

In conclusion, with regard to neighborhood types, Toronto seems to have more defined and uniform neighborhood types while  New York has much more varieties. 

### Thank you.