# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
1. [Introduction: Business Problem](#Intro)
2. [Data](#Data)
 - [Toronto Data](#Toronto)
 - [New York's Boroughs Data](#NewYork)
3. [Methodology](#Methodology)
4. [Analysis](#Analysis)
5. [Results and Discussion](#Results)
6. [Conclusion](#Conclusion)

## Introduction: Business Problem <a name="Intro"></a>

In this project we will try to find an optimal location for a business moving from Toronto to New York. Specifically, this report is aimed at stakeholders interested in targeting New York neighborhoods that are similar to the ones they target in Toronto.

Since New York has many neighborhoods, we will split the city into its 5 boroughs and analyze three of them.

We will use data science methods to identify the most promising borough based on its similarity to Toronto. The advantages of each borough will then be clearly expressed so that the best possible final location can be chosen by stakeholders.

## Data <a name="Data"></a>

Based on the definition of our problem, we need neighborhood data for both New York and Toronto.

We decided to collect the names and venues of New York and Toronto neighborhoods.

The following data sources will be needed to extract or generate the required information:
* New York's data will be collected from the same source used by the "Segmenting and Clustering Neighborhoods in New York" notebook, a dataset of New York's neighborhoods with their geographical coordinates
* For Toronto's data we will scrape the Wikipedia page of Canadian postal codes and join it with a **csv file** of geographical coordinates
* For each neighborhood we will get the venues within a radius of 500 meters (up to 100 per neighborhood) using the **Foursquare API**
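As a minimal sketch of the venue request described in the last bullet, the query string can be built from a dictionary of parameters. The credential values below are placeholders (an assumption for illustration, not real keys), and the coordinates are those of the first Toronto neighborhood in this notebook.

```python
from urllib.parse import urlencode

# Placeholder credentials: hypothetical values, replace with your own keys.
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20180605",                 # API version date used in this notebook
    "ll": "43.65426,-79.360636",     # latitude,longitude of the neighborhood
    "radius": 500,                   # search radius in meters
    "limit": 100,                    # maximum number of venues returned
}

# urlencode escapes each value, so no manual string formatting is needed.
explore_url = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)
print(explore_url)
```

Building the query from a dictionary keeps each parameter named, which is easier to audit than a long positional `format` call.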

### Dependencies used throughout the notebook

#### Libraries

In [1]:
import requests
import pandas as pd
import numpy as np
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans

print("Done")

Done


#### Foursquare credentials and version.

In [2]:
CLIENT_ID = '3VJRRQMVR2F1O1NWGVQIYHQAOULIJ0XCQ4E4MH3V4AQCWVQ5'
CLIENT_SECRET = 'FZFDB4EM3JI4KQQNPZAXGZUKQA3FSJGOLUL1MM3YCQRLLLAG'
VERSION = '20180605'
LIMIT = 100

print('My credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

My credentials:
CLIENT_ID: 3VJRRQMVR2F1O1NWGVQIYHQAOULIJ0XCQ4E4MH3V4AQCWVQ5
CLIENT_SECRET:FZFDB4EM3JI4KQQNPZAXGZUKQA3FSJGOLUL1MM3YCQRLLLAG


#### Functions

A function to get the nearby venues within a radius of 500 meters for a given location

In [3]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    empty_indexes=[]
    for i, name, lat, lng in zip(range(len(names)), names, latitudes, longitudes):
        print(name)
            
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        if(venues_list[-1] == []):
            empty_indexes.append(i)

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues, empty_indexes

A function to get the geographical coordinates of a given address

In [4]:
def getCoordinate(address):
    geolocator = Nominatim(user_agent="explorer")
    location = geolocator.geocode(address)
    print('The geographical coordinate of {} are {}, {}.'.format(address, location.latitude, location.longitude))
    return location.latitude, location.longitude

A function to make a folium map for a given location with a given color for the markers

In [5]:
def drawMap(latitude, longitude, data, cir_color):
    Map = folium.Map(location=[latitude, longitude], zoom_start=11)

    for lat, lng, label in zip(data['Latitude'], data['Longitude'], data['Neighborhood']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='dark'+cir_color,
            fill=True,
            fill_color=cir_color,
            fill_opacity=0.7,
            parse_html=False).add_to(Map)  
    
    return Map

A function to move the last column of a dataframe to a given position

In [6]:
def fixLastColumn(data, new_ind):
    old_ind = data.shape[1]-1
    fixed_columns = list(data.columns[:new_ind]) + [data.columns[old_ind]] + list(data.columns[new_ind:old_ind]) + list(data.columns[old_ind+1:])
    return data[fixed_columns]
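To illustrate what this reordering does, here is a small sketch on a toy dataframe. The helper name `move_last_column` and the toy data are assumptions made for the example; the logic is meant to be equivalent to `fixLastColumn` above.

```python
import pandas as pd

def move_last_column(data, new_ind):
    """Return `data` with its last column moved to position `new_ind`."""
    cols = list(data.columns)
    last = cols.pop()           # take the last column name out of the list
    cols.insert(new_ind, last)  # put it back at the requested position
    return data[cols]

# Toy one-hot table where 'Neighborhood' was added last.
toy = pd.DataFrame({"Pizza Place": [1, 0],
                    "Park": [0, 1],
                    "Neighborhood": ["A", "B"]})

reordered = move_last_column(toy, 0)
print(list(reordered.columns))
```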

A function to get the frequency of each venue category in every neighborhood

In [7]:
def getGrouped(venues):
    data_onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
    data_onehot['Neighborhood'] = venues['Neighborhood'] 

    data_onehot = fixLastColumn(data_onehot,0)
    
    data_grouped = data_onehot.groupby('Neighborhood').mean().reset_index()
    return data_grouped
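The frequency computation can be sketched on toy data as follows. The venue rows are invented for illustration; the idea is that the mean of a one-hot column within a neighborhood equals the share of that neighborhood's venues in that category.

```python
import pandas as pd

# Toy venue list: neighborhood A has 4 venues, neighborhood B has 2.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Gym", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Averaging the indicators per neighborhood yields category frequencies.
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with very different venue counts remain comparable.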

A function to calculate the data of a given New York borough

In [8]:
def boroughCal(ny_data, borough, color):
    borough_data = ny_data[ny_data["Borough"] == borough].reset_index(drop=True)
    address = borough + ', NY'
    borough_latitude, borough_longitude = getCoordinate(address)
    map_borough = drawMap(borough_latitude,
                        borough_longitude,
                        borough_data,
                        color)
    return borough_data, map_borough

A function to retrieve the most common venues

In [9]:
def mostCommonVenues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

A function to get neighborhoods most common venues data combined with locational data

In [10]:
def topVenues(venues_data, num_top_venues=10):
    indicators = ['st', 'nd', 'rd']
    columns = ['Neighborhood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except IndexError:
            columns.append('{}th Most Common Venue'.format(ind+1))

    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Neighborhood'] = venues_data['Neighborhood']

    for ind in np.arange(venues_data.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = mostCommonVenues(venues_data.iloc[ind, :], num_top_venues)

    return neighborhoods_venues_sorted
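As a rough illustration of the ranking step, the sketch below sorts each neighborhood's category frequencies and keeps the top entries. The frequency values and the helper name `top_categories` are assumptions made for this example.

```python
import pandas as pd

# Toy frequency table indexed by neighborhood.
freq = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Coffee Shop": [0.50, 0.10],
    "Park": [0.30, 0.60],
    "Gym": [0.20, 0.30],
}).set_index("Neighborhood")

def top_categories(row, n):
    """Return the names of the n most frequent categories in one row."""
    return row.sort_values(ascending=False).index[:n].tolist()

# Rank the categories of every neighborhood and keep the top two.
top2 = {name: top_categories(row, 2) for name, row in freq.iterrows()}
print(top2)
```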

A function to count the number of a given New York borough's and Toronto's neighborhoods in a given cluster

In [11]:
def clusterNumber(i, ny_borough, cluster_merged):
    cluster_citys = cluster_merged[cluster_merged["Cluster Labels"] == i].reset_index(drop=True)
    
    num_ny_borough = cluster_citys[cluster_citys["City"] == ny_borough].shape[0]
    num_toronto = cluster_citys[cluster_citys["City"] == "Toronto"].shape[0]
    return {"Cluster":i, ny_borough:num_ny_borough, "Toronto":num_toronto}

A function to return the neighborhoods of a given New York borough and Toronto in a given cluster

In [12]:
def clusterNeighborhoods(i, ny_borough, cluster_merged):
    cluster_citys = cluster_merged[cluster_merged["Cluster Labels"] == i].reset_index(drop=True)
    
    ny_borough_cluster_neighborhoods = cluster_citys[cluster_citys["City"] == ny_borough]
    toronto_cluster_neighborhoods = cluster_citys[cluster_citys["City"] == "Toronto"]
    result = pd.DataFrame.transpose(pd.DataFrame(data = [list(toronto_cluster_neighborhoods["Neighborhood"]), list(ny_borough_cluster_neighborhoods["Neighborhood"])]))
    result = result.replace(np.nan,"")
    result.columns = ["Toronto", ny_borough]
    return result

A function to cluster the data of Toronto and a given New York borough

In [13]:
def clustering(toronto_data, ny_borough_data,
               toronto_grouped, ny_borough_grouped,
               toronto_neighborhoods_venues_sorted, ny_borough_neighborhoods_venues_sorted,
               ny_borough, kclusters=3):
    city_column = [ny_borough for x in range(ny_borough_grouped.shape[0])]
    for x in range(toronto_grouped.shape[0]):
        city_column.append("Toronto")
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
    cluster_grouped = pd.concat([ny_borough_grouped, toronto_grouped]).reset_index(drop=True)
    # A venue category missing from one city has frequency 0 there.
    cluster_grouped = cluster_grouped.replace(np.nan, 0)
    
    cluster_neighborhoods_venues_sorted = pd.concat(
        [ny_borough_neighborhoods_venues_sorted, toronto_neighborhoods_venues_sorted]
    ).reset_index(drop=True)
    
    cluster_data = pd.concat([ny_borough_data, toronto_data]).reset_index(drop=True)
    
    clustering_data = cluster_grouped.drop('Neighborhood', axis=1)

    kmeans = KMeans(init="k-means++", n_clusters=kclusters, random_state=0).fit(clustering_data)
    cluster_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
    cluster_merged = cluster_data
    cluster_merged = cluster_merged.join(cluster_neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

    cluster_merged["City"] = city_column
    cluster_merged = fixLastColumn(cluster_merged,1)

    # Build the per-cluster counts in one step instead of appending row by row.
    cluster_res = pd.DataFrame(
        [clusterNumber(i, ny_borough, cluster_merged) for i in range(kclusters)],
        columns=["Cluster", ny_borough, "Toronto"])
    cluster_res = cluster_res.set_index('Cluster')
    return cluster_merged, cluster_res
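The core of this function is stacking two frequency tables whose category columns differ and then clustering the rows. A minimal sketch on invented data (the neighborhood names and frequencies are assumptions, and 2 clusters are used only to keep the toy case readable):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frequency tables; 'Coffee Shop' only appears in the Toronto table.
ny_grouped = pd.DataFrame({
    "Neighborhood": ["NY-A", "NY-B"],
    "Pizza Place": [0.9, 0.1],
    "Park": [0.1, 0.9],
})
toronto_grouped = pd.DataFrame({
    "Neighborhood": ["TO-A", "TO-B"],
    "Pizza Place": [0.8, 0.0],
    "Park": [0.0, 0.8],
    "Coffee Shop": [0.2, 0.2],
})

# Stack the tables; categories absent from one city become NaN, i.e. frequency 0.
combined = pd.concat([ny_grouped, toronto_grouped], ignore_index=True).fillna(0)

# Cluster on the numeric frequency columns only.
features = combined.drop(columns="Neighborhood")
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

combined.insert(0, "Cluster Labels", labels)
print(combined[["Cluster Labels", "Neighborhood"]])
```

In this toy case the pizza-heavy neighborhoods of both cities land in one cluster and the park-heavy ones in the other, which is the kind of cross-city grouping the analysis looks for.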

## Toronto Data <a name="Toronto"></a>

Web Scraping to get the data

In [14]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup = BeautifulSoup(page,"html.parser")

Fixing and cleaning the data

In [15]:
table_contents = []
table = soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

toronto_neighborhoods = pd.DataFrame(table_contents)
toronto_neighborhoods['Borough'] = toronto_neighborhoods['Borough'].replace({
                                                        'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                                       'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                                       'EtobicokeNorthwest':'Etobicoke Northwest',
                                                       'East YorkEast Toronto':'East York/East Toronto',
                                                       'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

df_geo = pd.read_csv('https://cocl.us/Geospatial_data',index_col = "Postal Code")

toronto_neighborhoods = toronto_neighborhoods.join(df_geo, on='PostalCode')
toronto_neighborhoods = toronto_neighborhoods.drop('PostalCode', axis = 1)

toronto_data = toronto_neighborhoods[["Toronto" in x for x in toronto_neighborhoods["Borough"]]].reset_index(drop=True)
toronto_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 1 | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 2 | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 3 | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 4 | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
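The string handling in the cleaning cell can be hard to follow, so here is a small sketch of the same split on a single cell. The sample string is an assumption about what the scraped cell text looks like, and `split_cell` is a hypothetical helper, not part of the notebook.

```python
def split_cell(span_text):
    """Split 'Borough(Neighborhood / Neighborhood)' into its two parts."""
    # partition never raises, even when there is no '(' in the text.
    borough, _, rest = span_text.partition("(")
    neighborhood = rest.rstrip(")").replace(" /", ",").strip()
    return borough.strip(), neighborhood

print(split_cell("Downtown Toronto(Regent Park / Harbourfront)"))
print(split_cell("Not assigned"))
```

Using `partition` instead of `split('(')[1]` means a cell without parentheses yields an empty neighborhood rather than an `IndexError`.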


In [16]:
address = 'toronto, Canada'

toronto_latitude, toronto_longitude = getCoordinate(address)

The geographical coordinate of toronto, Canada are 43.6534817, -79.3839347.


In [17]:
map_toronto = drawMap(toronto_latitude,
                      toronto_longitude,
                      toronto_data,
                      'red')

map_toronto

#### Toronto Venues

In [18]:
toronto_venues, toronto_empty_indexes = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

toronto_data = toronto_data.drop(toronto_empty_indexes).reset_index()

Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
The Danforth  East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Enclave of M5E
St. James Town, Cabbagetown
First Canadi

In [19]:
toronto_grouped = getGrouped(toronto_venues)
toronto_grouped.head()

| | Neighborhood | Yoga Studio | Adult Boutique | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | ... | Theme Restaurant | Tibetan Restaurant | Toy / Game Store | Trail | Train Station | Vegetarian / Vegan Restaurant | Video Game Store | Vietnamese Restaurant | Wine Bar | Wings Joint |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Berczy Park | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.017241 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Brockton, Parkdale Village, Exhibition Place | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | CN Tower, King and Spadina, Railway Lands, Har... | 0.0 | 0.0 | 0.058824 | 0.058824 | 0.058824 | 0.117647 | 0.176471 | 0.117647 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Central Bay Street | 0.015873 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.015873 | 0.0 | 0.0 | 0.015873 | 0.0 |
| 4 | Christie | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [20]:
toronto_neighborhoods_venues_sorted = topVenues(toronto_grouped)
toronto_neighborhoods_venues_sorted.head()

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Berczy Park | Coffee Shop | Cocktail Bar | Bakery | Farmers Market | Pharmacy | Restaurant | Beer Bar | Cheese Shop | Seafood Restaurant | Belgian Restaurant |
| 1 | Brockton, Parkdale Village, Exhibition Place | Café | Coffee Shop | Breakfast Spot | Nightclub | Gym | Stadium | Burrito Place | Restaurant | Climbing Gym | Pet Store |
| 2 | CN Tower, King and Spadina, Railway Lands, Har... | Airport Service | Airport Lounge | Airport Terminal | Boat or Ferry | Harbor / Marina | Boutique | Rental Car Location | Bar | Plane | Sculpture Garden |
| 3 | Central Bay Street | Coffee Shop | Sandwich Place | Italian Restaurant | Café | Middle Eastern Restaurant | Salad Place | Bubble Tea Shop | Burger Joint | Yoga Studio | Ramen Restaurant |
| 4 | Christie | Grocery Store | Café | Park | Baby Store | Candy Store | Restaurant | Italian Restaurant | Athletics & Sports | Coffee Shop | Nightclub |


## New York's Boroughs Data <a name="NewYork"></a>

Get the data from a JSON file

In [21]:
url ="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json"
json_txt = requests.get(url).json()
json_txt

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

Fixing and cleaning the data

In [22]:
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect one record per feature, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0).
rows = []
for data in json_txt['features']:
    # GeoJSON points store coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    rows.append({'Borough': data['properties']['borough'],
                 'Neighborhood': data['properties']['name'],
                 'Latitude': neighborhood_latlon[1],
                 'Longitude': neighborhood_latlon[0]})

ny_neighborhoods = pd.DataFrame(rows, columns=column_names)

In [23]:
for borough in set(ny_neighborhoods["Borough"]):
    print(borough, ny_neighborhoods[ny_neighborhoods["Borough"] == borough].shape)

Brooklyn (70, 4)
Manhattan (40, 4)
Bronx (52, 4)
Staten Island (63, 4)
Queens (81, 4)


We will use the three boroughs with the fewest neighborhoods: Manhattan, Bronx and Staten Island.
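This selection can also be made programmatically. A small sketch, using the neighborhood counts printed above:

```python
import pandas as pd

# Neighborhood counts per borough, taken from the output above.
counts = pd.Series({"Brooklyn": 70, "Manhattan": 40, "Bronx": 52,
                    "Staten Island": 63, "Queens": 81})

# Keep the three boroughs with the fewest neighborhoods.
selected = counts.nsmallest(3).index.tolist()
print(selected)
```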

### Manhattan Data

In [24]:
manhattan_data, map_manhattan = boroughCal(ny_neighborhoods, "Manhattan", "Blue")
manhattan_data.head()

The geographical coordinate of Manhattan, NY are 40.7896239, -73.9598939.


| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


In [25]:
map_manhattan

#### Manhattan Venues

In [26]:
manhattan_venues, manhattan_empty_indexes = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

manhattan_data = manhattan_data.drop(manhattan_empty_indexes).reset_index()

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


In [27]:
manhattan_grouped = getGrouped(manhattan_venues)
manhattan_neighborhoods_venues_sorted = topVenues(manhattan_grouped)
manhattan_neighborhoods_venues_sorted.head()

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Battery Park City | Park | Coffee Shop | Hotel | Boat or Ferry | Gym | Memorial Site | Playground | Shopping Mall | Clothing Store | BBQ Joint |
| 1 | Carnegie Hill | Coffee Shop | Café | Yoga Studio | Grocery Store | Bookstore | French Restaurant | Pizza Place | Gym | Wine Shop | Bar |
| 2 | Central Harlem | Seafood Restaurant | Chinese Restaurant | Fried Chicken Joint | African Restaurant | American Restaurant | French Restaurant | Bar | Gym / Fitness Center | Public Art | Grocery Store |
| 3 | Chelsea | Coffee Shop | Art Gallery | Bakery | American Restaurant | French Restaurant | Italian Restaurant | Ice Cream Shop | Seafood Restaurant | Cycle Studio | Thai Restaurant |
| 4 | Chinatown | Chinese Restaurant | Bakery | Cocktail Bar | Hotpot Restaurant | American Restaurant | Dessert Shop | Optical Shop | Spa | Salon / Barbershop | Asian Restaurant |


### Bronx Data

In [28]:
bronx_data, map_bronx = boroughCal(ny_neighborhoods, "Bronx", "Green")
map_bronx

The geographical coordinate of Bronx, NY are 40.8466508, -73.8785937.


#### Bronx venues

In [30]:
bronx_venues, bronx_empty_indexes = getNearbyVenues(names=bronx_data['Neighborhood'],
                                   latitudes=bronx_data['Latitude'],
                                   longitudes=bronx_data['Longitude']
                                  )

bronx_data = bronx_data.drop(bronx_empty_indexes).reset_index()

Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Claremont Village
Concourse Village
Mount Eden
Mount Hope
Bronxdale
Allerton
Kingsbridge Heights


In [31]:
bronx_grouped = getGrouped(bronx_venues)
bronx_neighborhoods_venues_sorted = topVenues(bronx_grouped)
bronx_neighborhoods_venues_sorted.head()

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allerton | Pizza Place | Supermarket | Discount Store | Bus Station | Deli / Bodega | Spa | Check Cashing Service | Electronics Store | Pharmacy | Donut Shop |
| 1 | Baychester | Bank | Donut Shop | Mattress Store | Burger Joint | Mexican Restaurant | Bus Station | Spanish Restaurant | Supermarket | Electronics Store | Sandwich Place |
| 2 | Bedford Park | Chinese Restaurant | Pizza Place | Diner | Mexican Restaurant | Sandwich Place | Deli / Bodega | Food Truck | Donut Shop | Spanish Restaurant | Discount Store |
| 3 | Belmont | Italian Restaurant | Deli / Bodega | Pizza Place | Bakery | Dessert Shop | Bank | Grocery Store | Sandwich Place | Donut Shop | Fish Market |
| 4 | Bronxdale | Chinese Restaurant | Spanish Restaurant | Gym | Pizza Place | Mexican Restaurant | Eastern European Restaurant | Bank | Breakfast Spot | Italian Restaurant | Convenience Store |


### Staten Island Data

In [32]:
staten_island_data, map_staten_island = boroughCal(ny_neighborhoods, "Staten Island", "Orange")

map_staten_island

The geographical coordinate of Staten Island, NY are 40.5834557, -74.1496048.


#### Staten Island venues

In [33]:
staten_island_venues, staten_island_empty_indexes = getNearbyVenues(names=staten_island_data['Neighborhood'],
                                   latitudes=staten_island_data['Latitude'],
                                   longitudes=staten_island_data['Longitude']
                                  )

staten_island_data = staten_island_data.drop(staten_island_empty_indexes).reset_index()

St. George
New Brighton
Stapleton
Rosebank
West Brighton
Grymes Hill
Todt Hill
South Beach
Port Richmond
Mariner's Harbor
Port Ivory
Castleton Corners
New Springville
Travis
New Dorp
Oakwood
Great Kills
Eltingville
Annadale
Woodrow
Tottenville
Tompkinsville
Silver Lake
Sunnyside
Park Hill
Westerleigh
Graniteville
Arlington
Arrochar
Grasmere
Old Town
Dongan Hills
Midland Beach
Grant City
New Dorp Beach
Bay Terrace
Huguenot
Pleasant Plains
Butler Manor
Charleston
Rossville
Arden Heights
Greenridge
Heartland Village
Chelsea
Bloomfield
Bulls Head
Richmond Town
Shore Acres
Clifton
Concord
Emerson Hill
Randall Manor
Howland Hook
Elm Park
Manor Heights
Willowbrook
Sandy Ground
Egbertville
Prince's Bay
Lighthouse Hill
Richmond Valley
Fox Hills


In [34]:
staten_island_grouped = getGrouped(staten_island_venues)
staten_island_neighborhoods_venues_sorted = topVenues(staten_island_grouped)
staten_island_neighborhoods_venues_sorted.head()

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Annadale | Pizza Place | Restaurant | American Restaurant | Diner | Sushi Restaurant | Train Station | Liquor Store | Bar | Park | Farmers Market |
| 1 | Arden Heights | Deli / Bodega | Bus Stop | Coffee Shop | Pharmacy | Pizza Place | Yoga Studio | Farmers Market | Food & Drink Shop | Food | Flower Shop |
| 2 | Arlington | Bus Stop | Deli / Bodega | American Restaurant | Boat or Ferry | Liquor Store | Grocery Store | Fast Food Restaurant | Food Truck | Food & Drink Shop | Food |
| 3 | Arrochar | Pizza Place | Italian Restaurant | Bus Stop | Deli / Bodega | Pharmacy | Sandwich Place | Middle Eastern Restaurant | Outdoors & Recreation | Supermarket | Food Truck |
| 4 | Bay Terrace | Supermarket | Italian Restaurant | Sushi Restaurant | Home Service | Grocery Store | Playground | Donut Shop | Salon / Barbershop | Shipping Store | Insurance Office |


## Methodology <a name="Methodology"></a>

In this project we direct our efforts at detecting the New York borough that is most similar to Toronto. We limit our analysis to three boroughs.

In the first step we collected the required **data: locations and venue information within a radius of 500 meters**.

The second step in our analysis is to cluster each of the three chosen boroughs together with Toronto, using **K-Means** for the clustering process.

In the third and final step we focus on the most promising borough and view more information about the neighborhoods in that borough.

## Analysis (Clustering) <a name="Analysis"></a> 

We use **K-Means** as the clustering method and cluster the data into 3 clusters, hoping to split the data into 3 main parts:
1. Toronto neighborhoods that aren't similar to any neighborhood of the selected New York borough
2. Neighborhoods that are similar
3. Neighborhoods of the New York borough that aren't similar to any of Toronto's

In practice the analysis will not be that precise, i.e. K-Means may split the data in a different way.
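One way to check how close a clustering comes to this three-way split is to cross-tabulate cluster labels against city. The sketch below uses invented labels (an assumption for illustration only) in which cluster 0 is Toronto-only, cluster 1 is mixed, and cluster 2 is borough-only.

```python
import pandas as pd

# Toy cluster assignment for three Toronto and three Bronx neighborhoods.
merged = pd.DataFrame({
    "City": ["Toronto", "Toronto", "Toronto", "Bronx", "Bronx", "Bronx"],
    "Cluster Labels": [0, 1, 1, 1, 2, 2],
})

# Rows are clusters, columns are cities, cells are neighborhood counts.
counts = pd.crosstab(merged["Cluster Labels"], merged["City"])
print(counts)
```

A cluster with non-zero counts in both columns is a "similar" group; a zero in either column marks neighborhoods with no counterpart in the other city.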

#### Manhattan clustering

In [35]:
manhattan_cluster_data, manhattan_cluster_res = clustering(toronto_data, manhattan_data,
               toronto_grouped, manhattan_grouped,
               toronto_neighborhoods_venues_sorted, manhattan_neighborhoods_venues_sorted,
               "Manhattan")
manhattan_cluster_res

| Cluster | Manhattan | Toronto |
|---|---|---|
| 0 | 40 | 33 |
| 1 | 0 | 1 |
| 2 | 0 | 5 |


#### Bronx clustering

In [36]:
bronx_cluster_data, bronx_cluster_res = clustering(toronto_data, bronx_data,
               toronto_grouped, bronx_grouped,
               toronto_neighborhoods_venues_sorted, bronx_neighborhoods_venues_sorted,
               "Bronx")
bronx_cluster_res

| Cluster | Bronx | Toronto |
|---|---|---|
| 0 | 45 | 1 |
| 1 | 6 | 33 |
| 2 | 1 | 5 |


#### Staten Island clustering

In [37]:
staten_island_cluster_data, staten_island_cluster_res = clustering(toronto_data, staten_island_data,
               toronto_grouped, staten_island_grouped,
               toronto_neighborhoods_venues_sorted, staten_island_neighborhoods_venues_sorted,
               "Staten Island")
staten_island_cluster_res

| Cluster | Staten Island | Toronto |
|---|---|---|
| 0 | 2 | 4 |
| 1 | 39 | 35 |
| 2 | 21 | 0 |


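We found that Staten Island is the most similar to Toronto: cluster 1 groups 39 of its neighborhoods with 35 of Toronto's. We therefore view the neighborhoods in each cluster of the Toronto and Staten Island clustering. Each table lists the two cities' neighborhoods in that cluster side by side; neighborhoods in the same cluster are considered similar, but a row does not pair two specific neighborhoods.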

In [38]:
cluster_0_neighborhoods = clusterNeighborhoods(0,
                                               "Staten Island",
                                               staten_island_cluster_data)
cluster_0_neighborhoods

| | Toronto | Staten Island |
|---|---|---|
| 0 | The Danforth East | Todt Hill |
| 1 | Forest Hill North & West | Randall Manor |
| 2 | Moore Park, Summerhill East | |
| 3 | Rosedale | |


In [39]:
cluster_1_neighborhoods = clusterNeighborhoods(1,
                                               "Staten Island",
                                               staten_island_cluster_data)
cluster_1_neighborhoods

| | Toronto | Staten Island |
|---|---|---|
| 0 | Regent Park, Harbourfront | St. George |
| 1 | Garden District, Ryerson | Stapleton |
| 2 | St. James Town | Rosebank |
| 3 | The Beaches | West Brighton |
| 4 | Berczy Park | Grymes Hill |
| 5 | Central Bay Street | South Beach |
| 6 | Christie | Castleton Corners |
| 7 | Richmond, Adelaide, King | New Springville |
| 8 | Dufferin, Dovercourt Village | Travis |
| 9 | Harbourfront East, Union Station, Toronto Islands | New Dorp |


In [40]:
cluster_2_neighborhoods = clusterNeighborhoods(2,
                                               "Staten Island",
                                               staten_island_cluster_data)
cluster_2_neighborhoods

| | Toronto | Staten Island |
|---|---|---|
| 0 | | New Brighton |
| 1 | | Port Richmond |
| 2 | | Mariner's Harbor |
| 3 | | Tottenville |
| 4 | | Park Hill |
| 5 | | Arlington |
| 6 | | Arrochar |
| 7 | | Grasmere |
| 8 | | New Dorp Beach |
| 9 | | Butler Manor |


## Results and Discussion <a name="Results"></a>

Our analysis shows that, although the three boroughs we studied are all similar to Toronto, one of them, Staten Island, is the most similar, so we focused our attention on it.

For Staten Island we then provided more information about its neighborhoods and listed, cluster by cluster, the neighborhoods that are grouped together with Toronto neighborhoods.

The result is a list of Toronto's neighborhoods and the similar ones in Staten Island. This, of course, does not imply that those neighborhoods are actually optimal locations for a business moving from Toronto to New York. The purpose of this analysis was only to provide information on which neighborhoods are similar. The recommended neighborhoods should be considered only as a starting point for a more detailed analysis, which could eventually lead to a location that also takes other factors into account and meets all other relevant conditions.

## Conclusion <a name="Conclusion"></a>

The purpose of this project was to find an optimal location for a business moving from Toronto to New York, in order to help stakeholders narrow down the search for New York boroughs to target. By using Foursquare to gather data and clustering Toronto's data with each borough's data, we first identified the borough that is most similar to Toronto (Staten Island) and then listed the similar neighborhoods to be used as starting points for final exploration by stakeholders.

The final decision on the optimal neighborhood to target will be made by stakeholders, based on the specific characteristics of the neighborhoods and of the locations in every recommended neighborhood.