# Moving from Toronto to New York City

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we tackle a real-world problem faced by the people in a company who are responsible for hiring staff.

Imagine that you live in Toronto and receive a job offer in Manhattan, New York. You are unsure whether to accept, because you are very comfortable in your neighborhood and doubt that you can find one there that suits both you and your family.

In this project, we will try to build a model that addresses this problem: for any neighborhood in Toronto, it finds one in Manhattan with similar characteristics (or as similar as possible), such as parks, cafes, and restaurants. This should make it easier to convince that person to come and work for your company.

To do this, we will first obtain a list of neighborhoods in Toronto, choose the one we want to compare with those in Manhattan, run the comparison, and analyze the results.

The objective is to find the group of Manhattan neighborhoods that best matches the characteristics of the one we selected in Toronto. For this we will use the k-means clustering technique.

This way, we will end up with more than one Manhattan neighborhood compatible with the one chosen in Toronto. That is actually useful for our research, because a decision like this also depends on factors we have not taken into account, such as where the new office or the children's nursery school would be.

## Data <a name="data"></a>

Based on the definition of our problem, we will essentially need:
- A dataset containing the list of Toronto's neighborhoods: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M . We will use this dataset to explore Toronto's neighborhoods and to choose the one where we currently live. To do so, we will transform this data into a pandas dataframe.
    We will also need the coordinates of each neighborhood so that we can plot them on a map and make our study more visual. For this we will use the following dataset: https://cocl.us/Geospatial_data


- A dataset containing the list of New York City's neighborhoods and their coordinates: https://geo.nyu.edu/catalog/nyu_2451_34572. We will have to extract Manhattan's neighborhoods from this dataset.


- Data about what we are most likely to find around a given area: restaurants, parks, Italian food, etc. To get that information we will use the **Foursquare API**.

Finally, we will use the k-means clustering technique to group all of Manhattan's neighborhoods into 5 clusters and find the ones most suitable for us.
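
To make the clustering idea concrete, here is a minimal sketch of how venue data could be turned into per-neighborhood profiles and then clustered. The `venues` table, its column names and the neighborhood labels `A` to `D` are hypothetical placeholders, not the actual Foursquare response, and the toy data only supports 2 clusters rather than the 5 planned for Manhattan.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue listing: one row per venue found near a neighborhood.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Category': ['Park', 'Cafe', 'Park', 'Cafe',
                 'Bar', 'Nightclub', 'Bar', 'Nightclub'],
})

# One-hot encode the categories, then average per neighborhood to get
# the share of each category among that neighborhood's venues.
onehot = pd.get_dummies(venues['Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
profiles = onehot.groupby('Neighborhood').mean()

# Cluster the profiles. n_clusters=2 fits this toy data; the project
# itself plans to use 5 clusters on the real Manhattan data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
profiles['Cluster'] = kmeans.labels_

print(profiles)
```

Neighborhoods with the same mix of venue categories end up in the same cluster, which is the property we rely on when looking for Manhattan neighborhoods that resemble a given one in Toronto.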

##### Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
# library to handle data in a vectorized manner
import numpy as np 

# library for data analysis
import pandas as pd 

# library to handle JSON files
import json 

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# library to handle requests
import requests 

# transform JSON file into a pandas dataframe
# (json_normalize is exposed at the top level since pandas 1.0)
from pandas import json_normalize 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# map rendering library
import folium 

print('Libraries imported.')

Libraries imported.


We will first load our datasets and transform them into pandas dataframes. Then we will trim those dataframes so that they contain only the relevant data.

We start with the New York City data.


In [60]:
# NEW YORK DATAFRAME
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
neighborhoods_data = newyork_data['features'] # all the relevant data is in the features key

# define the dataframe columns and collect one row per neighborhood
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 
rows = []

for data in neighborhoods_data:
    borough = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    # the coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
new_york = pd.DataFrame(rows, columns=column_names)

print("The New York Data", " (size: ", new_york.shape, ")")
new_york.head()


The New York Data  (size:  (306, 4) )


|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
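
As an alternative to the explicit loop, the same flattening can be sketched with `pd.json_normalize`. The two features below are hand-written stand-ins that mimic the structure the loop reads (`properties.borough`, `properties.name`, `geometry.coordinates`), and `new_york_alt` is a hypothetical name; this is an illustration, not the code used in the rest of the notebook.

```python
import pandas as pd

# Two hand-written features mimicking the structure of the 'features' list
# (values taken from the first rows of the table above).
sample_features = [
    {'type': 'Feature',
     'geometry': {'type': 'Point', 'coordinates': [-73.847201, 40.894705]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'type': 'Feature',
     'geometry': {'type': 'Point', 'coordinates': [-73.829939, 40.874294]},
     'properties': {'name': 'Co-op City', 'borough': 'Bronx'}},
]

# Nested dictionaries become dotted column names; lists are kept as lists.
flat = pd.json_normalize(sample_features)

new_york_alt = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    # coordinates are stored as [longitude, latitude]
    'Latitude': [lon_lat[1] for lon_lat in flat['geometry.coordinates']],
    'Longitude': [lon_lat[0] for lon_lat in flat['geometry.coordinates']],
})

print(new_york_alt)
```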



We are only interested in the neighborhoods in Manhattan, so let's extract them from this dataframe.


In [59]:
manhattan_data = new_york[new_york['Borough'] == 'Manhattan'].reset_index(drop=True)
print("The Manhattan Data", " (size: ", manhattan_data.shape, ")")
manhattan_data.head()

The Manhattan Data  (size:  (40, 4) )


|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


We now move on to the Toronto data.

In [30]:
# TORONTO DATAFRAME
toronto = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
toronto.to_csv('beautifulsoup_pandas.csv')
toronto.rename(columns={'Neighbourhood' :'Neighborhood'}, inplace=True)

print('The Toronto Data', toronto.shape)
toronto.head()

The Toronto Data (289, 3)


|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


We clean the Toronto dataset in three steps: we remove the entries whose borough is "Not assigned", we merge the entries that share a postal code but list different neighborhoods by concatenating those neighborhoods, and we give any neighborhood that is still "Not assigned" the name of its borough.

In [31]:
toronto_clean = pd.DataFrame(columns = toronto.columns)
for index, element in toronto.iterrows():
    # skip the rows whose borough is not assigned
    if element['Borough'] != 'Not assigned':
        if element['Postcode'] in toronto_clean['Postcode'].unique():
            # postal code already present: append the new neighborhood to the existing row
            for index_, element_ in toronto_clean.iterrows():
                if (element_['Postcode'] == element['Postcode']) and (element['Neighborhood'] not in element_['Neighborhood']):
                    # write through .at, because the rows yielded by iterrows() may be copies
                    toronto_clean.at[index_, 'Neighborhood'] = element_['Neighborhood'] + ', ' + element['Neighborhood']
        else:
            toronto_clean.loc[len(toronto_clean)] = element
            
# a neighborhood that is still "Not assigned" takes the name of its borough
for index, element in toronto_clean.iterrows():
    if element['Neighborhood'] == 'Not assigned':
        toronto_clean.at[index, 'Neighborhood'] = element['Borough']
        
print('The Toronto Data Clean', toronto_clean.shape)
toronto_clean.head()

The Toronto Data Clean (103, 3)


|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
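
The same cleaning rules can also be expressed without explicit loops. The following is a sketch on a small hand-made table (`raw`, `assigned` and `clean` are hypothetical names), assuming that every postal code belongs to a single borough; it is an illustration of the rules rather than the code that produced the table above.

```python
import pandas as pd

# Small hand-made table with the same columns as the Toronto dataframe.
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront',
                     'Regent Park', 'Not assigned'],
})

# 1. Drop the rows whose borough is not assigned.
assigned = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. A neighborhood that is not assigned takes the name of its borough.
missing = assigned['Neighborhood'] == 'Not assigned'
assigned.loc[missing, 'Neighborhood'] = assigned.loc[missing, 'Borough']

# 3. Merge the rows that share a postal code, joining their distinct
#    neighborhoods in order of appearance.
clean = (assigned
         .groupby('Postcode', sort=False, as_index=False)
         .agg({'Borough': 'first',
               'Neighborhood': lambda names: ', '.join(dict.fromkeys(names))}))

print(clean)
```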


We now add the geospatial coordinates of each entry to our data, using the coordinates dataset described at the beginning of the "Data" section.

In [32]:
toronto_coord = pd.read_csv('https://cocl.us/Geospatial_data')
toronto_coord_order = pd.DataFrame(columns = ['Latitude', 'Longitude'])
for index, element in toronto_clean.iterrows():
    for index_, element_ in toronto_coord.iterrows():
        if element['Postcode'] == element_['Postal Code']:
             toronto_coord_order.loc[len(toronto_coord_order)] = [element_['Latitude'], element_['Longitude']]
toronto_clean['Latitude'] = toronto_coord_order['Latitude']
toronto_clean['Longitude'] = toronto_coord_order['Longitude']

print('The Toronto Data Clean with the geospatial coordinates', toronto_clean.shape)
toronto_clean.head()

The Toronto Data Clean with the geospatial coordinates (103, 5)


|   | Postcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
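
Matching the two tables on the postal code is a join, so it can also be sketched with `DataFrame.merge`. The two small frames below are hand-made stand-ins (`neighborhoods`, `coords` and `with_coords` are hypothetical names) that reuse the column names seen above, including `Postal Code` on the coordinates side.

```python
import pandas as pd

# Hand-made stand-ins for the cleaned Toronto data and the coordinates file.
neighborhoods = pd.DataFrame({
    'Postcode': ['M4A', 'M3A'],
    'Neighborhood': ['Victoria Village', 'Parkwoods'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every neighborhood in its original order and attaches
# the matching coordinates; a missing match would show up as NaN instead
# of silently shifting the rows.
with_coords = (neighborhoods
               .merge(coords, left_on='Postcode', right_on='Postal Code', how='left')
               .drop(columns='Postal Code'))

print(with_coords)
```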


Let's visualize the data.

In [53]:
address_man = 'Manhattan, NY'

geolocator_man = Nominatim(user_agent="manhattan_explorer")
location_man = geolocator_man.geocode(address_man)
latitude_man = location_man.latitude
longitude_man = location_man.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude_man, longitude_man))

The geographical coordinates of Manhattan are 40.7900869, -73.9598295.
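
Geocoding with Nominatim needs a network connection and may occasionally return no result. As a possible fallback, the map could be centered on the mean of the neighborhood coordinates we already have. The helper `map_center` below is a hypothetical name and the two rows are copied from the Manhattan table; this is a sketch, not part of the original workflow.

```python
import pandas as pd

def map_center(df):
    """Return (latitude, longitude) of the mean point of a dataframe
    that has 'Latitude' and 'Longitude' columns."""
    return df['Latitude'].mean(), df['Longitude'].mean()

# Two rows copied from the Manhattan table, for illustration only.
sample = pd.DataFrame({
    'Neighborhood': ['Marble Hill', 'Chinatown'],
    'Latitude': [40.876551, 40.715618],
    'Longitude': [-73.91066, -73.994279],
})

center_lat, center_lon = map_center(sample)
print('Approximate center: {}, {}'.format(center_lat, center_lon))
```

The resulting pair can be passed as the `location` of a `folium.Map` in place of the geocoded coordinates.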


In [54]:
map_manhattan = folium.Map(location=[latitude_man, longitude_man], zoom_start=11)
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

In [55]:
address_tor = 'Toronto'

geolocator_tor = Nominatim(user_agent="toronto_explorer")
location_tor = geolocator_tor.geocode(address_tor)
latitude_tor = location_tor.latitude
longitude_tor = location_tor.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude_tor, longitude_tor))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [56]:
map_toronto = folium.Map(location=[latitude_tor, longitude_tor], zoom_start=10)
for lat, lng, borough, neighborhood in zip(toronto_clean['Latitude'], toronto_clean['Longitude'], toronto_clean['Borough'], toronto_clean['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Methodology <a name="methodology"></a>

## Analysis <a name="analysis"></a>

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>