# **The Battle of the Neighborhoods - New York vs Toronto**
## Final Project of the 'Applied Data Science Capstone' in Coursera

## 1- Introduction

In this final capstone project we will use the Foursquare API to explore neighborhoods in New York City and Toronto. We will use the explore function to get the most common venue categories in each neighborhood and then use this feature to group the neighborhoods of the two cities into clusters and determine how similar or dissimilar they are.

This information will be useful to people moving from one city to the other, helping them decide which neighborhood to move into. Some will prefer a neighborhood whose characteristics, in terms of venues, resemble those of the place they are leaving. Others may be looking for something rather different.

## 2- Data

First let's import all the dependencies that we will need throughout this project:

In [None]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
from geopy.geocoders import Nominatim
from pandas import json_normalize  # pandas.io.json.json_normalize has been deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

print('Libraries imported.')

Now let's get the New York City dataset ready.

In [11]:
import wget
url= 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json'
wget.download(url)

'newyork_data.json'

In [12]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Now that we have the JSON file, let's transform the data into a pandas dataframe:

In [17]:
NYneighborhoods_data = newyork_data['features']

column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect one dict per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0).
rows = []
for data in NYneighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # Coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

NYneighborhoods = pd.DataFrame(rows, columns=column_names)
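
As an alternative to the explicit loop, the same flattening could be sketched with `json_normalize`. The snippet below is only an illustration: `sample_features` is a hypothetical two-entry stand-in for `newyork_data['features']`, filled with the rounded values shown in the dataframe preview, and it assumes every feature stores its coordinates as a two-element `[longitude, latitude]` list.

```python
import pandas as pd

# Hypothetical stand-in for newyork_data['features'] (assumed structure).
sample_features = [
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.847201, 40.894705]}},
    {"properties": {"borough": "Bronx", "name": "Co-op City"},
     "geometry": {"coordinates": [-73.829939, 40.874294]}},
]

# Nested dicts become dotted column names; the coordinate list stays a list.
flat = pd.json_normalize(sample_features)

# Split the [longitude, latitude] pairs into two numeric columns.
coords = pd.DataFrame(flat["geometry.coordinates"].tolist(),
                      columns=["Longitude", "Latitude"])

sample_df = pd.DataFrame({
    "Borough": flat["properties.borough"],
    "Neighborhood": flat["properties.name"],
    "Latitude": coords["Latitude"],
    "Longitude": coords["Longitude"],
})
print(sample_df)
```

Both approaches should yield the same four columns; the loop is easier to follow, while `json_normalize` avoids per-row bookkeeping.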

Let's have a look at the New York City dataframe:

In [18]:
NYneighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [21]:
print('The New York dataframe has {} boroughs and {} neighborhoods.'.format(
        len(NYneighborhoods['Borough'].unique()),
        NYneighborhoods.shape[0]
    )
)

The New York dataframe has 5 boroughs and 306 neighborhoods.


OK, the New York dataframe is ready. 

Now let's create the Toronto dataframe. In this case we will use the BeautifulSoup package to scrape the table from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [35]:
# Download the page source and locate the postal code table
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table', {'class': 'wikitable sortable'})

Now we create the pandas dataframe from the rows of that table:

In [36]:
postal_codes = []
boroughs = []
neighbourhoods = []

# Each data row has three cells: postal code, borough, neighbourhood
for row in table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 3:
        postal_codes.append(cells[0].find(string=True).strip())
        boroughs.append(cells[1].find(string=True).strip())
        neighbourhoods.append(cells[2].find(string=True).strip())

Tneighborhoods = pd.DataFrame({'PostalCode': postal_codes,
                               'Borough': boroughs,
                               'Neighbourhood': neighbourhoods})
Tneighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Just a couple more steps to get the dataframe ready to go. First we will rename the column 'Neighbourhood' to 'Neighborhood' (to match the NY dataframe), then drop the rows whose 'Borough' is 'Not assigned'.

In [37]:
Tneighborhoods = Tneighborhoods.rename(columns={'Neighbourhood': 'Neighborhood'})

In [38]:
indexNames = Tneighborhoods[Tneighborhoods['Borough'] == 'Not assigned' ].index
Tneighborhoods.drop(indexNames , inplace=True)
Tneighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
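
The same filtering can also be written as a boolean mask, which avoids the in-place drop and lets us reset the index so it starts at 0 again. This is a minimal sketch on a hypothetical four-row frame (`toy`) built from the rows shown earlier, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical miniature of the scraped table (values taken from the preview).
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough and renumber the index from 0.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Resetting the index is optional here, since the later merge produces a fresh index anyway.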


Next, we read the coordinates of each postal code from a CSV file available at https://cocl.us/Geospatial_data and merge them into our Toronto dataframe:

In [39]:
coordf = pd.read_csv('https://cocl.us/Geospatial_data')
coordf = coordf.rename(columns={'Postal Code': 'PostalCode'})
Tneighborhoods= pd.merge(Tneighborhoods, coordf, on='PostalCode')
Tneighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
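
By default `pd.merge` performs an inner join, so a postal code missing from the coordinates file would be dropped without warning. One way to check for that, sketched here on hypothetical toy frames (`left`, `coords`, and the made-up code `M9Z` are assumptions for illustration), is a left join with `indicator=True`:

```python
import pandas as pd

# Hypothetical toy inputs; 'M9Z' is an invented code with no coordinates.
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Example Borough"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every row of `left`; the _merge column records whether a
# match was found, and validate raises if a postal code appears twice.
checked = left.merge(coords, on="PostalCode", how="left",
                     indicator=True, validate="one_to_one")

missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print("Postal codes without coordinates:", missing)
```

If `missing` turns out empty on the real data, the inner join used above loses nothing.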


Let's drop the PostalCode column, which is no longer needed:

In [40]:
# The positional axis argument to drop() was removed in pandas 2.0
Tneighborhoods = Tneighborhoods.drop(columns='PostalCode')
Tneighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,North York,Parkwoods,43.753259,-79.329656
1,North York,Victoria Village,43.725882,-79.315572
2,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


Alright, now we have both the New York and Toronto dataframes ready. Let's concatenate them into a single dataframe.

In [41]:
dataframes = [NYneighborhoods, Tneighborhoods]
df = pd.concat(dataframes)
df

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
...,...,...,...,...
98,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
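
Note that the combined frame keeps each city's original index, so labels such as 0 to 4 appear once per city, and nothing in the columns records which city a row came from. A possible refinement, sketched on hypothetical two-row frames (`ny` and `to`, and the `City` column, are assumptions added for illustration), is to tag each frame before concatenating and to renumber the index:

```python
import pandas as pd

# Hypothetical two-row excerpts of the New York and Toronto dataframes.
ny = pd.DataFrame({
    "Borough": ["Bronx", "Bronx"],
    "Neighborhood": ["Wakefield", "Co-op City"],
    "Latitude": [40.894705, 40.874294],
    "Longitude": [-73.847201, -73.829939],
})
to = pd.DataFrame({
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Label the origin of each row, then stack the frames with a fresh 0..n-1 index.
combined = pd.concat(
    [ny.assign(City="New York"), to.assign(City="Toronto")],
    ignore_index=True,
)
print(combined)
```

A unique index makes later row lookups unambiguous, and the `City` column would make it easy to see how each cluster splits between the two cities.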


And 'voilà', we have a single dataframe with the boroughs of New York and Toronto, their neighborhoods, and the latitude and longitude of each one. We are now ready for the analysis: exploring the venues in each city with the Foursquare API and comparing the two cities and their neighborhoods.