## Comparing Diverse Cities to Help People Relocate

## Introduction

In business, people are often asked to relocate to new cities. This can be an extremely stressful and overwhelming event for the person being relocated. To help reduce this stress, it would be useful to help them compare their current location with their new one.
Before smartphones, social media and the prevalence of machine learning algorithms, a person had to rely on other people's suggestions or read books about the new city. They might also take some time to visit it, but since cities are big, it could take more than a week to see all the neighborhoods.
They may already know that they want to move to a neighborhood that is very similar to their current one. However, some people like to try new things and may want to find a neighborhood that is very different.
It would be great if the candidate neighborhoods could be narrowed down to match what the person is looking for. In our case, we will assume the individual is being relocated from New York to Toronto.


## Data Acquisition and Cleansing
# Venues Data
We will use the Foursquare venue data API to get the most frequented venues. Since this API requires the latitude and longitude of each neighborhood, we need to get those coordinates from other sources.
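To make the dependency on coordinates concrete, here is a minimal sketch of how a venue request for one neighborhood could be assembled. It assumes the legacy Foursquare v2 `venues/explore` endpoint; the helper name `build_explore_url`, the default `version`, `radius` and `limit` values, and the placeholder credentials are illustrative assumptions, not values taken from this notebook. The sketch only builds the URL and does not call the API.

```python
from urllib.parse import urlencode

# Assumption: the legacy Foursquare v2 "explore" endpoint is the one being used.
FOURSQUARE_EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lon, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build the request URL for venues around one (lat, lon) point.

    The defaults for version, radius and limit are illustrative only.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lon}",  # the API expects "latitude,longitude"
        "radius": radius,
        "limit": limit,
    }
    # Keep the comma in "ll" unescaped so the URL stays readable.
    return f"{FOURSQUARE_EXPLORE_URL}?{urlencode(params, safe=',')}"


# Placeholder credentials; the coordinates are those of Marble Hill in Manhattan.
example_url = build_explore_url(40.876551, -73.91066,
                                "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(example_url)
```

One request is needed per neighborhood, which is why every neighborhood must have a latitude and longitude before any venue data can be fetched.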
# New York Neighborhood Coordinates Data
We can use the publicly available dataset from https://geo.nyu.edu/catalog/nyu_2451_34572

# Toronto Neighborhood Coordinates Data
	
We can use the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M that has a list of all the boroughs and neighborhoods in Toronto. We will extract this list from an HTML table using Beautiful Soup. Unfortunately, I was not able to get GEOCODE to work to obtain the latitude and longitude of the neighborhoods, so I use the Geospatial Coordinates file provided by the class.
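As a compact illustration of the table extraction, the sketch below parses a tiny hand-written stand-in for the Wikipedia table. The markup is made up for the example, and the built-in `html.parser` backend is used so the sketch does not need `lxml`; the real page may be structured differently.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (illustrative markup only).
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

# Header cells give the column names; strip=True removes stray newlines.
header = [th.get_text(strip=True) for th in table.find_all("th")]

# Every row that contains <td> cells is a data row.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")
]

toy_df = pd.DataFrame(rows, columns=header)
print(toy_df)
```

Stripping whitespace while reading each cell avoids a separate pass to remove newline characters afterwards.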

# Cleansing & restrictions
The Foursquare API limits the number of calls you can make per session. This forced me to reduce the number of neighborhoods used in the analysis, so I am going to use the Manhattan borough of New York and the Downtown Toronto borough.

We removed any unassigned data from both datasets. The Wikipedia HTML table contained newline characters that needed to be removed. I also needed to give the dataframes that hold the New York and Toronto data the same column names.

# Final Data Set
Once the data sets are prepared, I will concatenate them into one neighborhood dataframe.
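The concatenation step can be sketched on a few rows as follows. The sample rows are copied from the outputs in this notebook; the variable names and the use of `ignore_index=True` are choices made for this illustration, not necessarily what the final cell does.

```python
import pandas as pd

columns = ["City", "Borough", "Neighborhood", "Latitude", "Longitude"]

ny_sample = pd.DataFrame(
    [["NY", "NY_Manhattan", "NY_Marble Hill", 40.876551, -73.91066],
     ["NY", "NY_Manhattan", "NY_Chinatown", 40.715618, -73.994279]],
    columns=columns,
)
to_sample = pd.DataFrame(
    [["TO", "TO_Downtown Toronto", "TO_Harbourfront", 43.65426, -79.360636]],
    columns=columns,
)

# Concatenation aligns on column names, so check that they match first.
assert list(ny_sample.columns) == list(to_sample.columns)

# ignore_index=True renumbers the combined rows 0..n-1.
combined = pd.concat([ny_sample, to_sample], ignore_index=True)
print(combined)
```

If the column names differed (for example `Neighbourhood` versus `Neighborhood`), `pd.concat` would create both columns and fill the gaps with `NaN`, which is why the names are harmonized beforehand.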

# Feature Selection
The features for each neighborhood will be its 5 most frequented venues. I will use the mean frequency of visits to determine the top 5 venues.
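One possible way to compute such a ranking is sketched below on made-up venue records. It assumes that "frequency" means the share of a neighborhood's venues that fall into each venue category; the column name `Venue Category`, the helper `top_venues` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical venue records: one row per venue found in a neighborhood.
venues = pd.DataFrame({
    "Neighborhood": ["NY_Chinatown"] * 4 + ["TO_Harbourfront"] * 3,
    "Venue Category": ["Bakery", "Bakery", "Tea Room", "Park",
                       "Coffee Shop", "Coffee Shop", "Park"],
})

# One-hot encode the categories, then average per neighborhood:
# each value is the share of that neighborhood's venues in a category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
freq = onehot.groupby("Neighborhood").mean()


def top_venues(row, n=5):
    """Return up to n categories, ordered by descending mean frequency."""
    return row.sort_values(ascending=False).head(n).index.tolist()


top5 = {name: top_venues(row) for name, row in freq.iterrows()}
for name, categories in top5.items():
    print(name, categories)
```

Because every neighborhood is described by shares rather than raw counts, neighborhoods with many venues and neighborhoods with few remain comparable.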


## 1. Download Toronto City into Dataframe TO_neighborhoods

In [2]:
import pandas as pd # library for data analysis
import requests # library to handle requests


In [3]:
response = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
htmltext = response.text
print('HTML Downloaded')

HTML Downloaded


In [4]:
from bs4 import BeautifulSoup

# Parse the page and locate the postal code table by its CSS classes
soup = BeautifulSoup(htmltext, 'lxml')
table = soup.find('table', 'wikitable sortable')

In [5]:
n_columns = 0
n_rows = 0
column_names = []

# First pass: count the rows and columns,
# and pick up the column titles if we can
for row in table.find_all('tr'):

    # Count the data rows in the table
    td_tags = row.find_all('td')
    if len(td_tags) > 0:
        n_rows += 1
        if n_columns == 0:
            # Set the number of columns for our table
            n_columns = len(td_tags)

    # Handle column names if we find them
    th_tags = row.find_all('th')
    if len(th_tags) > 0 and len(column_names) == 0:
        for th in th_tags:
            column_names.append(th.get_text().replace('\n', ''))

# Safeguard on column titles
if len(column_names) > 0 and len(column_names) != n_columns:
    raise Exception("Column titles do not match the number of columns")
print(column_names)

# Second pass: copy each cell into a preallocated dataframe
columns = column_names if len(column_names) > 0 else range(0, n_columns)
df = pd.DataFrame(columns=columns,
                  index=range(0, n_rows))
row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        df.iat[row_marker, column_marker] = column.get_text()
        column_marker += 1
    if len(columns) > 0:
        row_marker += 1

# Strip newline characters and drop rows with no assigned borough
df = df.replace('\n', '', regex=True)
df = df[df.Borough != "Not assigned"]
df.head()

['Postcode', 'Borough', 'Neighbourhood']


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [6]:
!wget -O Geospatial_data.csv http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
    
    

--2019-03-25 00:09:12--  http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
Resolving cocl.us (cocl.us)... 169.48.113.201
Connecting to cocl.us (cocl.us)|169.48.113.201|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv [following]
--2019-03-25 00:09:12--  https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
Connecting to cocl.us (cocl.us)|169.48.113.201|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-03-25 00:09:12--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.27.197
Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following

In [7]:
Geospatial_data_df = pd.read_csv("Geospatial_data.csv")

# Join the coordinates onto the scraped table by postal code
Geospatial_data_df.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
TO_neighborhoods = pd.merge(df, Geospatial_data_df, on='Postcode')

# Reuse the Postcode column as a City label and match the New York column names
TO_neighborhoods.rename(columns={'Postcode': 'City', 'Neighbourhood': 'Neighborhood'}, inplace=True)
TO_neighborhoods['City'] = 'TO'

# Keep only Downtown Toronto and prefix the names so they stay distinct from the New York ones
downtown_toronto_data = TO_neighborhoods[TO_neighborhoods['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
downtown_toronto_data['Borough'] = 'TO_' + downtown_toronto_data['Borough']
downtown_toronto_data['Neighborhood'] = 'TO_' + downtown_toronto_data['Neighborhood']
downtown_toronto_data.head()

Unnamed: 0,City,Borough,Neighborhood,Latitude,Longitude
0,TO,TO_Downtown Toronto,TO_Harbourfront,43.65426,-79.360636
1,TO,TO_Downtown Toronto,TO_Regent Park,43.65426,-79.360636
2,TO,TO_Downtown Toronto,TO_Ryerson,43.657162,-79.378937
3,TO,TO_Downtown Toronto,TO_Garden District,43.657162,-79.378937
4,TO,TO_Downtown Toronto,TO_St. James Town,43.651494,-79.375418


<a id='item1'></a>

## 1. Download New York City into Dataframe NY_neighborhoods

In [8]:
import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas 1.0+ import path)


In [9]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

with open('newyork_data.json') as NY_json_data:
    NY_data = json.load(NY_json_data)

# All the relevant data is in the 'features' key, which is a list of the neighborhoods,
# so let's define a new variable that holds this data
NY_features_data = NY_data['features']

# Transform the JSON data into a pandas dataframe
# define the dataframe columns
column_names = ['City', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect one row per neighborhood (DataFrame.append was removed in pandas 2.0)
rows = []
for data in NY_features_data:
    city = 'NY'
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # Coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'City': city,
                 'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# Build the dataframe in one step
NY_neighborhoods = pd.DataFrame(rows, columns=column_names)

print("NY Loaded")

NY Loaded


In [10]:
NY_neighborhoods.head()
NY_neighborhoods.shape

(306, 5)

We only need to load Manhattan, since the Foursquare API allows only a limited number of requests.

In [11]:
manhattan_data = NY_neighborhoods[NY_neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)

In [12]:
manhattan_data['Borough'] = 'NY_' + manhattan_data['Borough'] 
manhattan_data['Neighborhood'] = 'NY_' + manhattan_data['Neighborhood'] 
manhattan_data.head()


Unnamed: 0,City,Borough,Neighborhood,Latitude,Longitude
0,NY,NY_Manhattan,NY_Marble Hill,40.876551,-73.91066
1,NY,NY_Manhattan,NY_Chinatown,40.715618,-73.994279
2,NY,NY_Manhattan,NY_Washington Heights,40.851903,-73.9369
3,NY,NY_Manhattan,NY_Inwood,40.867684,-73.92121
4,NY,NY_Manhattan,NY_Hamilton Heights,40.823604,-73.949688


## 1. Concatenate the NY neighborhoods with Toronto neighborhoods into neighborhoods

In [13]:
# reset_index() keeps each city's old row number in an extra 'index' column
neighborhood_data = pd.concat([manhattan_data, downtown_toronto_data], axis=0).reset_index()
neighborhood_data.head()

Unnamed: 0,index,City,Borough,Neighborhood,Latitude,Longitude
0,0,NY,NY_Manhattan,NY_Marble Hill,40.876551,-73.91066
1,1,NY,NY_Manhattan,NY_Chinatown,40.715618,-73.994279
2,2,NY,NY_Manhattan,NY_Washington Heights,40.851903,-73.9369
3,3,NY,NY_Manhattan,NY_Inwood,40.867684,-73.92121
4,4,NY,NY_Manhattan,NY_Hamilton Heights,40.823604,-73.949688


In [14]:
neighborhood_data.tail()

Unnamed: 0,index,City,Borough,Neighborhood,Latitude,Longitude
72,32,TO,TO_Downtown Toronto,TO_Cabbagetown,43.667967,-79.367675
73,33,TO,TO_Downtown Toronto,TO_St. James Town,43.667967,-79.367675
74,34,TO,TO_Downtown Toronto,TO_First Canadian Place,43.648429,-79.38228
75,35,TO,TO_Downtown Toronto,TO_Underground city,43.648429,-79.38228
76,36,TO,TO_Downtown Toronto,TO_Church and Wellesley,43.66586,-79.38316


In [15]:
neighborhood_data.shape

(77, 6)
