# IBM Data Science Coursera Capstone Project

# Table of Contents
   * [Introduction: Business Problem](#introduction)
   * [Data](#data)
   * [Methodology](#methodology)
   * [Analysis](#analysis)
   * [Results and Discussion](#results)
   * [Conclusion](#conclusion)

# Introduction: Business Problem<a id='introduction'></a>

   A Fortune 500 company is looking to move its headquarters to either Toronto or New York City. The company wants insight into the neighborhoods and local businesses in the cities so that its employees may have the optimum living standards and quality of life. 
    
   This project will attempt to explore patterns of neighborhoods between Toronto, Canada and New York City, New York by categorizing them into clusters in order to identify existing similarities and dissimilarities between certain neighborhoods in the two cities, and determine which neighborhoods best fit the culture of the Fortune 500 company’s employees.   
   
   From there on, recommendations can be made on which neighborhood will be most suitable for the company to make a decision.  

# Data<a id='data'></a>
    
   The data used for this project will be acquired from the respective cities Wikipedia website pages.   
   
   The datasets consists of the postal codes, neighborhood names, latitude, and longitude information for each neighborhood. Foursquare API search feature will be used to collect neighborhood venue information. Details about local venues and locality will be provide insight into the qualities of a neighborhood. In addition to Foursquare, various python packages will be used to create maps and machine learning models to further provide insights into our neighborhood battle project.
    
   In summary, the following data is required to meet the objective:
   
   - Toronto Neighborhoods - https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.
   - Toronto Latitude and Longitude - http://cocl.us/Geospatial_data
   - New York City neighborhoods - https://geo.nyu.edu/catalog/nyu_2451_34572
   - New York City Latitude and Longitude = Python Geolibrar

# Methodology<a id='methodology'></a>

**_Work Flow_**  

   1. HTTP requests would be made to this Foursquare API server using zip codes of the Seattle city neighborhoods to pull the location information (Latitude and Longitude).
   2. Foursquare API search feature would be enabled to collect the nearby places of the neighborhoods. Due to http request limitations the number of places per neighborhood parameter would reasonably be set to 100 and the radius parameter would be set to 700.
   3. Folium- Python visualization library would be used to visualize the neighborhoods cluster distribution of Seattle city over an interactive leaflet map.
   4. Extensive comparative analysis of two randomly picked neighborhoods world be carried out to derive the desirable insights from the outcomes using python’s scientific libraries Pandas, NumPy and Scikit-learn.
   5. Unsupervised machine learning algorithm K-mean clustering would be applied to form the clusters of different categories of places residing in and around the neighborhoods. These clusters from each of those two chosen neighborhoods would be analyzed individually collectively and comparatively to derive the conclusions.

**_The following are the Python packages_**  

   - Pandas - Library for Data Analysis
   - NumPy – Library to handle data in a vectorized manner
   - JSON – Library to handle JSON files
   - Geopy – To retrieve Location Data
   - Requests – Library to handle http requests
   - Matplotlib – Python Plotting Module
   - Sklearn – Python machine learning Library
   - Folium – Map rendering Library

# Analysis<a id='analysis'></a>

## Initialization
    
   Initialize required library.

In [58]:
# Load needed libraries for data collection

# HTML request and scraper library
!pip install beautifulsoup4
!pip install lxml
import requests
from bs4 import BeautifulSoup

# Geocoding library
!conda install -c conda-forge geopy --yes # Unquote to install geopy
from geopy.geocoders import ArcGIS # module to convert an address into latitude and longitude values

# Library for data analysis
import pandas as pd
from pandas import json_normalize # Function to transform json
import numpy as np

!conda install -c conda-forge folium=0.5.0 --yes # Unquote to install folium
import folium # map plotting library
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import collapsible JSON for exploration
from IPython.display import JSON

# k-means for categorization
from sklearn.cluster import KMeans

# Pretty print
from pprint import pprint

print('Libary Imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libary Imported.


## Data Gathering of Toronto

----
*The dateset being used is found at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.*

*The dataset is a list of Toronto's zipcodes which includes the boroughs and neighborhood names.*

---

In [2]:
#Obtain Postal Code, Borough, and Neighborhood information from Wikipedia
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header = 0)

#Obtain the first table
df_toronto = table[0]
df_toronto.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


**Transform the data**

In [17]:
df_toronto.rename(columns = {"Postal code": "Postal Code", "Neighbourhood": "Neighborhood"}, inplace = True)

#Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
df_toronto.drop(df_toronto[df_toronto.Borough == 'Not assigned'].index, inplace=True)
#df.head()

#Combine the neighborhoods that exists in one postal code
df_toronto = df_toronto.groupby(['Postal Code', 'Borough'])['Neighborhood'].apply(lambda x: ','.join(x)).reset_index()
#df.head()

#Change unassigned Neighborhood to its Borough's name
df_toronto.loc[85,'Neighborhood'] = 'Queen\'s Park'

print (df_toronto.shape)

df_toronto.head()

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


**Join neighborhood table with latitude and longitude information**

In [19]:
#Create a dataframe of the latitude and longitudes of the Toronto Neighborhoods
latlong = pd.read_csv("http://cocl.us/Geospatial_data")
latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [20]:
print(latlong.shape)
latlong.head()

(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [21]:
latlong.tail()

Unnamed: 0,Postal Code,Latitude,Longitude
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437
102,M9W,43.706748,-79.594054


**Join latitude and longitude dataframe with neighborhood dataframe**

In [22]:
#Join the Lat and Long dataframe to Neighborhoods dataframe
df_toronto.set_index("Postal Code")
latlong.set_index("Postal Code")
neighbor=pd.merge(df_toronto, latlong)
neighbor.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [23]:
print('Toronto has {} boroughs and {} neighborhoods.'.format(
        len(neighbor['Borough'].unique()),
        neighbor.shape[0]
    )
)

Toronto has 10 boroughs and 103 neighborhoods.


**Use geopy library to get the latitude and longitude values of Toronto, Canada**

In [26]:
import geocoder
from geopy.geocoders import Nominatim
print('Imported')

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geocoder


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    click-7.1.1                |     pyh8c360ce_0          64 KB  conda-forge
    future-0.18.2              |   py36h9f0ad1d_1         714 KB  conda-forge
    geocoder-1.38.1            |             py_1          53 KB  conda-forge
    ratelim-0.1.6              |             py_2           6 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         837 KB

The following NEW packages will be INSTALLED:

  click              conda-forge/noarch::click-7.1.1-pyh8c360ce_0
  decorator          conda-forge/noarch::decorator-4.4.2-py_0
  future             conda-forg

In [27]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="Canada_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto, Canada are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto, Canada are 43.6534817, -79.3839347.


**Create a map of Toronto with neighborhoods superimposed on top**

In [28]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighbor['Latitude'], neighbor['Longitude'], neighbor['Borough'], neighbor['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Data Gathering of Scarborough, Toronto

Let's simplify the above map and segment and cluster only the neighborhoods in Toronto. So let's slice the original dataframe and create a new dataframe of the **Scarborough** Neighborhood data.

In [31]:
scarborough_data = neighbor[neighbor['Borough'] == 'Scarborough'].reset_index(drop=True)
print(scarborough_data.shape)
scarborough_data.head()

(17, 5)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


Lets get the geographical coordinates of Scarborough

In [32]:
address = 'Scarborough, Toronto'

geolocator = Nominatim(user_agent="Canada_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Scarborough, CA are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Scarborough, CA are 43.773077, -79.257774.


In [33]:
# create map of Scarborough using latitude and longitude values
map_scarborough = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(scarborough_data['Latitude'], scarborough_data['Longitude'], scarborough_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_scarborough)  
    
map_scarborough

## Explore Scarborough neighborhood in Toronto with Foursquare API

Add all credentials.

In [34]:
CLIENT_ID = '20H0EVL42BOXDAMD3SCWXPIVK4DJRKOMZCWXINUNZYDAPLAL' # your Foursquare ID
CLIENT_SECRET = 'WMR0PF32PVP4PJQ11XD5Q2QFLUAIZ2KII4P5OITTACVA2ZP1' # your Foursquare Secret
VERSION = '20200427'
LIMIT = 100
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 20H0EVL42BOXDAMD3SCWXPIVK4DJRKOMZCWXINUNZYDAPLAL
CLIENT_SECRET:WMR0PF32PVP4PJQ11XD5Q2QFLUAIZ2KII4P5OITTACVA2ZP1


**Function to explore neighborhoods**

In [36]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    print('Found {} venues in {} neighborhoods.'.format(nearby_venues.shape[0], len(venues_list)))
    
    return(nearby_venues)

print("It's ready.")

It's ready.


In [37]:
# Check the venues in Scarborough
scarborough_venues = getNearbyVenues(names=scarborough_data['Neighborhood'],
                                   latitudes=scarborough_data['Latitude'],
                                   longitudes=scarborough_data['Longitude']
                                  )

Found 95 venues in 17 neighborhoods.


In [38]:
print(scarborough_venues.shape)
scarborough_venues.head()

(95, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Malvern / Rouge,43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497,Scarborough Historical Society,43.788755,-79.162438,History Museum
3,Guildwood / Morningside / West Hill,43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank
4,Guildwood / Morningside / West Hill,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


In [39]:
#Venues per Neighborhood
scarborough_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,5,5,5,5,5,5
Birch Cliff / Cliffside West,4,4,4,4,4,4
Cedarbrae,8,8,8,8,8,8
Clarks Corners / Tam O'Shanter / Sullivan,12,12,12,12,12,12
Cliffside / Cliffcrest / Scarborough Village West,3,3,3,3,3,3
Dorset Park / Wexford Heights / Scarborough Town Centre,5,5,5,5,5,5
Golden Mile / Clairlea / Oakridge,10,10,10,10,10,10
Guildwood / Morningside / West Hill,7,7,7,7,7,7
Kennedy Park / Ionview / East Birchmount Park,6,6,6,6,6,6
Malvern / Rouge,1,1,1,1,1,1


In [40]:
# Check the how many unique categories
print('There are {} distinct venues in {} categories.'.format(
    len(scarborough_venues['Venue'].unique()),len(scarborough_venues['Venue Category'].unique())))

#print('There are {} uniques categories.'.format(len(scarborough_venues['Venue Category'].unique())))

There are 84 distinct venues in 58 categories.


## Analyze each Neighborhood

In [41]:
# one hot encoding
scarborough_onehot = pd.get_dummies(scarborough_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
scarborough_onehot['Neighborhood'] = scarborough_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [scarborough_onehot.columns[-1]] + list(scarborough_onehot.columns[:-1])
scarborough_onehot = scarborough_onehot[fixed_columns]

scarborough_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Breakfast Spot,Bubble Tea Shop,Bus Line,...,Playground,Rental Car Location,Sandwich Place,Shopping Mall,Skating Rink,Soccer Field,Supermarket,Thai Restaurant,Train Station,Vietnamese Restaurant
0,Malvern / Rouge,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Rouge Hill / Port Union / Highland Creek,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Rouge Hill / Port Union / Highland Creek,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Guildwood / Morningside / West Hill,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Guildwood / Morningside / West Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [42]:
scarborough_grouped = scarborough_onehot.groupby('Neighborhood').mean().reset_index()
scarborough_grouped

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Breakfast Spot,Bubble Tea Shop,Bus Line,...,Playground,Rental Car Location,Sandwich Place,Shopping Mall,Skating Rink,Soccer Field,Supermarket,Thai Restaurant,Train Station,Vietnamese Restaurant
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,...,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0
1,Birch Cliff / Cliffside West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0
2,Cedarbrae,0.0,0.125,0.0,0.125,0.125,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0
3,Clarks Corners / Tam O'Shanter / Sullivan,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0
4,Cliffside / Cliffcrest / Scarborough Village West,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Dorset Park / Wexford Heights / Scarborough To...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2
6,Golden Mile / Clairlea / Oakridge,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.2,...,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0
7,Guildwood / Morningside / West Hill,0.0,0.0,0.0,0.0,0.142857,0.0,0.142857,0.0,0.0,...,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Kennedy Park / Ionview / East Birchmount Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0
9,Malvern / Rouge,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Each neighborhood with the top 5 venues

In [43]:
num_top_venues = 10

for hood in scarborough_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = scarborough_grouped[scarborough_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0  Latin American Restaurant   0.2
1             Breakfast Spot   0.2
2               Skating Rink   0.2
3                     Lounge   0.2
4             Clothing Store   0.2
5        American Restaurant   0.0
6               Noodle House   0.0
7               Intersection   0.0
8         Italian Restaurant   0.0
9          Korean Restaurant   0.0


----Birch Cliff / Cliffside West----
                       venue  freq
0            College Stadium  0.25
1               Skating Rink  0.25
2      General Entertainment  0.25
3                       Café  0.25
4                       Park  0.00
5         Italian Restaurant  0.00
6          Korean Restaurant  0.00
7  Latin American Restaurant  0.00
8                     Lounge  0.00
9             Medical Center  0.00


----Cedarbrae----
                  venue  freq
0                Bakery  0.12
1                  Bank  0.12
2       Thai Restaurant  0.12
3    Athletics & Sports  0.12
4  

## Put into a pandas dataframe

Let's write a function to sort the venues in descending order

In [45]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
print("It's ready.")

It's ready.


Create a new dataframe and display the top ten venues for each neighborhood

In [46]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = scarborough_grouped['Neighborhood']

for ind in np.arange(scarborough_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(scarborough_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Clothing Store,Skating Rink,Breakfast Spot,Latin American Restaurant,Lounge,Coffee Shop,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant
1,Birch Cliff / Cliffside West,General Entertainment,Skating Rink,Café,College Stadium,Vietnamese Restaurant,Coffee Shop,Grocery Store,Gas Station,Fried Chicken Joint,Fast Food Restaurant
2,Cedarbrae,Gas Station,Thai Restaurant,Athletics & Sports,Bakery,Bank,Hakka Restaurant,Caribbean Restaurant,Fried Chicken Joint,Electronics Store,Department Store
3,Clarks Corners / Tam O'Shanter / Sullivan,Pizza Place,Italian Restaurant,Bank,Noodle House,Pharmacy,Coffee Shop,Chinese Restaurant,Fast Food Restaurant,Fried Chicken Joint,Gas Station
4,Cliffside / Cliffcrest / Scarborough Village West,American Restaurant,Motel,Movie Theater,Auto Garage,Coffee Shop,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant
5,Dorset Park / Wexford Heights / Scarborough To...,Indian Restaurant,Vietnamese Restaurant,Pet Store,Chinese Restaurant,Coffee Shop,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant
6,Golden Mile / Clairlea / Oakridge,Bakery,Bus Line,Metro Station,Soccer Field,Ice Cream Shop,Intersection,Park,Bus Station,Discount Store,Construction & Landscaping
7,Guildwood / Morningside / West Hill,Rental Car Location,Breakfast Spot,Medical Center,Electronics Store,Intersection,Mexican Restaurant,Bank,Fast Food Restaurant,College Stadium,Fried Chicken Joint
8,Kennedy Park / Ionview / East Birchmount Park,Coffee Shop,Bus Station,Hobby Shop,Discount Store,Department Store,Train Station,Bar,Breakfast Spot,Athletics & Sports,Gym
9,Malvern / Rouge,Fast Food Restaurant,Vietnamese Restaurant,Coffee Shop,Gym,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Electronics Store,Discount Store


In [47]:
neighborhoods_venues_sorted.iloc[11,]

Neighborhood              Rouge Hill / Port Union / Highland Creek
1st Most Common Venue                               History Museum
2nd Most Common Venue                                          Bar
3rd Most Common Venue                                  Coffee Shop
4th Most Common Venue                                          Gym
5th Most Common Venue                                Grocery Store
6th Most Common Venue                        General Entertainment
7th Most Common Venue                                  Gas Station
8th Most Common Venue                          Fried Chicken Joint
9th Most Common Venue                         Fast Food Restaurant
10th Most Common Venue                           Electronics Store
Name: 11, dtype: object

## Cluster the Scarborough Neighborhood using k-means

Run K-means to cluster neighborhood into three clusters

In [48]:
# set number of clusters
kclusters = 3

scarborough_grouped_clustering = scarborough_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=2).fit(scarborough_grouped_clustering)

# check cluster labels generated for each row in the dataframe
#kmeans.labels_[0:10] 
kmeans.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 0, 1, 1, 1, 1, 0], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top ten venues for each neighborhood

In [49]:
#Note that the neighborhood Upper Rouge does not have any venues, so I will drop from dataset
scarborough_data.drop(scarborough_data[scarborough_data.Neighborhood == 'Upper Rouge'].index, inplace = True)
#df_toronto.drop(df_toronto[df_toronto.Borough == 'Not assigned'].index, inplace=True)

scarborough_merged = scarborough_data

# add clustering labels
scarborough_merged['Cluster Labels'] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
scarborough_merged = scarborough_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

scarborough_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353,1,Fast Food Restaurant,Vietnamese Restaurant,Coffee Shop,Gym,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Electronics Store,Discount Store
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497,1,History Museum,Bar,Coffee Shop,Gym,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711,1,Rental Car Location,Breakfast Spot,Medical Center,Electronics Store,Intersection,Mexican Restaurant,Bank,Fast Food Restaurant,College Stadium,Fried Chicken Joint
3,M1G,Scarborough,Woburn,43.770992,-79.216917,1,Coffee Shop,Korean Restaurant,Vietnamese Restaurant,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store,Discount Store
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1,Gas Station,Thai Restaurant,Athletics & Sports,Bakery,Bank,Hakka Restaurant,Caribbean Restaurant,Fried Chicken Joint,Electronics Store,Department Store


## Visualize the map

In [50]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(scarborough_merged['Latitude'], scarborough_merged['Longitude'], scarborough_merged['Neighborhood'], scarborough_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Scarborough Neighborhood Cluster

Examine each cluster and determine the discriminating venue categories that distinguish each cluster

In [52]:
# Scarborough Clusters 0, 1, 2
scarborough_cluster_0 = scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 0, scarborough_merged.columns[[1] + list(range(4, scarborough_merged.shape[1]))]]

scarborough_cluster_1 = scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 1, scarborough_merged.columns[[1] + list(range(4, scarborough_merged.shape[1]))]]

scarborough_cluster_2 = scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 2, scarborough_merged.columns[[1] + list(range(4, scarborough_merged.shape[1]))]]

print('Done.')

Done.


In [53]:
scarborough_cluster_0

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Scarborough,-79.273304,0,Indian Restaurant,Vietnamese Restaurant,Pet Store,Chinese Restaurant,Coffee Shop,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant
15,Scarborough,-79.318389,0,Chinese Restaurant,Fast Food Restaurant,Grocery Store,Breakfast Spot,Gym,Discount Store,Electronics Store,Bubble Tea Shop,Pharmacy,Pizza Place


In [54]:
scarborough_cluster_1

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,-79.194353,1,Fast Food Restaurant,Vietnamese Restaurant,Coffee Shop,Gym,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Electronics Store,Discount Store
1,Scarborough,-79.160497,1,History Museum,Bar,Coffee Shop,Gym,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
2,Scarborough,-79.188711,1,Rental Car Location,Breakfast Spot,Medical Center,Electronics Store,Intersection,Mexican Restaurant,Bank,Fast Food Restaurant,College Stadium,Fried Chicken Joint
3,Scarborough,-79.216917,1,Coffee Shop,Korean Restaurant,Vietnamese Restaurant,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store,Discount Store
4,Scarborough,-79.239476,1,Gas Station,Thai Restaurant,Athletics & Sports,Bakery,Bank,Hakka Restaurant,Caribbean Restaurant,Fried Chicken Joint,Electronics Store,Department Store
5,Scarborough,-79.239476,1,Playground,Convenience Store,Construction & Landscaping,Vietnamese Restaurant,Clothing Store,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant
6,Scarborough,-79.262029,1,Coffee Shop,Bus Station,Hobby Shop,Discount Store,Department Store,Train Station,Bar,Breakfast Spot,Athletics & Sports,Gym
7,Scarborough,-79.284577,1,Bakery,Bus Line,Metro Station,Soccer Field,Ice Cream Shop,Intersection,Park,Bus Station,Discount Store,Construction & Landscaping
8,Scarborough,-79.239476,1,American Restaurant,Motel,Movie Theater,Auto Garage,Coffee Shop,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant
11,Scarborough,-79.295849,1,Middle Eastern Restaurant,Bakery,Sandwich Place,Shopping Mall,Breakfast Spot,Auto Garage,Gas Station,General Entertainment,Fried Chicken Joint,Coffee Shop


In [55]:
scarborough_cluster_2

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,Scarborough,-79.264848,2,General Entertainment,Skating Rink,Café,College Stadium,Vietnamese Restaurant,Coffee Shop,Grocery Store,Gas Station,Fried Chicken Joint,Fast Food Restaurant


## Data Gathering of New York

New York City has a total of 5 boroughs and 306 neighborhoods. In order to segement the neighborhoods and explore them, we will essentially need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough as well as the the latitude and logitude coordinates of each neighborhood.

Luckily, this dataset exists for free on the web. The link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

In [56]:
!wget -q -O 'newyork_data.json' https://ibm.box.com/shared/static/fbpwbovar7lf8p5sgddm06cgipa2rxpe.json
print('Data downloaded!')

Data downloaded!


Load and explore dataset

In [64]:
import json # library to handle JSON files

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

print('Done.')

Done.


Notice how all the relevant data is in the features key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.

In [65]:
neighborhoods_data = newyork_data['features']

In [66]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

## Transform the data into a pandas dataframe

In [67]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [68]:
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [70]:
print(neighborhoods.shape)
neighborhoods.head()

(306, 4)


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [71]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


## Use Geolibrary to get the latitude and longitude of New York City

In [72]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="NY_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 40.7127281, -74.0060152.


In [73]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

Lets simplify the above map and segment and cluster only the neighborhoods in Manhattan. So let's slice the original dataframe and create a new dataframe of the Queens neighborhood data.

In [74]:
queens_data = neighborhoods[neighborhoods['Borough'] == 'Queens'].reset_index(drop=True)
queens_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Queens,Astoria,40.768509,-73.915654
1,Queens,Woodside,40.746349,-73.901842
2,Queens,Jackson Heights,40.751981,-73.882821
3,Queens,Elmhurst,40.744049,-73.881656
4,Queens,Howard Beach,40.654225,-73.838138


Lets get the geographical location of Queens, NY

In [77]:
address = 'Queens, NY'

geolocator = Nominatim(user_agent="Queen's explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Queens are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Queens are 40.7498243, -73.7976337.


In [78]:
# create map of Manhattan using latitude and longitude values
map_queens = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(queens_data['Latitude'], queens_data['Longitude'], queens_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_queens)  
    
map_queens

## Lets explore the Long Island City neighborhood in Queens, NY

In [79]:
queens_data.loc[10, 'Neighborhood']

'Long Island City'

In [80]:
#Long Island City Latitude and Longitude values

neighborhood_latitude = queens_data.loc[10, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = queens_data.loc[10, 'Longitude'] # neighborhood longitude value

neighborhood_name = queens_data.loc[10, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Long Island City are 40.75021734610528, -73.93920223915505.


## Top 100 venues in Long Island City neighborhood within a radius of 500 meters

First, let's create the GET request URL named url.

In [81]:
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

url

'https://api.foursquare.com/v2/venues/explore?&client_id=20H0EVL42BOXDAMD3SCWXPIVK4DJRKOMZCWXINUNZYDAPLAL&client_secret=WMR0PF32PVP4PJQ11XD5Q2QFLUAIZ2KII4P5OITTACVA2ZP1&v=20200427&ll=40.75021734610528,-73.93920223915505&radius=500&limit=100'

In [82]:
#Send the GET request
results = requests.get(url).json()

In [83]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [84]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Chip,Dessert Shop,40.750069,-73.940831
1,Etto Espresso Bar,Coffee Shop,40.748703,-73.940689
2,Hilton Garden Inn,Hotel,40.750216,-73.936886
3,Dutch Kills,Cocktail Bar,40.74783,-73.940108
4,Brooklyn Boulders Queensbridge,Climbing Gym,40.752649,-73.94001


In [85]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

69 venues were returned by Foursquare.


## Analyze Each Neighborhood in Queens

In [86]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    print('Found {} venues in {} neighborhoods.'.format(nearby_venues.shape[0], len(venues_list)))
    
    return(nearby_venues)

In [88]:
queens_venues = getNearbyVenues(names=queens_data['Neighborhood'],
                                   latitudes=queens_data['Latitude'],
                                   longitudes=queens_data['Longitude']
                               )

Found 2073 venues in 81 neighborhoods.


In [89]:
queens_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Astoria,40.768509,-73.915654,Favela Grill,40.767348,-73.917897,Brazilian Restaurant
1,Astoria,40.768509,-73.915654,Orange Blossom,40.769856,-73.917012,Gourmet Shop
2,Astoria,40.768509,-73.915654,Titan Foods Inc.,40.769198,-73.919253,Gourmet Shop
3,Astoria,40.768509,-73.915654,CrossFit Queens,40.769404,-73.918977,Gym
4,Astoria,40.768509,-73.915654,Off The Hook,40.7672,-73.918104,Seafood Restaurant


In [90]:
print(queens_venues.shape)
queens_venues.tail()

(2073, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
2068,Queensbridge,40.756091,-73.945631,Roosevelt Island Running Path,40.754902,-73.948958,Gym / Fitness Center
2069,Queensbridge,40.756091,-73.945631,The Ravel Hotel Gym,40.753787,-73.948815,Athletics & Sports
2070,Queensbridge,40.756091,-73.945631,Profundo Pool Club,40.753719,-73.948878,Hotel Pool
2071,Queensbridge,40.756091,-73.945631,Estate Garden And Grill,40.7537,-73.948841,Beer Garden
2072,Queensbridge,40.756091,-73.945631,Track 114,40.753008,-73.947833,Platform


In [91]:
#Venues per Neighborhood
queens_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Arverne,18,18,18,18,18,18
Astoria,97,97,97,97,97,97
Astoria Heights,12,12,12,12,12,12
Auburndale,18,18,18,18,18,18
Bay Terrace,37,37,37,37,37,37
...,...,...,...,...,...,...
Sunnyside Gardens,100,100,100,100,100,100
Utopia,15,15,15,15,15,15
Whitestone,4,4,4,4,4,4
Woodhaven,23,23,23,23,23,23


In [92]:
print('There are {} distinct venues in {} categories.'.format(
    len(queens_venues['Venue'].unique()),len(queens_venues['Venue Category'].unique())))

There are 1705 distinct venues in 269 categories.


In [93]:
# one hot encoding
queens_onehot = pd.get_dummies(queens_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
queens_onehot['Neighborhood'] = queens_venues['Neighborhood'] 

# move neighborhood column to the first column
#fixed_columns = [scarborough_onehot.columns[-1]] + list(scarborough_onehot.columns[:-1])
#scarborough_onehot = scarborough_onehot[fixed_columns]

neighbor = queens_onehot['Neighborhood']
queens_onehot.drop(labels=['Neighborhood'], axis=1,inplace = True)
queens_onehot.insert(0, 'Neighborhood', neighbor)

queens_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport Terminal,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Astoria,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Astoria,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Astoria,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Astoria,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Astoria,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Group by Neighborhood, and exame the frequency of the occurrence of venue

In [95]:
queens_grouped = queens_onehot.groupby('Neighborhood').mean().reset_index()
queens_grouped

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport Terminal,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Arverne,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.00,0.055556,0.000000,0.0
1,Astoria,0.000000,0.000000,0.0,0.010309,0.000000,0.0,0.0,0.0,0.0,...,0.010309,0.000000,0.0,0.0,0.0,0.000000,0.00,0.010309,0.000000,0.0
2,Astoria Heights,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.00,0.000000,0.000000,0.0
3,Auburndale,0.000000,0.000000,0.0,0.055556,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.00,0.000000,0.000000,0.0
4,Bay Terrace,0.027027,0.000000,0.0,0.054054,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.027027,0.0,0.0,0.0,0.027027,0.00,0.000000,0.054054,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,Sunnyside Gardens,0.000000,0.000000,0.0,0.030000,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.010000,0.0,0.0,0.0,0.000000,0.01,0.000000,0.000000,0.0
77,Utopia,0.000000,0.066667,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.00,0.000000,0.000000,0.0
78,Whitestone,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.00,0.000000,0.000000,0.0
79,Woodhaven,0.000000,0.000000,0.0,0.000000,0.043478,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.00,0.000000,0.000000,0.0


## Each Neighborhood with the top 5 venues

In [96]:
num_top_venues = 5

for hood in queens_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = queens_grouped[queens_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Arverne----
             venue  freq
0        Surf Spot  0.22
1    Metro Station  0.11
2   Sandwich Place  0.11
3  Thai Restaurant  0.06
4      Pizza Place  0.06


----Astoria----
                       venue  freq
0  Middle Eastern Restaurant  0.06
1                        Bar  0.06
2                 Hookah Bar  0.05
3           Greek Restaurant  0.04
4   Mediterranean Restaurant  0.04


----Astoria Heights----
           venue  freq
0  Bowling Alley  0.08
1     Playground  0.08
2   Burger Joint  0.08
3   Liquor Store  0.08
4         Bakery  0.08


----Auburndale----
                  venue  freq
0    Italian Restaurant  0.06
1                 Train  0.06
2              Pharmacy  0.06
3             Pet Store  0.06
4  Fast Food Restaurant  0.06


----Bay Terrace----
                 venue  freq
0       Clothing Store  0.11
1        Women's Store  0.05
2  American Restaurant  0.05
3           Kids Store  0.05
4       Lingerie Store  0.05


----Bayside----
                 venue  fre

In [97]:
#Function to sort venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

## Top venues for each neighborhood

In [98]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = queens_grouped['Neighborhood']

for ind in np.arange(queens_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(queens_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Arverne,Surf Spot,Metro Station,Sandwich Place,Thai Restaurant,Pizza Place,Donut Shop,Coffee Shop,Bus Stop,Board Shop,Bed & Breakfast
1,Astoria,Middle Eastern Restaurant,Bar,Hookah Bar,Mediterranean Restaurant,Greek Restaurant,Seafood Restaurant,Pizza Place,Bakery,Food Truck,Ice Cream Shop
2,Astoria Heights,Plaza,Bakery,Bus Station,Bowling Alley,Shopping Mall,Supermarket,Liquor Store,Burger Joint,Playground,Pizza Place
3,Auburndale,Mobile Phone Shop,Train,Sushi Restaurant,Supermarket,Miscellaneous Shop,Bar,Korean Restaurant,Fast Food Restaurant,Furniture / Home Store,Toy / Game Store
4,Bay Terrace,Clothing Store,Women's Store,Cosmetics Shop,Donut Shop,American Restaurant,Lingerie Store,Shoe Store,Kids Store,Mobile Phone Shop,Deli / Bodega
...,...,...,...,...,...,...,...,...,...,...,...
76,Sunnyside Gardens,Bar,Grocery Store,Pizza Place,Coffee Shop,Korean Restaurant,Thai Restaurant,American Restaurant,Mexican Restaurant,Pharmacy,Turkish Restaurant
77,Utopia,Deli / Bodega,Basketball Court,Indie Movie Theater,History Museum,Spa,Bakery,Donut Shop,Automotive Shop,South American Restaurant,Afghan Restaurant
78,Whitestone,Dance Studio,Deli / Bodega,Bubble Tea Shop,Candy Store,Fish Market,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Food
79,Woodhaven,Pharmacy,Bank,Deli / Bodega,Donut Shop,Park,Sandwich Place,Nail Salon,Latin American Restaurant,Thai Restaurant,Fried Chicken Joint


In [99]:
neighborhoods_venues_sorted.iloc[47,]

Neighborhood                      Long Island City
1st Most Common Venue                        Hotel
2nd Most Common Venue                  Coffee Shop
3rd Most Common Venue                  Pizza Place
4th Most Common Venue                          Bar
5th Most Common Venue           Mexican Restaurant
6th Most Common Venue                         Café
7th Most Common Venue         Gym / Fitness Center
8th Most Common Venue                  Bus Station
9th Most Common Venue     Mediterranean Restaurant
10th Most Common Venue                  Steakhouse
Name: 47, dtype: object

## Cluster the Queens Borough using K-Means 

In [100]:
# set number of clusters
kclusters = 5

queens_grouped_clustering = queens_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=2).fit(queens_grouped_clustering)

# check cluster labels generated for each row in the dataframe
#kmeans.labels_[0:10] 
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,
       0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

## Dataframe that includes the cluster of each neighborhood

In [101]:
queens_merged = queens_data

# add clustering labels
queens_merged['Cluster Labels'] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
queens_merged = queens_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

queens_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Queens,Astoria,40.768509,-73.915654,0,Middle Eastern Restaurant,Bar,Hookah Bar,Mediterranean Restaurant,Greek Restaurant,Seafood Restaurant,Pizza Place,Bakery,Food Truck,Ice Cream Shop
1,Queens,Woodside,40.746349,-73.901842,0,Grocery Store,Thai Restaurant,Bakery,Filipino Restaurant,Latin American Restaurant,Deli / Bodega,American Restaurant,Pub,Donut Shop,Bar
2,Queens,Jackson Heights,40.751981,-73.882821,0,Latin American Restaurant,Peruvian Restaurant,South American Restaurant,Bakery,Thai Restaurant,Mexican Restaurant,Mobile Phone Shop,Grocery Store,Kids Store,Empanada Restaurant
3,Queens,Elmhurst,40.744049,-73.881656,0,Thai Restaurant,Mexican Restaurant,South American Restaurant,Vietnamese Restaurant,Pizza Place,Colombian Restaurant,Chinese Restaurant,Bar,Malay Restaurant,Gym / Fitness Center
4,Queens,Howard Beach,40.654225,-73.838138,0,Italian Restaurant,Pharmacy,Clothing Store,Sandwich Place,Fast Food Restaurant,Bagel Shop,Mexican Restaurant,Diner,Chinese Restaurant,Supermarket


In [102]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(queens_merged['Latitude'], queens_merged['Longitude'], queens_merged['Neighborhood'], queens_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Queens Cluster Neighborhood

Examine each cluster and determine the discriminating venue categories that distinguish each cluster

In [104]:
queens_cluster_0 = queens_merged.loc[queens_merged['Cluster Labels'] == 0, queens_merged.columns[[1] + list(range(4, queens_merged.shape[1]))]]

queens_cluster_1 = queens_merged.loc[queens_merged['Cluster Labels'] == 1, queens_merged.columns[[1] + list(range(4, queens_merged.shape[1]))]]

queens_cluster_2 = queens_merged.loc[queens_merged['Cluster Labels'] == 2, queens_merged.columns[[1] + list(range(4, queens_merged.shape[1]))]]

queens_cluster_3 = queens_merged.loc[queens_merged['Cluster Labels'] == 3, queens_merged.columns[[1] + list(range(4, queens_merged.shape[1]))]]

queens_cluster_4 = queens_merged.loc[queens_merged['Cluster Labels'] == 4, queens_merged.columns[[1] + list(range(4, queens_merged.shape[1]))]]

print("It's ready.")

It's ready.


In [105]:
queens_cluster_0

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Astoria,0,Middle Eastern Restaurant,Bar,Hookah Bar,Mediterranean Restaurant,Greek Restaurant,Seafood Restaurant,Pizza Place,Bakery,Food Truck,Ice Cream Shop
1,Woodside,0,Grocery Store,Thai Restaurant,Bakery,Filipino Restaurant,Latin American Restaurant,Deli / Bodega,American Restaurant,Pub,Donut Shop,Bar
2,Jackson Heights,0,Latin American Restaurant,Peruvian Restaurant,South American Restaurant,Bakery,Thai Restaurant,Mexican Restaurant,Mobile Phone Shop,Grocery Store,Kids Store,Empanada Restaurant
3,Elmhurst,0,Thai Restaurant,Mexican Restaurant,South American Restaurant,Vietnamese Restaurant,Pizza Place,Colombian Restaurant,Chinese Restaurant,Bar,Malay Restaurant,Gym / Fitness Center
4,Howard Beach,0,Italian Restaurant,Pharmacy,Clothing Store,Sandwich Place,Fast Food Restaurant,Bagel Shop,Mexican Restaurant,Diner,Chinese Restaurant,Supermarket
...,...,...,...,...,...,...,...,...,...,...,...,...
76,Middle Village,0,Cosmetics Shop,Park,Playground,Sports Bar,Liquor Store,Sushi Restaurant,Chinese Restaurant,Sandwich Place,Baseball Field,Bank
77,Malba,0,Vegetarian / Vegan Restaurant,Rock Club,Rest Area,Tennis Court,Empanada Restaurant,Event Space,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant
78,Hammels,0,Beach,Fried Chicken Joint,Gym / Fitness Center,Diner,Dog Run,Fast Food Restaurant,Shoe Store,Bus Stop,Bus Station,Building
79,Bayswater,0,Park,Playground,Fish & Chips Shop,Event Space,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish Market


In [106]:
queens_cluster_1

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,Glendale,1,Deli / Bodega,Food & Drink Shop,Brewery,Pizza Place,Arts & Crafts Store,Chinese Restaurant,Food Court,Food,Food Stand,Flower Shop


In [107]:
queens_cluster_2

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,East Elmhurst,2,Donut Shop,Ice Cream Shop,Rental Car Location,Lake,Coffee Shop,Gas Station,Bus Station,Supermarket,Snack Place,Hotel Bar
31,Jamaica Center,2,Mobile Phone Shop,Sandwich Place,Pizza Place,Department Store,Performing Arts Venue,Mexican Restaurant,Coffee Shop,Clothing Store,Caribbean Restaurant,Sporting Goods Shop
52,Floral Park,2,Indian Restaurant,Dosa Place,Grocery Store,Pizza Place,Basketball Court,Food & Drink Shop,Food,Flower Shop,Fish Market,Fish & Chips Shop


In [108]:
queens_cluster_3

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,Rego Park,3,Bakery,Restaurant,Bagel Shop,Donut Shop,Sandwich Place,Grocery Store,Sushi Restaurant,Pizza Place,Breakfast Spot,Martial Arts Dojo
43,Breezy Point,3,Monument / Landmark,Beach,Trail,Bus Stop,Fish Market,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Flower Shop


In [109]:
queens_cluster_4

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Forest Hills,4,Gym / Fitness Center,Gym,Yoga Studio,Pharmacy,Convenience Store,Park,Thai Restaurant,Pizza Place,Video Game Store,Gift Shop
69,Utopia,4,Deli / Bodega,Basketball Court,Indie Movie Theater,History Museum,Spa,Bakery,Donut Shop,Automotive Shop,South American Restaurant,Afghan Restaurant


# Result and Discussion<a id='results'></a>   

## Results  
**_Scarborough Borough in Toronto, Canada_**   
I use k-means to group the neighborhoods in Scarborough into 3 clusters.   
   - Cluster_0 has 15 neighborhoods and the most common venues are skating rinks, international cuisine restaurants and breakfast spots. 
   - Cluster 1 has 1 neighborhood 1 neighborhood , and the most common venues are pizza place and noodle house. 
   - Cluster 2 has 1 neighborhood, and the most common venues are Chinese restaurants and discount stores.  
   

   
**_Queens Borough in New York City_**  
I used k-means to group the Queens borough into 5 clusters.  
   - Cluster_0 has 81 neighborhoods and consist of many international cuisine restaurants and grocery stores. The most common venues are pizza places, deli, and Chinese restaurants.
   - Cluster_1 has 1 neighborhood and the most common venue is a dance studio.
   - Cluster_2 has 5 neighborhoods and the most common venue are donut shops and international cuisine restaurants. 
   - Cluster_3 has 2 neighborhoods and the most common venues are the beach and a bakery. 
   - Cluster_4 has 2 neighborhoods and the most common venues are gyms and donut shops.  
   
   ----
 
## Discussion
   - Toronto has 11 boroughs and 103 neighborhoods. The geographical coordinate of Toronto, Canada are 43.7170226, -79.4197830350134. In Scarborough borough, found 85 venues in 17 neighborhoods. In Scarborough borough, the neighborhoods with the most venues are L’Amoreaux West and Steeles West. There are 79 distinct venues in 50 categories.

   - New York City has 5 boroughs and 306 neighborhoods. The geographical coordinate of New York City are 40.7308619, -73.9871558. Foursquare found 2108 venues in 81 neighborhoods in Queens borough.  

   - Many of the neighborhoods are homogenous and are very similar to each other. Both Scarborough and Queens borough consist of neighborhood cluster that contain majority of the neighborhoods, and the remaining cluster had 1-5 neighborhoods. Queens borough had a significant more number of neighborhoods and venues than Scarborough. 

# Conclusion<a id='conclusion'></a>  

In conclusion, based on the quantity of venues and variety of venues, I would choose Queens over Scarborough as a choice to relocate the headquarters of the Fortune 500 company. Queens offer way more in choices for restaurants, gyms, grocery stores, and extracurricular activities for individuals and families of the company’s employees.