## Capstone Project - The Battle of Neighborhoods

### By
### Samira Gholizadeh

### I. Introduction

##### Location is one of the most common factors that can determine the success of a business. Hotels and cinemas, for example, often sit next to each other or close to highways, schools, or shopping malls. Where a business locates depends on its target customers and its purpose: should it be close to schools and highways, or far away from them? Hotels are the most convenient place for tourists and visitors to stay. Before making a reservation, some people look at price and location, while others look at the amenities.

### II. Business Problem

##### As the topic for our project should be related to the "battle of neighborhoods", I have decided to compare two boroughs of New York City. In this scenario, machine learning tools can help visitors find suitable hotels. The business problem we are posing is therefore: how can we help visitors reserve a suitable hotel in New York in the current uncertain economic and financial climate? And if an investor is going to build a new hotel, where is the best location?

##### To solve this business problem, we are going to cluster the neighborhoods of the Brooklyn and Queens boroughs of New York City in order to recommend venues and the current average price of hotels where visitors can reserve rooms. We will recommend profitable venues according to the surrounding amenities and essential facilities, such as bars, restaurants, and gyms and fitness centers.


### III. Data Section

##### Since we have already looked at the neighborhood data for New York and Toronto, I am going to base my project on comparing two boroughs of New York City. Besides the New York neighborhoods dataset, I will use the population and GDP per capita of each borough. I used the neighborhoods dataset provided in the lab to bring in the location (latitude, longitude) of each neighborhood in the two boroughs, and the Foursquare API to retrieve the venues that are plotted on the maps. Population and per-capita data were extracted from https://en.wikipedia.org/wiki/Boroughs_of_New_York_City.

### IV. Methodology Section

##### The Methodology section describes the main components of our analysis and prediction system. It comprises four stages:
1. Collect Inspection Data
2. Explore and Understand Data
3. Data preparation and preprocessing 
4. Modeling



#### 1. Collect Inspection Data

After importing the necessary libraries, we gather the data as follows:

In [1]:
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 


import requests # library to handle requests
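try:
    from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0)
except ImportError:
    from pandas.io.json import json_normalize # fallback for older pandas versions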

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Folium installed')
print('Libraries imported.')

I use the neighborhoods dataset provided in the lab to bring in the location (latitude, longitude) information for the two boroughs, and Foursquare to retrieve venue data. Let's first define the Foursquare credentials:

In [2]:
CLIENT_ID = '2TR1EPEGQC2C2DTPBKNBZ1DCZXQQDIAKU2DHRGXQWEE4HSBP' # your Foursquare ID
CLIENT_SECRET = 'CB2X3UTZBSL1TY3GTZ1KOBQWCQHIEOFO35ELJJCYPUZI4INH' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 300
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 2TR1EPEGQC2C2DTPBKNBZ1DCZXQQDIAKU2DHRGXQWEE4HSBP
CLIENT_SECRET: CB2X3UTZBSL1TY3GTZ1KOBQWCQHIEOFO35ELJJCYPUZI4INH
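Hardcoding credentials in a notebook exposes them to anyone who reads it. As a minimal sketch of an alternative, the values could be read from environment variables instead. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical and not part of the original notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names below are hypothetical; use whatever names
    suit your own setup.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of sending an empty key.
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

With the two variables exported in the shell that launches Jupyter, the cell above would reduce to `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()`.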


#### Then I have to collect the datasets from the source.

In [5]:
# source: https://en.wikipedia.org/wiki/Boroughs_of_New_York_City
# GDP appears to be in billions of US dollars and 'per capita' in US dollars

stac = {'Borough':["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"],
        'Population':[1471160, 2648771, 1664727, 2358582, 479458],
        'GDP':[28.787, 63.303, 629.682, 73.842, 11.249],
        'per capita':[19570, 23900, 378250, 31310, 23460],
        'square miles':[42.1, 70.82, 22.83, 108.53, 58.37],
        'persons /sq.mi':[34653, 37137, 72033, 21460, 8112]}
df_NYC = pd.DataFrame.from_dict(stac)

Before using the data, we have to explore and understand it.

#### 2. Explore and Understand Data

The following table shows the population, GDP, GDP per capita, area, and population density of each New York City borough.

In [6]:
df_NYC

| | Borough | Population | GDP | per capita | square miles | persons /sq.mi |
|---|---|---|---|---|---|---|
| 0 | Bronx | 1471160 | 28.787 | 19570 | 42.1 | 34653 |
| 1 | Brooklyn | 2648771 | 63.303 | 23900 | 70.82 | 37137 |
| 2 | Manhattan | 1664727 | 629.682 | 378250 | 22.83 | 72033 |
| 3 | Queens | 2358582 | 73.842 | 31310 | 108.53 | 21460 |
| 4 | Staten Island | 479458 | 11.249 | 23460 | 58.37 | 8112 |
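The columns of this table can be cross-checked arithmetically. The sketch below recomputes GDP per capita from the `GDP` and `Population` columns, assuming that `GDP` is expressed in billions of US dollars (an assumption; the notebook does not state the unit). Under that assumption the recomputed values should land within a few dollars of the `per capita` column, which looks rounded to the nearest ten.

```python
import pandas as pd

# Same figures as in df_NYC above.
df = pd.DataFrame({
    "Borough": ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"],
    "Population": [1471160, 2648771, 1664727, 2358582, 479458],
    "GDP": [28.787, 63.303, 629.682, 73.842, 11.249],  # assumed: billions of US dollars
    "per capita": [19570, 23900, 378250, 31310, 23460],
})

# Recompute GDP per capita and compare it with the published column.
df["recomputed"] = df["GDP"] * 1e9 / df["Population"]
df["difference"] = (df["recomputed"] - df["per capita"]).abs()
max_difference = df["difference"].max()
print(df[["Borough", "per capita", "recomputed", "difference"]])
```

A small `max_difference` supports the unit assumption; a large one would suggest a transcription error in one of the columns.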


#### Now I need to load and explore the neighborhood data. I first download it to the server and then load it.

In [7]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [8]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [9]:
neighborhoods_data = newyork_data['features']
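To make the structure of `neighborhoods_data` concrete, here is a minimal sketch of how one entry can be flattened. The `sample_feature` below is a hypothetical, trimmed-down feature that keeps only the fields this notebook reads (`properties.borough`, `properties.name`, `geometry.coordinates`); the real file carries more fields. The helper `feature_to_row` is illustrative and not part of the original notebook.

```python
# Hypothetical feature, trimmed to the fields used in this notebook.
sample_feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-73.847201, 40.894705]},
    "properties": {"name": "Wakefield", "borough": "Bronx"},
}

def feature_to_row(feature):
    """Flatten one GeoJSON feature into the four columns of interest."""
    # GeoJSON stores coordinates as [longitude, latitude], not the reverse.
    lon, lat = feature["geometry"]["coordinates"][:2]
    props = feature["properties"]
    return {
        "Borough": props["borough"],
        "Neighborhood": props["name"],
        "Latitude": lat,
        "Longitude": lon,
    }

row = feature_to_row(sample_feature)
print(row)
```

The coordinate order is the detail most worth checking: swapping latitude and longitude silently places every marker in the wrong hemisphere.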


#### 3. Data preparation and preprocessing

At this stage, we prepare our dataset for the modeling process, opting for the most suitable machine learning algorithm for our scope. Accordingly, we perform the following steps:

- Rename the columns
- Format the date column
- Drop boroughs whose population or GDP is too high or too low
- Select data for only two boroughs of New York City (Brooklyn and Queens)
- Define the search query
- Define the information of interest

In [10]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [11]:
# collect one row per neighborhood, then build the dataframe in a single step
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In [12]:
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


In [13]:
df_NYgeo = pd.merge(neighborhoods, df_NYC, on='Borough', how='left')
df_NYgeo.shape

(306, 9)
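The shape follows from the join: a left merge keeps every neighborhood row (306) and adds the five non-key borough columns to the four neighborhood columns (4 + 5 = 9). The toy example below sketches the same pattern on a few rows. The third neighborhood row and the trimmed statistics table are illustrative stand-ins, and the `validate` argument is an optional safeguard that the original cell does not use.

```python
import pandas as pd

# Toy neighborhood table; the Brooklyn row is a made-up placeholder.
neighborhoods = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Brooklyn"],
    "Neighborhood": ["Wakefield", "Co-op City", "Example Neighborhood"],
    "Latitude": [40.894705, 40.874294, 40.6501038],
    "Longitude": [-73.847201, -73.829939, -73.9495823],
})

# Trimmed borough statistics (two of the columns from df_NYC).
borough_stats = pd.DataFrame({
    "Borough": ["Bronx", "Brooklyn"],
    "Population": [1471160, 2648771],
    "GDP": [28.787, 63.303],
})

# validate="many_to_one" raises if a borough appears twice in borough_stats,
# which would otherwise silently duplicate neighborhood rows.
merged = pd.merge(neighborhoods, borough_stats, on="Borough",
                  how="left", validate="many_to_one")
print(merged.shape)
```

Here the result has 3 rows and 4 + 2 = 6 columns, mirroring the 306 x 9 result above.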

In [15]:
brooklyn_df = df_NYgeo.loc[df_NYgeo.Borough=="Brooklyn"].reset_index(drop=True)
brooklyn_df.shape

(70, 9)

In [16]:
address_1 = 'Brooklyn, NY'

geolocator_1 = Nominatim(user_agent="br_explorer")
location_1 = geolocator_1.geocode(address_1)
latitude_1 = location_1.latitude
longitude_1 = location_1.longitude
print('The geographical coordinates of Brooklyn are {}, {}.'.format(latitude_1, longitude_1))

The geographical coordinates of Brooklyn are 40.6501038, -73.9495823.


#### Let's define a query to search for hotels within 5000 metres, then build the corresponding URL.

In [17]:
search_query = 'hotel' 
radius = 5000
print(search_query + ' .... OK!')

hotel .... OK!


In [18]:
url_1 = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude_1, longitude_1, VERSION, search_query, radius, LIMIT)
url_1

'https://api.foursquare.com/v2/venues/search?client_id=2TR1EPEGQC2C2DTPBKNBZ1DCZXQQDIAKU2DHRGXQWEE4HSBP&client_secret=CB2X3UTZBSL1TY3GTZ1KOBQWCQHIEOFO35ELJJCYPUZI4INH&ll=40.6501038,-73.9495823&v=20180604&query=hotel&radius=5000&limit=300'
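String formatting works here because `hotel` contains no special characters, but a query with spaces or symbols would need URL encoding. As a minimal sketch, the same URL can be assembled with the standard library's `urlencode`. The helper name `build_search_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_search_url(client_id, client_secret, lat, lng, version, query, radius, limit):
    """Assemble a venue-search URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": f"{lat},{lng}",
        "v": version,
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

# Placeholder credentials; a two-word query shows the encoding at work.
url = build_search_url("ID", "SECRET", 40.6501038, -73.9495823,
                       "20180604", "boutique hotel", 5000, 300)
decoded = parse_qs(urlsplit(url).query)
print(url)
```

The `requests` library offers the same encoding through the `params=` argument of `requests.get`, which also keeps the credentials out of a hand-built string.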

#### Let's send the request, examine the result, and transform the JSON into a dataframe:

In [19]:
results_1 = requests.get(url_1).json()
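The request above assumes that the call succeeds. A failed call (an expired key or an exceeded quota, for example) would surface later as a confusing `KeyError`. The sketch below checks the payload before it is used. It assumes the v2 response layout with a `meta` object carrying `code` and `errorType` next to the `response` object; the helper `extract_venues` is hypothetical.

```python
def extract_venues(payload):
    """Return the venue list from a search payload, or raise on an error response."""
    meta = payload.get("meta", {})
    if meta.get("code") != 200:
        # Assumed layout: meta.code holds the status, meta.errorType the reason.
        reason = meta.get("errorType", "unknown error")
        raise ValueError(f"Foursquare request failed: {reason} ({meta.get('code')})")
    return payload.get("response", {}).get("venues", [])

# Hand-written payloads standing in for real API responses.
ok_payload = {"meta": {"code": 200}, "response": {"venues": [{"name": "Hotel RL Brooklyn"}]}}
bad_payload = {"meta": {"code": 429, "errorType": "quota_exceeded"}, "response": {}}

venues = extract_venues(ok_payload)
print(len(venues))
```

In the notebook this would read `venues_1 = extract_venues(results_1)`.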

In [20]:
# assign relevant part of JSON to venues
venues_1 = results_1['response']['venues']

# transform venues into a dataframe
dataframe_1 = json_normalize(venues_1)
dataframe_1.head()

| | id | name | categories | referralId | hasPerk | location.address | location.lat | location.lng | location.labeledLatLngs | location.distance | location.postalCode | location.cc | location.city | location.state | location.country | location.formattedAddress | location.crossStreet | venuePage.id | location.neighborhood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5df458112d5b1200073cdbaa | Brooklyn Vybe Hotel, Ascend Hotel Collection | [{'id': '4bf58dd8d48988d1fa931735', 'name': 'H... | v-1592143540 | False | 1024 Flatbush Avenue | 40.645992 | -73.958462 | [{'label': 'display', 'lat': 40.645992, 'lng':... | 878 | 11226 | US | Brooklyn | NY | United States | [1024 Flatbush Avenue, Brooklyn, NY 11226, Uni... | | | |
| 1 | 4bd72c97304fce72d32c33ab | Mayfair Hotel Jersey (Channel Islands) | [{'id': '4bf58dd8d48988d1fa931735', 'name': 'H... | v-1592143540 | False | Brooklyn St | 40.65 | -73.95 | [{'label': 'display', 'lat': 40.65, 'lng': -73... | 37 | 11226 | US | Jersey | NY | United States | [Brooklyn St, Jersey, NY 11226, United States] | | | |
| 2 | 58804440ef46947925418207 | Hotel RL Brooklyn | [{'id': '4bf58dd8d48988d1fa931735', 'name': 'H... | v-1592143540 | False | 1080 Broadway | 40.694416 | -73.930972 | [{'label': 'display', 'lat': 40.69441592010015... | 5176 | 11221 | US | Brooklyn | NY | United States | [1080 Broadway, Brooklyn, NY 11221, United Sta... | | | |
| 3 | 49c3c85df964a52076561fe3 | hotel le bleu | [{'id': '4bf58dd8d48988d1fa931735', 'name': 'H... | v-1592143540 | False | 370 4th Ave | 40.673079 | -73.987098 | [{'label': 'display', 'lat': 40.67307905184227... | 4071 | 11215 | US | Brooklyn | NY | United States | [370 4th Ave (5th St), Brooklyn, NY 11215, Uni... | 5th St | 60455706.0 | |
| 4 | 4fd6211a7b0c4fe0bedb171d | Hotel BPM Brooklyn | [{'id': '4bf58dd8d48988d1fa931735', 'name': 'H... | v-1592143540 | False | 139 33rd St | 40.656518 | -74.003472 | [{'label': 'display', 'lat': 40.65651805231051... | 4606 | 11232 | US | Brooklyn | NY | United States | [139 33rd St (at 4th Ave.), Brooklyn, NY 11232... | at 4th Ave. | 35280560.0 | |
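The `location.distance` column gives the distance in metres from the query point. It can be sanity-checked with the haversine formula, sketched below for the first row of the table. The function name `haversine_m` and the mean Earth radius of 6,371 km are assumptions of this sketch, so small deviations from the reported value are expected.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

centre = (40.6501038, -73.9495823)     # geocoded Brooklyn coordinates from above
first_hotel = (40.645992, -73.958462)  # row 0 of the table
distance = haversine_m(*centre, *first_hotel)
print(round(distance))
```

The result should fall within a few metres of the 878 reported for that row. Row 2 lists 5176, which is beyond the requested 5000, so an explicit filter such as `dataframe_1[dataframe_1['location.distance'] <= radius]` may be worth adding if the radius has to be strict.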


#### Define information of interest and filter dataframe:

In [21]:
# keep only columns that include venue name, and anything that is associated with location
filtered_columns_1 = ['name', 'categories'] + [col for col in dataframe_1.columns if col.startswith('location.')] + ['id']
dataframe_filtered_1 = dataframe_1.loc[:, filtered_columns_1].copy() # explicit copy avoids a SettingWithCopyWarning below

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
dataframe_filtered_1['categories'] = dataframe_filtered_1.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered_1.columns = [column_1.split('.')[-1] for column_1 in dataframe_filtered_1.columns]

#dataframe_filtered_1

In [22]:
dataframe_filtered_1.shape


(50, 16)

We can now proceed to the Modeling phase. I will analyze neighborhoods to recommend locations where an investor could build a hotel. I will then recommend profitable venues according to the surrounding amenities and essential facilities, such as bars, restaurants, and gyms and fitness centers.

#### 4. Modeling

After exploring the dataset and gaining insights into it, we are ready to analyze and visualize our data. 

In [23]:
venues_map_1 = folium.Map(location=[latitude_1, longitude_1], zoom_start=11) 

# add a red circle marker to represent Brooklyn
folium.features.CircleMarker(
    [latitude_1, longitude_1],
    radius=10,
    color='red',
    popup='Brooklyn',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map_1)

# add the hotel search results as blue circle markers
for lat, lng, label in zip(dataframe_filtered_1.lat, dataframe_filtered_1.lng, dataframe_filtered_1.categories):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map_1)

# display map
venues_map_1

In [24]:
#categories count
dataframe_filtered_1.groupby('categories').count()

| categories | name | address | lat | lng | labeledLatLngs | distance | postalCode | cc | city | state | country | formattedAddress | crossStreet | neighborhood | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Breakfast Spot | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Building | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| Convenience Store | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| Dive Bar | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| General Travel | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Gym / Fitness Center | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Historic Site | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Hookah Bar | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Hostel | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 2 |
| Hotel | 27 | 22 | 27 | 27 | 27 | 27 | 24 | 27 | 26 | 27 | 27 | 27 | 11 | 1 | 27 |
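`groupby('categories').count()` reports the number of non-null values in every column, which is why the same category shows different numbers across columns. When only the number of venues per category is needed, `value_counts()` gives a single figure. The sketch below uses a toy dataframe as a stand-in for `dataframe_filtered_1`, and the set of lodging categories is an assumption made for illustration, since a search for `hotel` also returns bars and shops.

```python
import pandas as pd

# Toy stand-in for dataframe_filtered_1; the rows are illustrative, not real results.
venues = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "categories": ["Hotel", "Hotel", "Hostel", "Dive Bar", "Hotel"],
    "address": ["1 Main St", None, "3 Side St", None, "5 High St"],
})

# One count per category, unaffected by missing values in other columns.
category_counts = venues["categories"].value_counts()

# Keep only lodging-type venues; this category list is an assumption.
LODGING = {"Hotel", "Hostel", "Motel", "Bed & Breakfast"}
lodging_only = venues[venues["categories"].isin(LODGING)]

print(category_counts)
print(len(lodging_only))
```

Filtering on category before mapping would keep non-lodging results, such as the breakfast spot or the convenience store in the table above, out of the hotel comparison.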


In [25]:
if len(results_1['response']['venues']) == 0:
    trending_venues_1_df = 'No trending venues are available at the moment!'
    
else:
    trending_venues_1 = results_1['response']['venues']
    trending_venues_1_df = json_normalize(trending_venues_1)

    # filter columns
    columns_filtered_1 = ['name', 'categories'] + ['location.distance', 'location.city', 'location.postalCode', 'location.state', 'location.country', 'location.lat', 'location.lng']
    trending_venues_1_df = trending_venues_1_df.loc[:, columns_filtered_1]

    # filter the category for each row
    trending_venues_1_df['categories'] = trending_venues_1_df.apply(get_category_type, axis=1)

  


In [26]:
if len(results_1['response']['venues']) == 0:
    trending_venues_map_1 = 'Cannot generate visual as no trending venues are available at the moment!'

else:
    trending_venues_map_1 = folium.Map(location=[latitude_1, longitude_1], zoom_start=15) # generate map centred around Brooklyn


    # add Brooklyn as a red circle mark
    folium.features.CircleMarker(
        [latitude_1, longitude_1],
        radius=10,
        popup='Brooklyn',
        fill=True,
        color='red',
        fill_color='red',
        fill_opacity=0.6
    ).add_to(trending_venues_map_1)

    # add the trending venues as blue circle markers
    for lat, lng, label in zip(trending_venues_1_df['location.lat'], trending_venues_1_df['location.lng'], trending_venues_1_df['name']):
        folium.features.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            fill=True,
            color='blue',
            fill_color='blue',
            fill_opacity=0.6
        ).add_to(trending_venues_map_1)

In [27]:
# display map
trending_venues_map_1

### Same process for Queens
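Because the Queens cells repeat the Brooklyn ones with different variable names, the shared steps could be folded into small helpers and called once per borough. The sketch below shows one such helper that flattens raw venue dictionaries into rows. The function names `venue_to_row` and `venues_to_rows` are hypothetical, and the sample venue is hand-written from values in the Brooklyn table, with the category name assumed to be `Hotel`.

```python
def venue_to_row(venue):
    """Flatten one venue dictionary into the fields used for mapping."""
    location = venue.get("location", {})
    categories = venue.get("categories", [])
    return {
        "name": venue.get("name"),
        # Use the first listed category, or None when the list is empty.
        "category": categories[0]["name"] if categories else None,
        "lat": location.get("lat"),
        "lng": location.get("lng"),
        "distance": location.get("distance"),
    }

def venues_to_rows(venues):
    """Apply venue_to_row to every venue in a search result."""
    return [venue_to_row(v) for v in venues]

# Hand-written sample based on the Brooklyn table; category name is assumed.
sample_venues = [
    {"name": "Hotel RL Brooklyn",
     "categories": [{"id": "4bf58dd8d48988d1fa931735", "name": "Hotel"}],
     "location": {"lat": 40.694416, "lng": -73.930972, "distance": 5176}},
    {"name": "Uncategorised place", "categories": [], "location": {}},
]
rows = venues_to_rows(sample_venues)
print(rows)
```

The same call then serves both boroughs, for example `pd.DataFrame(venues_to_rows(results_2['response']['venues']))`.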

In [29]:
queens_df = df_NYgeo.loc[df_NYgeo.Borough=="Queens"].reset_index(drop=True)
queens_df.shape

(81, 9)

In [30]:
address_2 = 'Queens, NY'

geolocator_2 = Nominatim(user_agent="qu_explorer")
location_2 = geolocator_2.geocode(address_2)
latitude_2 = location_2.latitude
longitude_2 = location_2.longitude
print('The geographical coordinates of Queens are {}, {}.'.format(latitude_2, longitude_2))

The geographical coordinates of Queens are 40.7498243, -73.7976337.


In [31]:
url_2 = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude_2, longitude_2, VERSION, search_query, radius, LIMIT)
url_2

'https://api.foursquare.com/v2/venues/search?client_id=2TR1EPEGQC2C2DTPBKNBZ1DCZXQQDIAKU2DHRGXQWEE4HSBP&client_secret=CB2X3UTZBSL1TY3GTZ1KOBQWCQHIEOFO35ELJJCYPUZI4INH&ll=40.7498243,-73.7976337&v=20180604&query=hotel&radius=5000&limit=300'

In [32]:
results_2 = requests.get(url_2).json()
#results_2

In [33]:
# assign relevant part of JSON to venues
venues_2 = results_2['response']['venues']

# transform venues into a dataframe
dataframe_2 = json_normalize(venues_2)
#dataframe_2.head()

In [34]:
# keep only columns that include venue name, and anything that is associated with location
filtered_columns_2 = ['name', 'categories'] + [col for col in dataframe_2.columns if col.startswith('location.')] + ['id']
dataframe_filtered_2 = dataframe_2.loc[:, filtered_columns_2].copy() # explicit copy avoids a SettingWithCopyWarning below

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
dataframe_filtered_2['categories'] = dataframe_filtered_2.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered_2.columns = [column_2.split('.')[-1] for column_2 in dataframe_filtered_2.columns]

#dataframe_filtered_2

In [35]:
venues_map_2 = folium.Map(location=[latitude_2, longitude_2], zoom_start=11) # generate map centred around Queens

# add a red circle marker to represent Queens
folium.features.CircleMarker(
    [latitude_2, longitude_2],
    radius=10,
    color='red',
    popup='Queens',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map_2)

# add the hotel search results as blue circle markers
for lat, lng, label in zip(dataframe_filtered_2.lat, dataframe_filtered_2.lng, dataframe_filtered_2.categories):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map_2)

# display map
venues_map_2

In [36]:
dataframe_filtered_2.groupby('categories').count()

| categories | name | address | lat | lng | labeledLatLngs | distance | postalCode | cc | city | state | country | formattedAddress | crossStreet | neighborhood | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apres Ski Bar | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Asian Restaurant | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Bed & Breakfast | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Building | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Convention Center | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| Event Space | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| Hotel | 29 | 25 | 29 | 29 | 29 | 29 | 27 | 29 | 28 | 29 | 29 | 29 | 8 | 1 | 29 |
| Hotel Bar | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Indian Restaurant | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 2 |
| Motel | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |


In [37]:
if len(results_2['response']['venues']) == 0:
    trending_venues_2_df = 'No trending venues are available at the moment!'
    
else:
    trending_venues_2 = results_2['response']['venues']
    trending_venues_2_df = json_normalize(trending_venues_2)

    # filter columns
    columns_filtered_2 = ['name', 'categories'] + ['location.distance', 'location.city', 'location.postalCode', 'location.state', 'location.country', 'location.lat', 'location.lng']
    trending_venues_2_df = trending_venues_2_df.loc[:, columns_filtered_2]

    # filter the category for each row
    trending_venues_2_df['categories'] = trending_venues_2_df.apply(get_category_type, axis=1)

  


In [38]:
if len(results_2['response']['venues']) == 0:
    trending_venues_map_2 = 'Cannot generate visual as no trending venues are available at the moment!'

else:
    trending_venues_map_2 = folium.Map(location=[latitude_2, longitude_2], zoom_start=11) # generate map centred around Queens


    # add Queens as a red circle mark
    folium.features.CircleMarker(
        [latitude_2, longitude_2],
        radius=10,
        popup='Queens',
        fill=True,
        color='red',
        fill_color='red',
        fill_opacity=0.6
    ).add_to(trending_venues_map_2)


    # add the trending venues as blue circle markers
    for lat, lng, label in zip(trending_venues_2_df['location.lat'], trending_venues_2_df['location.lng'], trending_venues_2_df['name']):
        folium.features.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            fill=True,
            color='blue',
            fill_color='blue',
            fill_opacity=0.6
        ).add_to(trending_venues_map_2)

In [39]:
# display map
trending_venues_map_2
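The business problem calls for clustering, and `KMeans` is imported at the top of the notebook, but no cell applies it yet. The sketch below shows how venue coordinates could be grouped with k-means. The coordinates are toy values placed near the two geocoded borough centres, `n_clusters=2` is an arbitrary choice for illustration, and treating latitude and longitude as plain Euclidean coordinates is only a rough approximation at city scale.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy (latitude, longitude) pairs: two tight groups near the borough centres.
coords = np.array([
    [40.650, -73.950], [40.651, -73.949], [40.649, -73.951],
    [40.750, -73.798], [40.751, -73.797], [40.749, -73.799],
])

# Fit k-means; random_state fixes the initialisation so runs are repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
labels = kmeans.labels_            # cluster index assigned to each point
centres = kmeans.cluster_centers_  # one (lat, lng) centre per cluster

print(labels)
print(centres)
```

On the real data, the input would be the `lat` and `lng` columns of `dataframe_filtered_1` and `dataframe_filtered_2`, and the labels could drive the marker colours on the folium maps.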