<h1 align='center'> Finding the Top Neighborhoods to Open a Restaurant in NYC</h1>

# 1. Introduction

New York City, often called 'The City' or simply 'New York', is the most populous city in the United States. It has been described as the cultural, financial, and media capital of the world, with significant influence on commerce, entertainment, research, technology, education, politics, tourism, art, fashion, and sports. With an estimated 2019 population of 8,336,817 spread over about 302.6 square miles, New York is also the most densely populated major city in the United States.

In a city this dense, restaurants are found in almost every neighborhood, which makes it difficult for a prospective owner to decide where to open a new one.

In this project, we use data from several online sources to find the neighborhoods that offer a stakeholder the best prospects for business and footfall when opening a restaurant.

# 2. Data

## 2.1 Source

To carry out this project we need to gather the required data from the internet. The more complete the data, the more reliable our ranking of New York neighborhoods will be.

One of the main datasets we will be using is the neighborhood dataset for New York City. We will use <a href='https://cocl.us/new_york_dataset'>this</a> NYC dataset to get all the data we need for NYC Neighborhoods and Boroughs.

We will also use the <a href='https://developer.foursquare.com/'>Foursquare API</a> to get data on the restaurants (venues) in each neighborhood.

## 2.2 Acquisition

**Let's fetch and load the NYC dataset with all the Neighborhoods and Boroughs in NYC**

In [1]:
import json # importing a library required for handling .json files
import pandas as pd # importing a library useful for handling data with DataFrames

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset # Fetching the data from a remote cloud server
with open('newyork_data.json') as json_data: # Opening the .json file and loading the data into a variable
    newyork_data = json.load(json_data)

neighborhoods_data = newyork_data['features'] # All the data we need is stored under the 'features' key of the raw JSON

column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] # Defining the columns of the DataFrame

rows = [] # Collecting one dict per neighborhood (DataFrame.append was removed in pandas 2.0)
for data in neighborhoods_data: # Looping through the data points in our dataset
    neighborhood_latlon = data['geometry']['coordinates'] # GeoJSON stores coordinates as [longitude, latitude]
    rows.append({'Borough': data['properties']['borough'],
                 'Neighborhood': data['properties']['name'],
                 'Latitude': neighborhood_latlon[1],
                 'Longitude': neighborhood_latlon[0]})

neighborhoods = pd.DataFrame(rows, columns=column_names) # Storing the rows in a pandas DataFrame for easier analysis and usage


print('New York City Neighborhood Data Loaded with {} boroughs and {} neighborhoods!'.format(len(neighborhoods['Borough'].unique()),neighborhoods.shape[0]))

New York City Neighborhood Data Loaded with 5 boroughs and 306 neighborhoods!


In [3]:
# Check what the dataset looks like
print(neighborhoods.shape)
neighborhoods.head()

(306, 4)


| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


**Now that we have the NYC dataset loaded, we need to use the Foursquare API to fetch the venues in each NYC neighborhood. Here we create a function that returns up to 100 venues within a 500-meter radius of a neighborhood's coordinates; it can be called from wherever we need it.**

In [4]:
import requests

In [5]:
# The code was removed by Watson Studio for sharing.
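
The hidden cell above is where `CLIENT_ID`, `CLIENT_SECRET`, and `VERSION` are defined. One way to define them without putting secrets in a shared notebook is to read them from environment variables. This is only a sketch: the helper `load_foursquare_credentials`, the variable names `FOURSQUARE_CLIENT_ID`, `FOURSQUARE_CLIENT_SECRET`, and `FOURSQUARE_VERSION`, and the default version date are assumptions for illustration, not the contents of the removed cell.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read Foursquare credentials from a mapping of environment variables.

    The variable names are hypothetical; use whatever names you export.
    """
    required = ('FOURSQUARE_CLIENT_ID', 'FOURSQUARE_CLIENT_SECRET')
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    # The v2 API takes a YYYYMMDD version date; this default is only a placeholder
    version = env.get('FOURSQUARE_VERSION', '20180605')
    return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET'], version
```

With the variables exported, `CLIENT_ID, CLIENT_SECRET, VERSION = load_foursquare_credentials()` would provide the three names that `getNearbyVenues` expects.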

In [6]:
# Creating a function for retrieving the top 100 venues in every neighborhood, which can be used when required
LIMIT = 100
radius = 500
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        venues_list.append([(
            name, 
            lat,
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [7]:
# Using the function we wrote above to retrieve the venue data for each neighborhood and store it in a new pandas DataFrame
newyork_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude'])
newyork_venues.head() # Check what the dataset looks like

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Wakefield | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop |
| 1 | Wakefield | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop |
| 2 | Wakefield | 40.894705 | -73.847201 | Walgreens | 40.896528 | -73.8447 | Pharmacy |
| 3 | Wakefield | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy |
| 4 | Wakefield | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop |
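
`getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned with an empty category list. The helper below is a minimal sketch of a more defensive flattening step. The name `flatten_venue_items` and the `'Uncategorized'` fallback are hypothetical, and the sample items are made up to mimic the response shape the function above already assumes.

```python
def flatten_venue_items(name, lat, lng, items):
    """Turn a list of venue items into flat tuples, tolerating missing fields."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Fall back to a placeholder label when a venue has no category
        category = categories[0]['name'] if categories else 'Uncategorized'
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows

# Made-up items for illustration only
sample_items = [
    {'venue': {'name': 'Example Diner',
               'location': {'lat': 40.89, 'lng': -73.84},
               'categories': [{'name': 'Diner'}]}},
    {'venue': {'name': 'No Category Place',
               'location': {'lat': 40.90, 'lng': -73.85},
               'categories': []}},
]
rows = flatten_venue_items('Wakefield', 40.894705, -73.847201, sample_items)
```

The tuples keep the same field order as the columns assigned in `getNearbyVenues`, so they could feed the same DataFrame construction.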


##  2.3 Cleaning

**The NYC dataset used in this project is accurate, clean, and well structured, so it doesn't require any cleaning (such as removing missing values). We use the dataset as is.**

**The Foursquare API is designed to return clean, accurate, and varied data for its users. Hence, the Foursquare data does not require any cleaning either.**
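
The claim that no cleaning is needed can be checked rather than assumed. The sketch below counts missing values, fully duplicated rows, and repeated neighborhood names. The helper name `summarize_quality` is hypothetical, and the small frame uses made-up coordinates with the same columns as `neighborhoods`.

```python
import pandas as pd

def summarize_quality(df, key='Neighborhood'):
    """Return simple data-quality counts for a neighborhood table."""
    repeated = df.loc[df[key].duplicated(keep=False), key].unique()
    return {
        'missing_values': int(df.isna().sum().sum()),
        'duplicate_rows': int(df.duplicated().sum()),
        'repeated_names': sorted(repeated),
    }

# Toy data for illustration; coordinates are approximate placeholders
toy = pd.DataFrame({
    'Borough': ['Manhattan', 'Queens', 'Bronx'],
    'Neighborhood': ['Murray Hill', 'Murray Hill', 'Wakefield'],
    'Latitude': [40.748, 40.764, 40.895],
    'Longitude': [-73.978, -73.813, -73.847],
})
report = summarize_quality(toy)
```

Applied to `neighborhoods`, a non-empty `repeated_names` list would mean that the name alone does not identify a row, which matters for any later step that joins on `Neighborhood`.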

# 3. Methodology & Code

**Let's first visualize the data we just acquired so we know what we are dealing with. We will use Folium, a library that helps us build interactive maps in Python.**

In [12]:
#!pip install folium
import folium

In [13]:
# Using the geo coordinates of New York City
latitude = 40.7127281
longitude = -74.0060152

# Creating a Folium Map object
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=11)

# Adding markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

**Now we need to use the Foursquare API data to get the number of restaurants in every neighborhood.**

In [14]:
newyork_venues = newyork_venues[newyork_venues['Venue Category'].str.contains('Restaurant')] # Keeping only the venues whose category contains 'Restaurant'
newyork_venues = pd.DataFrame(newyork_venues['Neighborhood'].value_counts().reset_index().values, columns=["Neighborhood", "Restaurants"]) # Creating a dataframe with the number of restaurants in every neighborhood
newyork_venues.head()

| | Neighborhood | Restaurants |
|---|---|---|
| 0 | Murray Hill | 47 |
| 1 | Jackson Heights | 39 |
| 2 | East Village | 37 |
| 3 | Astoria | 37 |
| 4 | Greenwich Village | 37 |


**We will now merge this restaurants data with our NYC dataset to get a complete dataset with neighborhoods and the number of restaurants they have.**

In [15]:
neighborhoods_data = pd.merge(neighborhoods, newyork_venues, how='outer') # Merging both datasets, making sure we don't drop rows for which there is no restaurant information
neighborhoods_data.fillna(0, inplace=True) # Replacing the null values created in the merge with 0
neighborhoods_data.head() # Checking what the final dataset looks like

| | Borough | Neighborhood | Latitude | Longitude | Restaurants |
|---|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 | 0 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 | 2 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 | 6 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 | 0 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 | 0 |
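
Because the restaurant counts are keyed on the neighborhood name only, two neighborhoods that share a name in different boroughs receive the same count after the merge. One possible way around this, sketched below, is to count and merge on the name plus the neighborhood coordinates that `getNearbyVenues` already returns. The frames here are toy data with made-up values, and this is an alternative to the cells above rather than what the notebook ran.

```python
import pandas as pd

# Toy stand-ins for `neighborhoods` and the raw venue table
hoods = pd.DataFrame({
    'Borough': ['Manhattan', 'Queens'],
    'Neighborhood': ['Murray Hill', 'Murray Hill'],
    'Latitude': [40.748, 40.764],
    'Longitude': [-73.978, -73.813],
})
venues = pd.DataFrame({
    'Neighborhood': ['Murray Hill'] * 3,
    'Neighborhood Latitude': [40.748, 40.748, 40.764],
    'Neighborhood Longitude': [-73.978, -73.978, -73.813],
    'Venue Category': ['Thai Restaurant', 'Korean Restaurant', 'Korean Restaurant'],
})

keys = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude']
is_restaurant = venues['Venue Category'].str.contains('Restaurant')
# Count restaurants per (name, latitude, longitude) instead of per name
counts = (venues[is_restaurant]
          .groupby(keys).size().reset_index(name='Restaurants')
          .rename(columns={'Neighborhood Latitude': 'Latitude',
                           'Neighborhood Longitude': 'Longitude'}))

# A left merge keeps every neighborhood, including those without restaurants
merged = hoods.merge(counts, on=['Neighborhood', 'Latitude', 'Longitude'], how='left')
merged['Restaurants'] = merged['Restaurants'].fillna(0).astype(int)
```

In this toy example the Manhattan row gets a count of 2 and the Queens row a count of 1, instead of both rows sharing a combined total.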


**With this full-fledged dataset we can now find the top neighborhoods for a stakeholder to open a restaurant**

In [16]:
top_neighborhoods = neighborhoods_data.sort_values(by='Restaurants', ascending=False).reset_index()
top_neighborhoods[['Borough','Neighborhood', 'Restaurants']].head()

| | Borough | Neighborhood | Restaurants |
|---|---|---|---|
| 0 | Manhattan | Murray Hill | 47 |
| 1 | Queens | Murray Hill | 47 |
| 2 | Queens | Jackson Heights | 39 |
| 3 | Manhattan | East Village | 37 |
| 4 | Manhattan | Greenwich Village | 37 |
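
`head()` returns exactly five rows, so a neighborhood tied for the last place can be cut off. A sketch of a tie-aware alternative uses `DataFrame.nlargest` with `keep='all'`. The frame below reuses the counts reported in this notebook and is built by hand only to keep the example self-contained.

```python
import pandas as pd

ranked = pd.DataFrame({
    'Borough': ['Manhattan', 'Queens', 'Queens', 'Manhattan', 'Queens', 'Manhattan', 'Bronx'],
    'Neighborhood': ['Murray Hill', 'Murray Hill', 'Jackson Heights', 'East Village',
                     'Astoria', 'Greenwich Village', 'Eastchester'],
    'Restaurants': [47, 47, 39, 37, 37, 37, 6],
})

# keep='all' retains every row tied with the fifth-largest value
top = ranked.nlargest(5, 'Restaurants', keep='all')
```

Here `top` has six rows because three neighborhoods share the count of 37. If `Restaurants` is stored with an object dtype, cast it with `astype(int)` before calling `nlargest`.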


**Now that we have found the neighborhoods with the most restaurants, let's visualize the merged dataset on a real NYC map to get a sense of where they are located.**

In [28]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=11)

max_restaurants = neighborhoods_data['Restaurants'].max()
color_divs = [0, max_restaurants/4, (max_restaurants/4)*2, (max_restaurants/4)*3, max_restaurants]

# Adding markers to map
for lat, lng, borough, neighborhood, restaurants in zip(neighborhoods_data['Latitude'], neighborhoods_data['Longitude'], neighborhoods_data['Borough'], neighborhoods_data['Neighborhood'], neighborhoods_data['Restaurants']):
    label = '{}, {}: {}'.format(neighborhood, borough, restaurants)
    # Assigning the color value depending on the number of restaurants in a neighborhood
    if restaurants >= color_divs[0] and restaurants < color_divs[1]:
        colorval = 'blue'
        fillcolor = 'lightblue'
    elif restaurants >= color_divs[1] and restaurants < color_divs[2]:
        colorval = 'green'
        fillcolor = 'lightgreen'
    elif restaurants >= color_divs[2] and restaurants < color_divs[3]:
        colorval = 'orange'
        fillcolor = '#ffb347'
    else:
        colorval = 'red'
        fillcolor = '#ffcccb'
        if restaurants==max_restaurants:
            folium.Marker([lat, lng], popup=label, icon=folium.Icon(color=colorval)).add_to(map_newyork)
        else:
            folium.Marker([lat, lng], popup=label).add_to(map_newyork)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color= colorval,
        popup=label,
        fill=True,
        fill_color= fillcolor,
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork
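
The `if`/`elif` chain in the cell above assigns each neighborhood to one of four equal-width bins between 0 and the maximum count. The same mapping can be sketched as a small helper, which makes the thresholds easier to test on their own. The name `density_colors` is hypothetical, and the palette copies the colors used above.

```python
def density_colors(restaurants, max_restaurants):
    """Return (outline color, fill color) for a restaurant count."""
    palette = [('blue', 'lightblue'), ('green', 'lightgreen'),
               ('orange', '#ffb347'), ('red', '#ffcccb')]
    if max_restaurants <= 0:
        # No restaurants anywhere: everything falls in the lowest bin
        return palette[0]
    step = max_restaurants / 4
    # Counts at or above three quarters of the maximum share the top bin
    index = min(int(restaurants // step), 3)
    return palette[index]

colors_for_max = density_colors(47, 47)
colors_for_zero = density_colors(0, 47)
```

Inside the marker loop, `colorval, fillcolor = density_colors(restaurants, max_restaurants)` could replace the color assignments, leaving the `folium.Marker` logic for the top bin unchanged.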

# 4. Results & Discussion

Our analysis shows that although there are restaurants in almost every neighborhood in NYC, most neighborhoods have low restaurant density and only a few have medium density. The highest concentrations of restaurants were found in the following neighborhoods, listed in decreasing order of the number of restaurants:
1. Murray Hill, Manhattan, 47 restaurants
2. Murray Hill, Queens, 47 restaurants
3. Jackson Heights, Queens, 39 restaurants
4. Astoria, Queens, 37 restaurants
5. East Village, Manhattan, 37 restaurants
6. Greenwich Village, Manhattan, 37 restaurants

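From the results we can state that all the neighborhoods with high restaurant density are located in the Manhattan and Queens boroughs. This suggests that a stakeholder is likely to see high footfall if they open a new restaurant in any of these neighborhoods, although restaurant count is only a proxy for demand. Note that the two Murray Hill entries share the same count because restaurants were counted by neighborhood name alone, so the figure of 47 is not split between the Manhattan and Queens neighborhoods.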

Manhattan and Queens also have numerous neighborhoods that have average to medium restaurant density.

Low restaurant density was found in a majority of neighborhoods in NYC. The results therefore also help stakeholders avoid opening their restaurants in areas with low density.

# 5. Conclusion

The purpose of this project was to identify neighborhoods in NYC with high restaurant density in order to help stakeholders narrow down the search for an optimal location for a new restaurant. By using the NYC neighborhoods and boroughs dataset along with data from the Foursquare API, we were able to analyze the neighborhoods and find those with the most restaurants.

Future work on this project could include factors such as the attractiveness of each location (proximity to a park or water), noise levels, proximity to major roads, real estate availability and prices, and the social and economic dynamics of each neighborhood, to help stakeholders decide on the optimal location for their restaurant.