# Finding the Best Area to Start a Restaurant Business in San Francisco


This project offers a visual overview of the demographics and restaurant competition in each San Francisco neighborhood. Anyone who wants to open a restaurant in San Francisco can use the report or the interactive map in the Jupyter Notebook as a guide to finding a suitable location based on these two factors.

## Import Libraries
Here we import the necessary libraries.
* numpy, pandas -- Standard data analytics libraries
* geocoder -- Finding location data for neighborhoods
* requests -- HTTP requests
* matplotlib, seaborn -- Data visualization
* folium -- Creating the map



In [None]:
import numpy as np 
import pandas as pd 

!pip install geocoder
import geocoder

import requests

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")

import folium

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Data Acquisition and Cleaning
The project uses two datasets.
1. The demographics by neighborhood come from the San Francisco Planning Department (https://default.sfplanning.org/publications_reports/SF_NGBD_SocioEconomic_Profiles/2012-2016_ACS_Profile_Neighborhoods_Final.pdf). The relevant data have been extracted into a CSV file.
2. The restaurant competition data are compiled from the Foursquare API.


### Demographics Dataset
The demographics data is available in CSV format. We will read it into a pandas dataframe and clean it.

In [None]:
demographics = pd.read_csv('/kaggle/input/sf-demographics-data/SF Demographics Dataset.csv')

Let's take a quick look at the demographics dataset.

In [None]:
demographics.head()

Let's see what the datatype of each column is.

In [None]:
demographics.info() 

The dataset covers 41 San Francisco neighborhoods. Based on the column types shown above, we will fill missing values with 0 and convert the ethnicity columns to floats.


In [None]:
demographics = demographics.fillna(0)

In [None]:
demographics['White'] = demographics['White'].astype('float64')
demographics['Other/Two or More Races'] = demographics['Other/Two or More Races'].astype('float64')
demographics['% Latino (of Any Race)'] = demographics['% Latino (of Any Race)'].astype('float64')
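
As a minimal sketch of the same cleaning idea on made-up data, `pd.to_numeric` with `errors='coerce'` turns unparseable entries into NaN before they are filled with 0. The toy frame and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: percentages stored as strings, with gaps.
toy = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'White': ['45', None, '30.5'],
    '% Latino (of Any Race)': ['12', '8', 'n/a'],
})

pct_cols = ['White', '% Latino (of Any Race)']
# Coerce anything non-numeric to NaN, then fill the gaps with 0.
toy[pct_cols] = toy[pct_cols].apply(pd.to_numeric, errors='coerce').fillna(0)

print(toy.dtypes)
```

Unlike a plain `astype('float64')`, this variant does not raise on stray text such as `'n/a'`, which may or may not matter for the real CSV.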

Let's take a look at the dataset after cleaning.

In [None]:
demographics.describe()

We will perform more exploratory data analysis after compiling the restaurant data from Foursquare.

### Restaurant Competition Dataset

In order to use the Foursquare API to find restaurant data, we need to first find the longitude and latitude for each neighborhood.

In [None]:
neighborhoods = demographics['Neighborhood'].to_list()

longitude = []
latitude = []

for neighborhood in neighborhoods:
    
    # initialize the variable to None
    lat_lng_coords = None

    # loop until getting the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, San Francisco, California'.format(neighborhood))
        lat_lng_coords = g.latlng

    
    # Append the data to the lists
    latitude.append(lat_lng_coords[0])
    longitude.append(lat_lng_coords[1])
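
The loop above keeps retrying until coordinates come back, so it would not terminate if a lookup kept failing. A bounded retry is one possible safeguard. The sketch below is illustrative only: `retry`, `attempts`, and `delay` are hypothetical names, and the geocoder call is replaced by a stand-in so the example runs offline.

```python
import time

def retry(fetch, attempts=5, delay=0.0):
    """Call fetch() until it returns a non-None value or attempts run out."""
    for _ in range(attempts):
        result = fetch()
        if result is not None:
            return result
        time.sleep(delay)
    return None

# Stand-in for a geocoder call that fails twice before succeeding.
responses = iter([None, None, [37.77, -122.42]])
coords = retry(lambda: next(responses))
print(coords)
```

In the notebook, `fetch` could wrap the existing `geocoder.arcgis(...).latlng` lookup, and a `None` result would then need explicit handling.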

Next we create a pandas dataframe containing the location information for each neighborhood.

In [None]:
location = pd.DataFrame({'Neighborhood': neighborhoods, 'Latitude': latitude, 'Longitude': longitude})

Let's take a quick look at the location dataframe.

In [None]:
location.head()

Now let's get restaurant data within a 1-mile radius of each neighborhood's latitude and longitude. We have implemented a function **getNearbyVenues** to find the data via the Foursquare API and save it to a CSV file. To avoid sending API requests every time the notebook runs, we have commented out the function and will read the information from the CSV file instead.
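
`getNearbyVenues` uses `radius=1600` (meters), which is slightly under one mile (1609.344 m). As a rough sanity check, the haversine formula can estimate whether a venue falls inside that radius. This is a sketch with illustrative coordinates; `haversine_m` is a hypothetical helper and is not part of the notebook.

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

center = (37.7749, -122.4194)  # illustrative center point
venue = (37.7800, -122.4194)   # illustrative venue slightly to the north
d = haversine_m(*center, *venue)
print(d <= 1600)
```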

In [None]:
# Setting Foursquare credentials
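# Credentials are read from environment variables so they are not stored in the notebook.
# The variable names below are placeholders; set them to match your own setup.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '') # your Foursquare Secret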
VERSION = '20180605' # Foursquare API version

This is the function to get all venues near a neighborhood.

In [None]:
"""
def getNearbyVenues(names, latitudes, longitudes, radius=1600, LIMIT=300, categoryId='4d4b7105d754a06374d81259'):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            categoryId)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

venues = getNearbyVenues(names=location['Neighborhood'],
                                   latitudes=location['Latitude'],
                                   longitudes=location['Longitude']
                                  )"""

Let's read the venues CSV file into a pandas dataframe.

In [None]:
venues = pd.read_csv('/kaggle/input/sf-venues/SF venues.csv')
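
The "fetch once, then read from CSV" approach described above can be sketched as a small helper. `load_or_fetch` and `fake_fetch` are hypothetical names used for illustration; in the notebook, the fetch step would correspond to calling `getNearbyVenues`.

```python
import os
import tempfile
import pandas as pd

def load_or_fetch(path, fetch):
    """Return the cached CSV at `path`, calling fetch() only if it is missing."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = fetch()
    df.to_csv(path, index=False)
    return df

calls = []

def fake_fetch():
    # Stand-in for the API call; records how often it runs.
    calls.append(1)
    return pd.DataFrame({'Neighborhood': ['A'], 'Venue': ['Cafe X']})

cache_path = os.path.join(tempfile.mkdtemp(), 'venues.csv')
first = load_or_fetch(cache_path, fake_fetch)   # fetches and writes the CSV
second = load_or_fetch(cache_path, fake_fetch)  # reads the cached CSV
print(len(calls))
```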

Each row of the CSV file describes one restaurant. For this project, we need to group the restaurants by the neighborhood they are in, so let's first look at how many restaurants we have for each neighborhood.

In [None]:
venues.groupby('Neighborhood').count()

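Most of the neighborhoods have 100 data points, as the maximum number of results returned by the Foursquare API is 100. Counts for those neighborhoods are therefore capped and may understate the true number of restaurants.

Now let's convert the **venues** dataframe into a **venue_count** dataframe whose columns hold the number of restaurants of each type.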

In [None]:
venue_count = venues.groupby(['Neighborhood', 'Venue Category'])['Venue'].count()
venue_count = venue_count.unstack()
venue_count = venue_count.fillna(0)

In [None]:
venue_count.head()
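
To make the reshaping step concrete, here is a small sketch on made-up data. It applies the same `groupby` / `unstack` / `fillna` chain and compares the result with `pd.crosstab`, which produces the same counts in one call. The toy venues are hypothetical.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the venues dataframe.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'Venue': ['v1', 'v2', 'v3', 'v4', 'v5'],
    'Venue Category': ['Cafe', 'Cafe', 'Pizza Place',
                       'Pizza Place', 'Sushi Restaurant'],
})

# Same chain as in the notebook: one row per neighborhood, one column per category.
counts = (toy_venues.groupby(['Neighborhood', 'Venue Category'])['Venue']
          .count().unstack().fillna(0))

# Equivalent one-liner; missing combinations are already 0.
alt = pd.crosstab(toy_venues['Neighborhood'], toy_venues['Venue Category'])

print(counts)
```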

## Exploratory Data Analysis
As the project focuses on visualizing demographics and restaurant competition for each neighborhood, there is less need to draw insights from the datasets on their own. It is still interesting to examine them, especially the demographics dataset.
<br>
A couple of things of interest are the distribution of population, the distribution of median household income, and whether there is a correlation between median household income and the percentage of each race.


Here we use a bar plot to visualize each neighborhood's population.

In [None]:
plt.figure(figsize=(20,10))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally.
ax = sns.barplot(x=demographics['Neighborhood'], y=demographics['Total Population'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

Here we use a box plot to see the distribution of population and whether there are any outliers.

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x=demographics['Total Population'])
plt.show()
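
The box plot flags outliers visually. To list them by name, the usual 1.5 × IQR rule (the default whisker length in seaborn's box plot) can be applied directly. This is a sketch on made-up numbers; `iqr_outliers` is a hypothetical helper.

```python
import pandas as pd

def iqr_outliers(series):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical populations (in thousands) indexed by neighborhood label.
pop = pd.Series([10, 12, 11, 13, 12, 40], index=list('ABCDEF'))
print(iqr_outliers(pop))
```

On the real data, the series could be `demographics.set_index('Neighborhood')['Total Population']`.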

Here we use a bar plot to visualize each neighborhood's median household income.

In [None]:
plt.figure(figsize=(20,10))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally.
ax = sns.barplot(x=demographics['Neighborhood'], y=demographics['Median Household Income'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

Here we use a box plot to see the distribution of median household income and whether there are any outliers.

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x=demographics['Median Household Income'])
plt.show()

Here we use a heatmap to visualize the correlation between features.

In [None]:
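# Keep only numeric columns; the 'Neighborhood' text column cannot be correlated.
dem_corr = demographics.select_dtypes(include='number').corr()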

# Generate a mask for the upper triangle
mask = np.zeros_like(dem_corr, dtype=bool)  # np.bool was removed from NumPy
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(dem_corr, mask=mask, cmap=cmap, vmax=1, vmin=-1,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

The row of interest is Median Household Income against the race columns. We can see that a higher percentage of White households corresponds to a higher median household income, while a higher percentage of households of any other race correlates negatively with median household income.
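
The same row can be read off numerically instead of from the heatmap. The sketch below uses made-up values purely to show the mechanics; the numbers are hypothetical and say nothing about San Francisco. Note also that a correlation across neighborhoods describes an association, not a cause.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the demographics dataset.
toy = pd.DataFrame({
    'Median Household Income': [50, 60, 70, 80, 90],
    'White': [20, 30, 40, 50, 60],
    'Asian': [50, 45, 35, 30, 20],
})

# Correlation of every other column with income, sorted from most negative.
corr = (toy.corr()['Median Household Income']
        .drop('Median Household Income')
        .sort_values())
print(corr)
```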

## Mapping the Results
We will create the interactive map visualizing the relevant information here.


This is a function to create the messages used in the map.

In [None]:
import math

def get_info(venue_count, demographics):
    """Build the tooltip label and popup HTML for each neighborhood."""
    neighs = []
    infos = []
    dem_keys = ['Asian', 'Black/African American', 'White', 'Native American Indian',
                'Native Hawaiian/Pacific Islander', 'Other/Two or More Races',
                '% Latino (of Any Race)']
    # Iterate over demographics (the same order as the location dataframe) and
    # look up venue counts by neighborhood name rather than by row position,
    # because groupby sorts venue_count by neighborhood name.
    for _, row in demographics.iterrows():
        name = row['Neighborhood']
        neigh = "<b>" + name + "</b>"
        message = neigh
        message += '<br>Population: ' + str(row['Total Population'])
        message += '<br><br>Race (%):<ul> '
        for key in dem_keys:
            message += '<li>' + key + ': ' + str(row[key]) + '</li>'
        message += '</ul>'
        message += '<p style="width:200px"><i>Most common restaurant:</i></p><ol>'
        # Neighborhoods without any venue data get an empty list.
        if name in venue_count.index:
            top = venue_count.loc[name].sort_values(ascending=False).head(5)
            for category, count in top.items():
                message += '<li>' + category + ': ' + str(math.trunc(count)) + '</li>'
        message += '</ol>'
        neighs.append(neigh)
        infos.append(message)
    return neighs, infos
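
Because the popup text is assembled as raw HTML, a name containing characters such as `&` or `<` could break the markup. One possible safeguard is to escape the text first with the standard library's `html.escape`. This is a sketch; `popup_html` is a hypothetical helper and the sample values are made up.

```python
from html import escape

def popup_html(name, population, top_categories):
    """Build a small popup snippet with all text values HTML-escaped."""
    items = ''.join(
        '<li>{}: {}</li>'.format(escape(cat), int(n))
        for cat, n in top_categories
    )
    return '<b>{}</b><br>Population: {}<ol>{}</ol>'.format(
        escape(name), population, items)

html_snippet = popup_html('Mission & Bay', 1000, [('Cafe <new>', 3.0)])
print(html_snippet)
```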

Here we create the map with Folium.

In [None]:
m = folium.Map(
    location=[37.7749, -122.4194],
    zoom_start=12  
)

restaurants = folium.map.FeatureGroup()

for neighborhood, lat, lng in zip(location['Neighborhood'], location['Latitude'], location['Longitude']):
    restaurants.add_child(
        folium.vector_layers.CircleMarker(
            [lat, lng],
            radius=5, 
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6
        )
    )

latitudes = list(location['Latitude'])
longitudes = list(location['Longitude'])
neighs, infos = get_info(venue_count, demographics)



for lat, lng, neigh, info in zip(latitudes, longitudes, neighs, infos):
    folium.map.Marker([lat, lng], popup=folium.map.Popup(html=info, parse_html=False, max_width='300px'), tooltip=neigh).add_to(m)    
    
m.add_child(restaurants)

In [None]:
m.save('map.html') 