 # Capstone project: Battle of the neighborhoods

### Recommendation of a new office location for food delivery industry in Seattle

## Introduction

Food delivery industry is very competitive these days and the delivery time matters most for customers satisfaction. Besides, food must be kept at a consistent temperature to stay fresh and tasty. Choosing the right location of the office or distribution center for efficiency and fast delivery of the food is crucial to the business. It is best for the stakeholder of this project to start from a small area and contact local popular restaurants as much as possible and gradually increase the restaurant base in new areas. In order to get more food delivery orders and be competitive, the office needs to be close to most of the popular restaurants among the neighborhoods as much as it can

This project would help the stakeholders to make a better decision on choosing the location of the new office in Seattle. The new office can enhance the efficiency of the delivery system and cut-down the operating costs to minimal.

## Data source


The project will make use of the following data sources:

#### Postal codes of different neighborhoods in Seattle 
Postal codes in Seattle will retrieved from this website
http://seattlearea.com/zip-codes/

#### Neighborhood location data retrieved using Google maps API
Location coordinates of different neighborhoods will be returned using Google API

#### Finding popular venues in different neighborhoods from FourSquare API
I will use the FourSquare API to explore different neighborhoods in Seattle.  The Foursquare explore function will be used to get the most popular venues in each neighborhood.

## METHODOLOGY

The postal codes of different neighborhoods in Seattle will be collected by Web scraping(HTTP request) in this web page "http://seattlearea.com/zip-codes/". Data will be cleaned and selected by Beautiful Soup library. After that, the geographical coordinates, latitude, and longitude will be retrieved by using the Google API. The neighborhood with name, postal code, geographical coordinates will be regrouped and cleaned by Pandas library.  Then FourSquare API will be called to retrieve all the venues in different neighborhoods. Finally, the machine learning algorithm K-means clustering would be applied to form the clusters of different venues categories for all the neighborhoods. 

### First of all, we need to import all necessary python library

In [1]:
from bs4 import BeautifulSoup
# import urllib.request as request
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

#  libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

'conda' is not recognized as an internal or external command,
operable program or batch file.
'conda' is not recognized as an internal or external command,
operable program or batch file.


Folium installed
Libraries imported.


### Get all the zip codes in seattle area

In [2]:
link = "http://seattlearea.com/zip-codes/"

In [3]:
def get_table_data(tableClassname, cols):
    
    custom_header = {}
    custom_header['User-Agent'] = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"

    html = requests.get(link, headers=custom_header).text


    soup  = BeautifulSoup(html, 'html.parser')
    raw_data = soup.find('div',attrs={'class': tableClassname})


    return raw_data

In [4]:
raw_data = get_table_data('entry-content', 1)
print (raw_data)

<div class="entry-content">
<br/>
<script type="text/javascript">
					google_ad_client = "ca-pub-0171608308593916";	
					google_ad_slot = "0003693249";
					google_ad_width = 336;
					google_ad_height = 280;
					google_color_link="2f86cd";</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js" type="text/javascript"></script>
<br/>
<br/><div><b>Neighborhood Zipcodes</b></div><div>» 98003 - Federal Way</div><div>» 98005 - Bellevue</div><div>» 98033 - Kirkland</div><div>» 98037 - Lynnwood</div><div>» 98040 - Mercer Island</div><div>» 98052 - Redmond</div><div>» 98055 - Renton</div><div>» 98101 - Seattle</div><div>» 98101 - Downtown</div><div>» 98102 - Capital Hill</div><div>» 98103 - Greenwood</div><div>» 98103 - Freemont</div><div>» 98103 - Greenlake</div><div>» 98104 - International District</div><div>» 98104 - Pioneer Square</div><div>» 98105 - University District</div><div>» 98105 - Laurelhurst</div><div>» 98107 - Ballard</div><div>» 98109 - South Lake Uni

In [5]:
# Reformat the data

line_data = []

for i in raw_data:
    line_data.append(i)

neighborhood = []
zipcode = []
# line_data

# Exclude the first 11 lines
for b in line_data[11:]:
    try:
        k = b.text.split()
#         print(k)
        zipcode.append(k[1])
        if len(k) == 5:
            neighborhood.append(k[3]+' '+k[4])
        else:
            neighborhood.append(k[3])
                
    except:
           pass
        

df=pd.DataFrame({'PostalCode':zipcode, 'Neighborhood':neighborhood})

In [6]:
df

Unnamed: 0,PostalCode,Neighborhood
0,98003,Federal Way
1,98005,Bellevue
2,98033,Kirkland
3,98037,Lynnwood
4,98040,Mercer Island
5,98052,Redmond
6,98055,Renton
7,98101,Seattle
8,98101,Downtown
9,98102,Capital Hill


### Retrieve the data by Google API

In [7]:
# The code was removed by Watson Studio for sharing.

In [8]:
lats = [] # collection the latitudes
lngs = [] # collection the longitudes

for i in df['PostalCode']: 
    try:
        
        url ="https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}".format(API_key,i)
        response = requests.get(url).json() 
#         print ("This is the response: ", response)
        geographical_data = response['results'][0]['geometry']['location'] 
        lats.append(geographical_data['lat'])
        lngs.append(geographical_data['lng'])
    except:
        pass
    
df['Latitude']=lats 

df['Longitude']=lngs 


ValueError: Length of values does not match length of index

In [9]:
df.reset_index(drop=True)

Unnamed: 0,PostalCode,Neighborhood
0,98003,Federal Way
1,98005,Bellevue
2,98033,Kirkland
3,98037,Lynnwood
4,98040,Mercer Island
5,98052,Redmond
6,98055,Renton
7,98101,Seattle
8,98101,Downtown
9,98102,Capital Hill


In [14]:
df.to_csv('seattle_lat_lng_neighborhood.csv')

#### Use geopy library to get the latitude and longitude values of Seattle
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent seattle_explorer, as shown below.

In [15]:
address = 'Seattle, Washington'

geolocator = Nominatim(user_agent="seattle_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Seattle are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Seattle are 47.6038321, -122.3300624.


#### Let's find out how many unique categories can be curated from all the returned venues

### Define Foursquare Credentials and Version

In [36]:
# The code was removed by Watson Studio for sharing.

Your credentails:
CLIENT_ID: VIP
CLIENT_SECRET:TOP SECRET


##  Explore Neighborhoods in Seattle

#### Create a function to get the top 100 venues in each neighborhoods  within a radius of 1000 meters in Seattle

In [18]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 1000 # define radius to 1000m

def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

### Run the above function on each neighborhood and create a new dataframe called seattle_venues.

In [20]:
seattle_data = pd.read_csv('seattle_lat_lng_neighborhood.csv')
seattle_venues = getNearbyVenues(names=seattle_data['Neighborhood'],
                                   latitudes=seattle_data['Latitude'],
                                   longitudes=seattle_data['Longitude']
                                  )


Federal Way
Bellevue
Kirkland
Lynnwood
Mercer Island
Redmond
Renton
Seattle
Downtown
Capital Hill
Greenwood
Freemont
Greenlake
International District
Pioneer Square
University District
Laurelhurst
Ballard
South
Queen Anne
Bainbridge Island
Madrona
West Seattle
Alki Beach
Columbia City
Belltown
Northgate
Mount Baker
Magnolia


#### The size of the resulting dataframe

In [21]:
seattle_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alki Beach,62,62,62,62,62,62
Bainbridge Island,1,1,1,1,1,1
Ballard,100,100,100,100,100,100
Bellevue,39,39,39,39,39,39
Belltown,100,100,100,100,100,100
Capital Hill,82,82,82,82,82,82
Columbia City,19,19,19,19,19,19
Downtown,100,100,100,100,100,100
Federal Way,100,100,100,100,100,100
Freemont,100,100,100,100,100,100


### Find out how many unique categories can be curated from all the returned venues

In [22]:
print('There are {} uniques categories.'.format(len(seattle_venues['Venue Category'].unique())))

There are 258 uniques categories.


### The name list of all Venue Categories

In [23]:
 seattle_venues['Venue Category'].unique()

array(['Gym', 'Liquor Store', 'Japanese Restaurant',
       'Fast Food Restaurant', 'Bookstore', 'Mexican Restaurant',
       'Soup Place', 'Arts & Crafts Store', 'Coffee Shop',
       'Gym / Fitness Center', 'Pet Store', 'Chinese Restaurant',
       'Grocery Store', 'Thai Restaurant', 'Café', 'Bakery',
       'Salon / Barbershop', 'Diner', 'Miscellaneous Shop', 'Gun Range',
       'Korean Restaurant', 'Vietnamese Restaurant', 'Cosmetics Shop',
       'Donut Shop', 'Spa', 'Pizza Place', 'Weight Loss Center',
       'Mobile Phone Shop', 'Ice Cream Shop', 'Video Game Store',
       'Shipping Store', 'Optical Shop', 'Gas Station', 'Gift Shop',
       'Ramen Restaurant', 'ATM', 'Shoe Store', 'Sandwich Place',
       'Sports Bar', 'Playground', 'Fabric Shop', 'American Restaurant',
       'Sporting Goods Shop', 'Furniture / Home Store',
       'Italian Restaurant', 'Electronics Store', 'Fried Chicken Joint',
       'Bank', 'Lighting Store', 'Hawaiian Restaurant', 'Clothing Store',
       'B

## Analyze Each Neighborhood

In [24]:
# one hot encoding
seattle_onehot = pd.get_dummies(seattle_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
seattle_onehot['Neighborhood'] = seattle_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [seattle_onehot.columns[-1]] + list(seattle_onehot.columns[:-1])
seattle_onehot = seattle_onehot[fixed_columns]

seattle_onehot.head()

Unnamed: 0,Zoo Exhibit,ATM,Accessories Store,Advertising Agency,Airport,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Vietnamese Restaurant,Warehouse Store,Watch Shop,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Winery,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [26]:
seattle_onehot.shape

(1860, 258)

Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [27]:
seattle_grouped = seattle_onehot.groupby('Neighborhood').mean().reset_index()
seattle_grouped

Unnamed: 0,Neighborhood,Zoo Exhibit,ATM,Accessories Store,Advertising Agency,Airport,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vietnamese Restaurant,Warehouse Store,Watch Shop,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Winery,Women's Store,Yoga Studio
0,Alki Beach,0.0,0.0,0.0,0.016129,0.0,0.0,0.032258,0.0,0.0,...,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bainbridge Island,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Ballard,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01
3,Bellevue,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Belltown,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.01,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Capital Hill,0.0,0.0,0.0,0.0,0.0,0.036585,0.012195,0.012195,0.012195,...,0.0,0.0,0.012195,0.0,0.0,0.0,0.012195,0.0,0.0,0.012195
6,Columbia City,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Downtown,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.01,0.0,...,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.01,0.01
8,Federal Way,0.0,0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.01,...,0.03,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0
9,Freemont,0.04,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.01


#### Let's confirm the new size

In [28]:
seattle_grouped.shape

(29, 258)

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [29]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 5 venues for each neighborhood.

In [30]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = seattle_grouped['Neighborhood']

for ind in np.arange(seattle_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(seattle_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Alki Beach,Park,Beach,Coffee Shop,Ice Cream Shop,Trail
1,Bainbridge Island,Skate Park,Yoga Studio,French Restaurant,Food Truck,Food Stand
2,Ballard,Brewery,Coffee Shop,Ice Cream Shop,Sandwich Place,New American Restaurant
3,Bellevue,Spa,Rental Car Location,Automotive Shop,Coffee Shop,Furniture / Home Store
4,Belltown,Sushi Restaurant,Bar,Breakfast Spot,Sculpture Garden,Pizza Place
5,Capital Hill,Coffee Shop,Sandwich Place,Bus Stop,Italian Restaurant,Garden
6,Columbia City,Mexican Restaurant,Park,Bus Line,Video Store,Medical Center
7,Downtown,Coffee Shop,Hotel,American Restaurant,Spa,Cocktail Bar
8,Federal Way,Japanese Restaurant,Korean Restaurant,Shoe Store,Mexican Restaurant,Miscellaneous Shop
9,Freemont,Coffee Shop,Japanese Restaurant,Zoo Exhibit,Thai Restaurant,Bar


## Cluster Neighborhoods

Run k-means to cluster the neighborhood into 10 clusters.

In [31]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 10

seattle_grouped_clustering = seattle_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(seattle_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]


array([4, 1, 0, 3, 0, 3, 9, 0, 3, 3], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [33]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

seattle_merged = seattle_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
seattle_merged = seattle_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

seattle_merged # check the last columns!

Unnamed: 0.1,Unnamed: 0,Neighborhood,PostalCode,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,0,Federal Way,98003,47.316504,-122.322398,3,Japanese Restaurant,Korean Restaurant,Shoe Store,Mexican Restaurant,Miscellaneous Shop
1,1,Bellevue,98005,47.615044,-122.171758,3,Spa,Rental Car Location,Automotive Shop,Coffee Shop,Furniture / Home Store
2,2,Kirkland,98033,47.66883,-122.192387,6,Asian Restaurant,Sandwich Place,Convenience Store,Grocery Store,Gas Station
3,3,Lynnwood,98037,47.841953,-122.288181,8,Fast Food Restaurant,Pizza Place,Coffee Shop,Pharmacy,Park
4,4,Mercer Island,98040,47.582423,-122.233123,3,Coffee Shop,Pizza Place,Park,Sandwich Place,Pharmacy
5,5,Redmond,98052,47.670119,-122.118237,3,Bakery,Sandwich Place,Coffee Shop,Gym / Fitness Center,Mexican Restaurant
6,6,Renton,98055,47.462337,-122.205506,7,Residential Building (Apartment / Condo),Coffee Shop,Supermarket,Optical Shop,Bar
7,7,Seattle,98101,47.608492,-122.336407,0,Coffee Shop,Hotel,American Restaurant,Spa,Cocktail Bar
8,8,Downtown,98101,47.608492,-122.336407,0,Coffee Shop,Hotel,American Restaurant,Spa,Cocktail Bar
9,9,Capital Hill,98102,47.633822,-122.321545,3,Coffee Shop,Sandwich Place,Bus Stop,Italian Restaurant,Garden


Finally, let's visualize the resulting clusters

In [34]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(seattle_merged['Latitude'], seattle_merged['Longitude'], seattle_merged['Neighborhood'], seattle_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    
#     cluster.astype(int)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
#         color = 'red',
        fill=True,
        fill_color=rainbow[cluster-1],
#         fill_color = 'red',
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Discussion and conclusion

It is no doubt that Seattle is a coffee-saturated city. Based on Four Square API search result, we can see the coffee shop venue category is always the most common venue in different neighborhoods.

Among of all neighborhoods, Northgate is the probably the best neighborhood to start a food delivery business. According to the findings, the top six of the most common venues are related to food or drinks with different varieties. Those venues categories are Sushi Restaurant, Sandwich Place, Mexican Restaurant, Thai Restaurant, Coffee shop and Pizza place. Hence, the chances of getting food order delivery are very high.

From the map of K-means clustering result, the three neighborhoods, Pioneer Square, Downtown and Belltown are very close to each other geographically. The most common venues category in those areas is the Coffee shop, Italian Restaurant, Sushi Restaurant, and American Restaurant, etc. Hence, this special condensed area is definitely a good option for the office location of the food delivery service other than Northgate.

This result of this analysis has some limitations. It cannot reflect the behavior of the customer in that area. People like to visit the restaurants or stop by the coffee shops does not mean they want food delivery services. The customers may enjoy the service and the time of being there, more than just having the food at their own places. Rental cost is not taken into consideration too. The three neighborhoods Pioneer Square, Downtown and Belltown may be the most expensive rental area in Seattle
