# Opening a new Chinese restaurant in Boston, MA


### 1. Introduction/Business Problem

Because of the influx of international students in Boston, especially Chinese students, a good number of Chinese restaurants have opened in recent years. The purpose of this project is to examine each Boston neighborhood using the FourSquare API and find the best location to open a new Chinese restaurant. The target audience is anyone looking to open a Chinese restaurant in the Metro Boston area.

### 2. Project Plan

1. Get list of neighborhoods in Boston by scraping Wikipedia page.
2. Get geo coordinates of all neighborhoods and store in a dataframe.
3. Get venue data from FourSquare API.
4. Perform Clustering and Segmentation.
5. Make final decision of the best location to open a new Chinese restaurant in Boston, MA.

### 3. Use of Data

This project will use two sets of data:
    
1. Neighborhoods in Boston (https://en.wikipedia.org/wiki/Neighborhoods_in_Boston): This data provides the list of Boston neighborhoods, which is later used to look up each neighborhood's geo-location.
2. Venue data in Boston (FourSquare API): This data shows how many Chinese restaurants already exist in each neighborhood, so that the neighborhood with the lowest density of Chinese restaurants can be chosen.


### 4. Project Details

#### 1. Import libraries

In [1]:
# import libraries

import pandas as pd
import numpy as np

# For web scraping
from bs4 import BeautifulSoup as bs
import requests

# To get geo-location data
from geopy.geocoders import Nominatim

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!conda install -c conda-forge folium=0.5.0 --yes 
import folium


#### 2. Get data by scraping Wikipedia

In [2]:
# Get list of neighborhoods in Boston

source = requests.get("https://en.wikipedia.org/wiki/Neighborhoods_in_Boston").text
soup = bs(source, 'lxml')

In [3]:
# Use a name that does not shadow the built-in `list`
neigh_list = soup.find('div', class_='columns')

In [4]:
# Store the neighborhood names in a Python list

neigh = []

for n in neigh_list.find_all('li'):
    neigh.append(n.text)

In [5]:
# Convert the array to a pandas dataframe

df = pd.DataFrame({'Neighborhoods': neigh})

In [6]:
# Clean up the dataframe

df.replace(regex={r'^Dorchester.*$': 'Dorchester', r'^Fenway.*$': 'Fenway Kenmore', r'^Chinatown.*$': 'Chinatown'}, inplace=True)
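
To illustrate what the `^Dorchester.*$`-style patterns do, here is a minimal sketch using the standard `re` module instead of pandas. The raw strings are hypothetical stand-ins for scraped list items (the actual Wikipedia text is not shown here), and `clean_name` is an illustrative helper, not part of the notebook.

```python
import re

# Same patterns as the df.replace call: any entry that starts with the
# keyword is collapsed to the short neighborhood name.
PATTERNS = {
    r'^Dorchester.*$': 'Dorchester',
    r'^Fenway.*$': 'Fenway Kenmore',
    r'^Chinatown.*$': 'Chinatown',
}

def clean_name(raw):
    """Return the cleaned neighborhood name for one scraped entry."""
    for pattern, replacement in PATTERNS.items():
        if re.match(pattern, raw):
            return replacement
    return raw  # entries that match no pattern are left unchanged

# Hypothetical raw entries, for illustration only
samples = ['Dorchester (several sub-areas)', 'Fenway-Kenmore', 'Back Bay']
cleaned = [clean_name(s) for s in samples]
print(cleaned)
```

Anchoring each pattern with `^` and `.*$` means the whole cell is replaced, not just the matching prefix.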

In [7]:
df.shape

(22, 1)

#### 3. Get geo-locations of all neighborhoods

In [8]:
# Get latitude and longitude data for each neighborhood

geolocator = Nominatim(user_agent='boston_agent')

def get_geo_location(neighborhood):
    # geocode() returns None when no match is found, so retry until a result comes back
    location = None
    while location is None:
        location = geolocator.geocode('{}, Boston, MA'.format(neighborhood))
    return [location.latitude, location.longitude]

In [9]:
coords = [ get_geo_location(nbhd) for nbhd in df["Neighborhoods"].tolist() ]

In [10]:
coords

[[42.3554344, -71.1321271],
 [42.3507067, -71.0797297],
 [42.35001105, -71.0669477958571],
 [42.3587085, -71.067829],
 [42.3500971, -71.1564423],
 [42.3778749, -71.0619957],
 [42.3513291, -71.0626228],
 [42.2973205, -71.0744952],
 [52.971148799999995, -0.059809371175602276],
 [42.3750973, -71.0392173],
 [42.34422445, -71.09444515776886],
 [42.2556543, -71.1244963],
 [42.3098201, -71.1203299],
 [42.2675657, -71.0924273],
 [42.33255965, -71.10360773640765],
 [42.3650974, -71.0544954],
 [42.2912093, -71.1244966],
 [42.3248426, -71.0950158],
 [42.3334312, -71.0494949],
 [42.34131, -71.0772298],
 [42.3639186, -71.0638993],
 [42.2792649, -71.1494972]]
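
The ninth pair in this output (latitude about 52.97, longitude about -0.06) lies far from all the others, which suggests the geocoder matched a different place for that query. A simple sanity check could flag such results before they are used. The sketch below assumes a rough bounding box around Boston, MA; the box limits and the helper name `in_boston_bbox` are illustrative assumptions, not official boundaries.

```python
# Rough bounding box around Boston, MA (assumed values, not an official boundary)
LAT_MIN, LAT_MAX = 42.2, 42.45
LON_MIN, LON_MAX = -71.2, -70.95

def in_boston_bbox(lat, lon):
    """Return True if the point falls inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

# A few coordinate pairs copied from the output above
coords_sample = [
    [42.3554344, -71.1321271],
    [42.3507067, -71.0797297],
    [52.971148799999995, -0.059809371175602276],
    [42.3750973, -71.0392173],
]

# Positions (within this sample) of points that fall outside the box
outliers = [i for i, (lat, lon) in enumerate(coords_sample)
            if not in_boston_bbox(lat, lon)]
print(outliers)
```

Any flagged neighborhood would need its coordinates corrected (for example with a more specific query string) before venue data is requested for it.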

In [11]:
df_coords = pd.DataFrame(coords, columns=["Latitude", "Longitude"])

In [12]:
df["Latitude"] = df_coords["Latitude"]
df["Longitude"] = df_coords["Longitude"]

df

| | Neighborhoods | Latitude | Longitude |
|---|---|---|---|
| 0 | Allston | 42.355434 | -71.132127 |
| 1 | Back Bay | 42.350707 | -71.07973 |
| 2 | Bay Village | 42.350011 | -71.066948 |
| 3 | Beacon Hill | 42.358708 | -71.067829 |
| 4 | Brighton | 42.350097 | -71.156442 |
| 5 | Charlestown | 42.377875 | -71.061996 |
| 6 | Chinatown | 42.351329 | -71.062623 |
| 7 | Dorchester | 42.29732 | -71.074495 |
| 8 | Downtown | 52.971149 | -0.059809 |
| 9 | East Boston | 42.375097 | -71.039217 |


#### 4. Create a map of Boston

In [13]:
# get the coordinates of Boston
address = 'Boston, MA'

geolocator = Nominatim(user_agent="boston-agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Boston, MA are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Boston, MA are 42.3602534, -71.0582912.


In [14]:
# create map of Boston using latitude and longitude values

map_bos = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhoods']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_bos)  
    
map_bos

#### 5. Use FourSquare API

In [15]:
# The code was removed by Watson Studio for sharing.
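
The removed cell presumably defined `CLIENT_ID`, `CLIENT_SECRET` and `VERSION`, which the next cell uses. One way to supply them without putting secrets in the notebook is to read them from environment variables. This is only a sketch: the environment variable names are hypothetical, and the `VERSION` date is a placeholder in the `YYYYMMDD` form that the Foursquare v2 API expects for its `v` parameter.

```python
import os

# Hypothetical environment variable names; set them before starting the notebook
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

# API version date in YYYYMMDD form; this particular date is a placeholder
VERSION = '20180605'

if not CLIENT_ID or not CLIENT_SECRET:
    print('Foursquare credentials are not set')
```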

In [16]:
radius = 2000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhoods']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
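
The loop above can be split into two small, testable pieces: building the request URL and turning the returned items into rows. The sketch below is an assumption about how that might look, not the notebook's own code. `build_explore_url` and `extract_rows` are hypothetical helper names, the query is encoded with the standard `urllib.parse.urlencode`, and a venue with an empty `categories` list gets `None` instead of raising an `IndexError`.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL with properly encoded query parameters."""
    query = urlencode({
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    })
    return 'https://api.foursquare.com/v2/venues/explore?' + query

def extract_rows(neighborhood, lat, lng, items):
    """Flatten response items into tuples matching the venues list layout."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Guard against venues that come back without any category
        category = categories[0]['name'] if categories else None
        rows.append((neighborhood, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows

# Example items: the first mirrors a row shown in this notebook,
# the second is a made-up venue without categories.
items = [
    {'venue': {'name': 'Mala Restaurant',
               'location': {'lat': 42.35296, 'lng': -71.131033},
               'categories': [{'name': 'Chinese Restaurant'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 42.35, 'lng': -71.13},
               'categories': []}},
]
url = build_explore_url('ID', 'SECRET', '20180605', 42.355434, -71.132127, 2000, 100)
rows = extract_rows('Allston', 42.355434, -71.132127, items)
```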

In [17]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(1914, 7)


| | Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|---|
| 0 | Allston | 42.355434 | -71.132127 | Lulu's Allston | 42.355068 | -71.134107 | Comfort Food Restaurant |
| 1 | Allston | 42.355434 | -71.132127 | Fish Market Sushi Bar | 42.353039 | -71.132975 | Sushi Restaurant |
| 2 | Allston | 42.355434 | -71.132127 | BonChon Chicken | 42.353105 | -71.130921 | Fried Chicken Joint |
| 3 | Allston | 42.355434 | -71.132127 | Mala Restaurant | 42.35296 | -71.131033 | Chinese Restaurant |
| 4 | Allston | 42.355434 | -71.132127 | Kaju Tofu House | 42.354329 | -71.132374 | Korean Restaurant |
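
Each row pairs a neighborhood center with a venue's own coordinates, so the great-circle distance between the two gives a quick check that returned venues really fall inside the 2000 m search radius. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km; `haversine_m` is an illustrative helper, not part of the notebook.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# First row of the table above: Allston center vs. Lulu's Allston
d = haversine_m(42.355434, -71.132127, 42.355068, -71.134107)
print(round(d), 'm from the neighborhood center')
```

The same function can also be applied to pairs of neighborhood centers: centers that are much closer together than the search radius will share most of their venues.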


In [18]:
venues_df.groupby(["Neighborhood"]).count()

| Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|
| Allston | 100 | 100 | 100 | 100 | 100 | 100 |
| Back Bay | 100 | 100 | 100 | 100 | 100 | 100 |
| Bay Village | 100 | 100 | 100 | 100 | 100 | 100 |
| Beacon Hill | 100 | 100 | 100 | 100 | 100 | 100 |
| Brighton | 100 | 100 | 100 | 100 | 100 | 100 |
| Charlestown | 100 | 100 | 100 | 100 | 100 | 100 |
| Chinatown | 100 | 100 | 100 | 100 | 100 | 100 |
| Dorchester | 65 | 65 | 65 | 65 | 65 | 65 |
| Downtown | 9 | 9 | 9 | 9 | 9 | 9 |
| East Boston | 100 | 100 | 100 | 100 | 100 | 100 |


In [19]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 215 unique categories.


In [20]:
# print out the list of categories
venues_df['VenueCategory'].unique()[:50]

array(['Comfort Food Restaurant', 'Sushi Restaurant',
       'Fried Chicken Joint', 'Chinese Restaurant', 'Korean Restaurant',
       'Japanese Restaurant', 'Vegetarian / Vegan Restaurant',
       'Rock Club', 'Bakery', 'Gastropub', 'Italian Restaurant',
       'Taco Place', 'Indian Restaurant', 'Board Shop', 'Bubble Tea Shop',
       'Liquor Store', 'Ice Cream Shop', 'Tea Room', 'Frozen Yogurt Shop',
       'Mediterranean Restaurant', 'Pizza Place', 'Yoga Studio',
       'Electronics Store', 'Hot Dog Joint', 'Athletics & Sports',
       'Grocery Store', 'Thrift / Vintage Store', 'Diner', 'Food Court',
       'Seafood Restaurant', 'Bar', 'Café', 'Vietnamese Restaurant',
       'Hockey Rink', 'Spa', 'Thai Restaurant', 'Burmese Restaurant',
       'Hotpot Restaurant', 'Food Truck', 'Gym', 'Department Store',
       'Afghan Restaurant', 'Gym / Fitness Center', 'Coffee Shop', 'Park',
       'Restaurant', 'Bagel Shop', 'Donut Shop', 'Food & Drink Shop',
       'Hardware Store'], dtype=object)

#### 6. Analyzing each neighborhood

In [21]:
# one hot encoding
df_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
df_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [df_onehot.columns[-1]] + df_onehot.columns[:-1].to_list()
df_onehot = df_onehot[fixed_columns]

print(df_onehot.shape)
df_onehot.head()

(1914, 216)


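| | Neighborhoods | Accessories Store | Afghan Restaurant | African Restaurant | American Restaurant | Antique Shop | Aquarium | Art Gallery | Art Museum | Asian Restaurant | ... | Video Store | Vietnamese Restaurant | Whisky Bar | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio | Zoo | Zoo Exhibit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allston | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Allston | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Allston | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Allston | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Allston | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |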


In [22]:
df_grouped = df_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(df_grouped.shape)
df_grouped

(22, 216)


| | Neighborhoods | Accessories Store | Afghan Restaurant | African Restaurant | American Restaurant | Antique Shop | Aquarium | Art Gallery | Art Museum | Asian Restaurant | ... | Video Store | Vietnamese Restaurant | Whisky Bar | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio | Zoo | Zoo Exhibit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allston | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 |
| 1 | Back Bay | 0.01 | 0.0 | 0.0 | 0.04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.01 | 0.0 | 0.01 | 0.03 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 |
| 2 | Bay Village | 0.01 | 0.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | ... | 0.0 | 0.0 | 0.0 | 0.01 | 0.02 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 |
| 3 | Beacon Hill | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 |
| 4 | Brighton | 0.0 | 0.01 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | Charlestown | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 |
| 6 | Chinatown | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | ... | 0.0 | 0.0 | 0.0 | 0.01 | 0.03 | 0.0 | 0.0 | 0.03 | 0.0 | 0.0 |
| 7 | Dorchester | 0.0 | 0.0 | 0.0 | 0.015385 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.061538 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.015385 | 0.015385 |
| 8 | Downtown | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | East Boston | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.02 | 0.01 | 0.01 | 0.0 | ... | 0.0 | 0.01 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 |


In [23]:
len(df_grouped[df_grouped["Chinese Restaurant"] > 0])

13

In [24]:
df_cr = df_grouped[["Neighborhoods","Chinese Restaurant"]]
df_cr.head()

| | Neighborhoods | Chinese Restaurant |
|---|---|---|
| 0 | Allston | 0.02 |
| 1 | Back Bay | 0.0 |
| 2 | Bay Village | 0.01 |
| 3 | Beacon Hill | 0.0 |
| 4 | Brighton | 0.01 |


#### 7. Clustering neighborhoods

In [25]:
# set number of clusters
kclusters = 3

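# keep only the numeric feature; the positional axis argument is not accepted by newer pandas versions
df_clustering = df_cr.drop(columns=["Neighborhoods"])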

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 0, 1, 2, 1, 2, 2, 2, 2, 1, 0],
      dtype=int32)
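
Since the clustering uses a single feature, k-means here amounts to grouping neighborhoods by how close their Chinese restaurant share is to one of three centroids. The sketch below is a plain-Python version of Lloyd's algorithm on scalars, meant only to show the idea. It is not scikit-learn's implementation: the evenly spaced starting centroids are an assumption (scikit-learn uses a different initialisation), so the label numbers need not match the output above. The input values are shares that appear in this notebook's output.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on scalars; assumes k >= 2 and non-constant values."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the centroids evenly over the value range
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid becomes the mean of its members
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:  # converged: nothing moved
            break
        centroids = updated
    return labels, centroids

shares = [0.02, 0.0, 0.01, 0.0, 0.01, 0.015625, 0.020408]
labels, centroids = kmeans_1d(shares, k=3)
print(labels)
print(centroids)
```

Cluster numbers are arbitrary identifiers; what matters is which values end up together and where the centroids sit.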

In [26]:
# create a new dataframe that holds the Chinese restaurant frequency and the cluster label for each neighborhood
df_merged = df_cr.copy()

# add clustering labels
df_merged["Cluster Labels"] = kmeans.labels_

In [27]:
df_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
df_merged.head()

| | Neighborhood | Chinese Restaurant | Cluster Labels |
|---|---|---|---|
| 0 | Allston | 0.02 | 0 |
| 1 | Back Bay | 0.0 | 1 |
| 2 | Bay Village | 0.01 | 2 |
| 3 | Beacon Hill | 0.0 | 1 |
| 4 | Brighton | 0.01 | 2 |


In [28]:
# join df_merged with df to add latitude/longitude for each neighborhood
df_merged = df_merged.join(df.set_index("Neighborhoods"), on="Neighborhood")

print(df_merged.shape)
df_merged.head() # check the last columns!

(22, 5)


| | Neighborhood | Chinese Restaurant | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Allston | 0.02 | 0 | 42.355434 | -71.132127 |
| 1 | Back Bay | 0.0 | 1 | 42.350707 | -71.07973 |
| 2 | Bay Village | 0.01 | 2 | 42.350011 | -71.066948 |
| 3 | Beacon Hill | 0.0 | 1 | 42.358708 | -71.067829 |
| 4 | Brighton | 0.01 | 2 | 42.350097 | -71.156442 |


In [29]:
# sort the results by Cluster Labels
print(df_merged.shape)
df_merged.sort_values(["Cluster Labels"], inplace=True)
df_merged

(22, 5)


| | Neighborhood | Chinese Restaurant | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Allston | 0.02 | 0 | 42.355434 | -71.132127 |
| 12 | Jamaica Plain | 0.015625 | 0 | 42.30982 | -71.12033 |
| 21 | West Roxbury | 0.020408 | 0 | 42.279265 | -71.149497 |
| 3 | Beacon Hill | 0.0 | 1 | 42.358708 | -71.067829 |
| 15 | North End | 0.0 | 1 | 42.365097 | -71.054495 |
| 13 | Mattapan | 0.0 | 1 | 42.267566 | -71.092427 |
| 11 | Hyde Park | 0.0 | 1 | 42.255654 | -71.124496 |
| 20 | West End | 0.0 | 1 | 42.363919 | -71.063899 |
| 1 | Back Bay | 0.0 | 1 | 42.350707 | -71.07973 |
| 7 | Dorchester | 0.0 | 1 | 42.29732 | -71.074495 |


In [30]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label
for lat, lon, poi, cluster in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Neighborhood'], df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [31]:
# Cluster 0
df_merged.loc[df_merged['Cluster Labels'] == 0]

| | Neighborhood | Chinese Restaurant | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Allston | 0.02 | 0 | 42.355434 | -71.132127 |
| 12 | Jamaica Plain | 0.015625 | 0 | 42.30982 | -71.12033 |
| 21 | West Roxbury | 0.020408 | 0 | 42.279265 | -71.149497 |


In [32]:
# Cluster 1
df_merged.loc[df_merged['Cluster Labels'] == 1]

| | Neighborhood | Chinese Restaurant | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 3 | Beacon Hill | 0.0 | 1 | 42.358708 | -71.067829 |
| 15 | North End | 0.0 | 1 | 42.365097 | -71.054495 |
| 13 | Mattapan | 0.0 | 1 | 42.267566 | -71.092427 |
| 11 | Hyde Park | 0.0 | 1 | 42.255654 | -71.124496 |
| 20 | West End | 0.0 | 1 | 42.363919 | -71.063899 |
| 1 | Back Bay | 0.0 | 1 | 42.350707 | -71.07973 |
| 7 | Dorchester | 0.0 | 1 | 42.29732 | -71.074495 |
| 5 | Charlestown | 0.0 | 1 | 42.377875 | -71.061996 |
| 8 | Downtown | 0.0 | 1 | 52.971149 | -0.059809 |


In [33]:
# Cluster 2
df_merged.loc[df_merged['Cluster Labels'] == 2]

| | Neighborhood | Chinese Restaurant | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 9 | East Boston | 0.01 | 2 | 42.375097 | -71.039217 |
| 6 | Chinatown | 0.01 | 2 | 42.351329 | -71.062623 |
| 2 | Bay Village | 0.01 | 2 | 42.350011 | -71.066948 |
| 14 | Mission Hill | 0.01 | 2 | 42.33256 | -71.103608 |
| 4 | Brighton | 0.01 | 2 | 42.350097 | -71.156442 |
| 16 | Roslindale | 0.010753 | 2 | 42.291209 | -71.124497 |
| 17 | Roxbury | 0.01 | 2 | 42.324843 | -71.095016 |
| 18 | South Boston | 0.01 | 2 | 42.333431 | -71.049495 |
| 19 | South End | 0.01 | 2 | 42.34131 | -71.07723 |
| 10 | Fenway Kenmore | 0.01 | 2 | 42.344224 | -71.094445 |
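
To compare the clusters at a glance, the average Chinese restaurant share per cluster can be recomputed from the three cluster tables. The sketch below copies the `Chinese Restaurant` values from those tables by hand; `cluster_shares` and `cluster_means` are illustrative names, not variables from the notebook.

```python
from statistics import mean

# Chinese Restaurant shares copied from the Cluster 0, 1 and 2 tables
cluster_shares = {
    0: [0.02, 0.015625, 0.020408],
    1: [0.0] * 9,
    2: [0.01, 0.01, 0.01, 0.01, 0.01, 0.010753, 0.01, 0.01, 0.01, 0.01],
}

# Average share of Chinese restaurants among returned venues, per cluster
cluster_means = {label: mean(values) for label, values in cluster_shares.items()}

for label, value in sorted(cluster_means.items()):
    print('Cluster {}: {} neighborhoods, mean share {:.4f}'.format(
        label, len(cluster_shares[label]), value))
```

In a notebook session the same summary could presumably be obtained with a `groupby` on `Cluster Labels`, but the hand-copied version keeps the example independent of the live data.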


### 5. Conclusion

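Since cluster 0 has a density of ~0.02 and cluster 2 has a density of ~0.01, while every neighborhood in cluster 1 has a density of 0, it's probably a good idea to choose one of the neighborhoods in cluster 1 to open a new Chinese restaurant. Within that cluster, Downtown, Back Bay, Beacon Hill and North End are worth considering first, as these areas have larger populations. Note that the Downtown result should be treated with caution: its geocoded coordinates (52.971149, -0.059809) do not lie in Boston, MA, and only 9 venues were returned for it.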