# Capstone Project - The Battle of the Neighborhoods

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#item1">Introduction: Business Problem</a>

2.  <a href="#item2">Data</a>

3.  <a href="#item3">Methodology and Analysis</a>

4.  <a href="#item4">Results and Discussion</a>

5.  <a href="#item5">Conclusion</a>  
    </font>
    </div>

<a id="item1"></a>

## Introduction: Business Problem

Cafe ABC has been operating in Rosemead, CA since 2018. The cafe serves iconic Taiwanese dishes such as popcorn chicken, beef noodle soup, stinky tofu, minced meat over rice, and various flavors of boba drinks. The owners are considering opening a second location in Southern California due to the popularity of the first branch.

Two factors guide the choice of a second location: the area should not already have a high concentration of restaurants, to limit competition; and the neighborhood's demographics should be such that a Taiwanese eatery would not feel too unfamiliar.

We will leverage data science to identify a few cities that are suitable as a second location. This report will be useful for stakeholders interested in opening a Taiwanese restaurant or cafe in the Southern California region.

<a id="item2"></a>

## Data

We will define our area of interest to be cities in Los Angeles county in Southern California, which is the county that the original location belongs to. Based on the definition of our problem, factors that will influence the decision will include:

* How similar the city's demographics are to those of Rosemead, CA, as a proxy for how likely an Asian eatery is to succeed culturally
* Number of existing restaurants in the same city

The following data sources will be used to extract and generate the required information:

* Geospatial data for cities in LA county will be sourced from Los Angeles GeoHub, an open data hub for location-based data
* Demographics data will be obtained by web scraping sites such as Wikipedia, whose figures are ultimately sourced from the United States Census Bureau
* The number of restaurants, along with their types and locations, will be obtained using the Foursquare API

### Geospatial Data

First we install and import the necessary packages.

In [227]:
!pip install bs4
from bs4 import BeautifulSoup # to help with web scraping

import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas.io.json.json_normalize is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Imports and installations done.')

Imports and installations done.


Since our goal is to identify cities suitable for a second location for the cafe, we need a list of cities in Los Angeles county, along with their geospatial data such as longitudes and latitudes. Los Angeles GeoHub is an online portal with public access to the city's location-based data. One of its datasets includes a list of city halls in LA county. We will extract the city name, zip code, longitude, and latitude from this dataset.

In [2]:
!wget -q -O 'geo_data.json' https://opendata.arcgis.com/datasets/db2c52f3ddc945cb988c393deac1d487_67.geojson
print('Data downloaded!')

Data downloaded!


In [3]:
with open('geo_data.json') as json_data:
    losangeles_data = json.load(json_data)

We see that within this dataset, the relevant data (city, zip code, latitude, longitude) are all in the features key, so we define a new variable for this data.

In [4]:
geospatial_data = losangeles_data['features']
geospatial_data[0]

{'type': 'Feature',
 'properties': {'OBJECTID': 2576,
  'source': 'City of Manhattan Beach',
  'ext_id': '',
  'cat1': 'Government',
  'cat2': 'City Halls',
  'cat3': None,
  'org_name': 'City of Manhattan Beach',
  'Name': 'City Of Manhattan Beach',
  'addrln1': '1400 Highland Ave',
  'addrln2': None,
  'city': 'Manhattan Beach',
  'state': 'CA',
  'hours': None,
  'phones': 'FAX (310) 802-5001,  Service/Intake and Administration (310) 802-5000, City Clerk Service/Intake (310) 802-5056, Permits Service/Intake (310) 802-5536, City Attorney Service/Intake (310) 802-5061, Business Licenses Service/Intake (310) 802-5558, Permits S',
  'url': 'http://www.citymb.info',
  'info1': None,
  'info2': None,
  'post_id': 2815,
  'description': '',
  'zip': '90266',
  'link': 'http://egis3.lacounty.gov/lms/?p=2815',
  'use_type': 'publish',
  'latitude': 33.88728147,
  'longitude': -118.41060698,
  'date_updated': '2011-02-09T11:08:51Z',
  'email': None,
  'dis_status': None,
  'POINT_X': 6437047.

We now want to transform the json data into a *pandas* dataframe.

In [5]:
# define dataframe columns
column_names = ['City','ZipCode','Latitude','Longitude']

# instantiate the dataframe
cities = pd.DataFrame(columns=column_names)

In [6]:
# collect one row per city hall, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in geospatial_data:
    properties = data['properties']
    rows.append({'City': properties['city'],
                 'ZipCode': properties['zip'],
                 'Latitude': properties['latitude'],
                 'Longitude': properties['longitude']})

cities = pd.DataFrame(rows, columns=column_names)

In [8]:
# delete the row with index 4 since it has no value
cities = cities.drop([4],axis=0)

In [9]:
cities.head()

Unnamed: 0,City,ZipCode,Latitude,Longitude
0,Manhattan Beach,90266,33.887281,-118.410607
1,Cerritos,90703,33.867227,-118.063873
2,Claremont,91711,34.095726,-117.716532
3,Burbank,91502,34.18182,-118.30789
5,Agoura Hills,91301,34.144303,-118.777612


We now have a dataframe with city names, zip codes, latitudes, and longitudes.

In [10]:
cities.shape

(91, 4)

### Demographics Data

We will obtain demographics data by web scraping a Wikipedia page.

In [11]:
#The below url contains html tables with data on cities in Los Angeles county
url = "https://en.wikipedia.org/wiki/Demographics_of_Los_Angeles_County"
data  = requests.get(url).text
soup = BeautifulSoup(data,"html5lib")

In [12]:
#find all html tables in the web page
tables = soup.find_all('table')
len(tables)

4

We create a dataframe and loop through the data to fill in each row.

In [13]:
columns = ["City","Total Population","White","African American","Native American","Asian","Pacific Islander","Other","Two or More","Hispanic"]

# collect one dict per table row, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for row in tables[3].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:
        city = col[0].text.replace('\n','')
        # strip thousands separators and newlines from the nine numeric cells
        counts = [cell.text.replace(',','').replace('\n','') for cell in col[1:10]]
        rows.append(dict(zip(columns, [city] + counts)))

demographics_data = pd.DataFrame(rows, columns=columns)

In [14]:
demographics_data[0:10]

Unnamed: 0,City,Total Population,White,African American,Native American,Asian,Pacific Islander,Other,Two or More,Hispanic
0,The County,TotalPopulation,White,AfricanAmerican,NativeAmerican,Asian,PacificIslander,otherraces,two ormore races,Hispanicor Latino(of any race)
1,Los Angeles County,9818605,4936599,856874,72828,1346865,26094,2140632,438713,4687889
2,,100%,50.3%,8.7%,0.7%,13.7%,0.3%,21.8%,4.5%,47.7%
3,Incorporatedcity,TotalPopulation,White,AfricanAmerican,NativeAmerican,Asian,PacificIslander,otherraces,two ormore races,Hispanicor Latino(of any race)
4,Agoura Hills,20330,17147,267,51,1521,24,590,730,1936
5,Alhambra,83089,23521,1281,538,43957,81,10805,2906,28582
6,Arcadia,56364,18191,681,186,33353,16,2352,1585,6799
7,Artesia,16522,6446,589,94,6131,40,2630,592,5910
8,Avalon,3728,2313,20,22,49,13,1137,174,2079
9,Azusa,46361,26715,1499,562,4054,87,11270,2174,31328


We see that the per-city data begins at the row with index 4, and that there are a few sub-header rows that we should drop.

In [15]:
demographics_data = demographics_data.drop([0,1,2,3,92,146,147])
demographics_data[0:5]

Unnamed: 0,City,Total Population,White,African American,Native American,Asian,Pacific Islander,Other,Two or More,Hispanic
4,Agoura Hills,20330,17147,267,51,1521,24,590,730,1936
5,Alhambra,83089,23521,1281,538,43957,81,10805,2906,28582
6,Arcadia,56364,18191,681,186,33353,16,2352,1585,6799
7,Artesia,16522,6446,589,94,6131,40,2630,592,5910
8,Avalon,3728,2313,20,22,49,13,1137,174,2079
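The drop above relies on hard-coded row indices, which would shift if the Wikipedia table changed. As a rough sketch of an alternative, the rows could be filtered by whether the population cell parses as a number. The toy rows below are copied from the scraped output; the `aggregate_rows` list is an assumption, since totals such as the county row parse as numbers and have to be excluded by name.

```python
import pandas as pd

# Toy rows copied from the scraped output above.
scraped = pd.DataFrame({
    "City": ["The County", "Los Angeles County", "", "Incorporatedcity",
             "Agoura Hills", "Alhambra"],
    "Total Population": ["TotalPopulation", "9818605", "100%", "TotalPopulation",
                         "20330", "83089"],
})

# Sub-header and percentage rows fail numeric conversion and become NaN.
population = pd.to_numeric(scraped["Total Population"], errors="coerce")

# Aggregate rows are numeric, so they are excluded by name (assumed list).
aggregate_rows = ["Los Angeles County"]
keep = population.notna() & ~scraped["City"].isin(aggregate_rows)

cleaned = scraped[keep]
print(cleaned["City"].tolist())
```

Any other total or sub-header rows in the full table would need the same check before relying on this filter.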


To compare cities of different sizes, it is better to convert the absolute count for each demographic group into a share of the city's total population. First we need to convert every column except the City column to integers.

In [16]:
demographics_data["Total Population"] = pd.to_numeric(demographics_data["Total Population"])
demographics_data["White"] = pd.to_numeric(demographics_data["White"])
demographics_data["African American"] = pd.to_numeric(demographics_data["African American"])
demographics_data["Native American"] = pd.to_numeric(demographics_data["Native American"])
demographics_data["Asian"] = pd.to_numeric(demographics_data["Asian"])
demographics_data["Pacific Islander"] = pd.to_numeric(demographics_data["Pacific Islander"])
demographics_data["Other"] = pd.to_numeric(demographics_data["Other"])
demographics_data["Two or More"] = pd.to_numeric(demographics_data["Two or More"])
demographics_data["Hispanic"] = pd.to_numeric(demographics_data["Hispanic"])

In [17]:
demographics_data.dtypes

City                object
Total Population     int64
White                int64
African American     int64
Native American      int64
Asian                int64
Pacific Islander     int64
Other                int64
Two or More          int64
Hispanic             int64
dtype: object

Now that the columns are integers, we perform the division and replace the original columns.

In [18]:
demographics_data["White"] = demographics_data["White"] / demographics_data["Total Population"]
demographics_data["African American"] = demographics_data["African American"] / demographics_data["Total Population"]
demographics_data["Native American"] = demographics_data["Native American"] /demographics_data["Total Population"]
demographics_data["Asian"] = demographics_data["Asian"] / demographics_data["Total Population"]
demographics_data["Pacific Islander"] = demographics_data["Pacific Islander"] / demographics_data["Total Population"]
demographics_data["Other"] = demographics_data["Other"] / demographics_data["Total Population"]
demographics_data["Two or More"] = demographics_data["Two or More"]/ demographics_data["Total Population"]
demographics_data["Hispanic"] = demographics_data["Hispanic"] / demographics_data["Total Population"]

In [19]:
demographics_data[0:5]

Unnamed: 0,City,Total Population,White,African American,Native American,Asian,Pacific Islander,Other,Two or More,Hispanic
4,Agoura Hills,20330,0.843433,0.013133,0.002509,0.074816,0.001181,0.029021,0.035908,0.095229
5,Alhambra,83089,0.283082,0.015417,0.006475,0.529035,0.000975,0.130041,0.034975,0.343993
6,Arcadia,56364,0.322741,0.012082,0.0033,0.591743,0.000284,0.041729,0.028121,0.120627
7,Artesia,16522,0.390146,0.035649,0.005689,0.371081,0.002421,0.159182,0.035831,0.357705
8,Avalon,3728,0.62044,0.005365,0.005901,0.013144,0.003487,0.304989,0.046674,0.557672


Now we have a dataframe that lists each city in LA county, along with its demographic breakdown expressed as fractions of the total population.
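The column-by-column conversion and division above can also be written in a vectorized form. This is a minimal sketch on two rows and two demographic columns copied from the table above, not a replacement for the cells in this notebook; the variable names `raw`, `count_cols`, and `group_cols` are introduced here for illustration.

```python
import pandas as pd

# Two rows copied from the scraped table, still stored as strings.
raw = pd.DataFrame({
    "City": ["Agoura Hills", "Alhambra"],
    "Total Population": ["20330", "83089"],
    "White": ["17147", "23521"],
    "Asian": ["1521", "43957"],
})

# Convert every column except City to numbers in one step.
count_cols = [c for c in raw.columns if c != "City"]
raw[count_cols] = raw[count_cols].apply(pd.to_numeric)

# Divide each demographic column by the total population, row by row.
group_cols = [c for c in count_cols if c != "Total Population"]
raw[group_cols] = raw[group_cols].div(raw["Total Population"], axis=0)

print(raw)
```

Passing `axis=0` to `div` aligns the divisor with the rows, so each city's counts are divided by that city's own population.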

In [20]:
demographics_data.shape

(141, 10)

We will now join the geospatial and demographics tables, keeping only the cities for which we have both kinds of data.

In [21]:
cities.head()

Unnamed: 0,City,ZipCode,Latitude,Longitude
0,Manhattan Beach,90266,33.887281,-118.410607
1,Cerritos,90703,33.867227,-118.063873
2,Claremont,91711,34.095726,-117.716532
3,Burbank,91502,34.18182,-118.30789
5,Agoura Hills,91301,34.144303,-118.777612


In [22]:
result = pd.merge(cities,demographics_data,how='inner', on='City')

In [23]:
result.head()

Unnamed: 0,City,ZipCode,Latitude,Longitude,Total Population,White,African American,Native American,Asian,Pacific Islander,Other,Two or More,Hispanic
0,Manhattan Beach,90266,33.887281,-118.410607,35135,0.844912,0.008254,0.001679,0.08604,0.001395,0.011641,0.046079,0.069446
1,Cerritos,90703,33.867227,-118.063873,49041,0.231255,0.069085,0.002671,0.619135,0.002814,0.037153,0.037887,0.119961
2,Claremont,91711,34.095726,-117.716532,34926,0.706236,0.047271,0.004925,0.130676,0.001088,0.057693,0.05211,0.198105
3,Burbank,91502,34.18182,-118.30789,103340,0.727376,0.02516,0.004703,0.116189,0.000861,0.077405,0.048307,0.24492
4,Agoura Hills,91301,34.144303,-118.777612,20330,0.843433,0.013133,0.002509,0.074816,0.001181,0.029021,0.035908,0.095229


In [24]:
result.shape

(89, 13)
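The inner join keeps 89 of the 91 city-hall rows, so a couple of cities did not match by name. One way to see which rows fall out is an outer merge with `indicator=True`. The sketch below uses toy frames: the Cerritos, Burbank, and Avalon populations are copied from the tables above, while "Hypothetical City" is an invented name standing in for a city hall whose name has no match in the census table.

```python
import pandas as pd

# Toy stand-ins for the `cities` and `demographics_data` dataframes.
city_halls = pd.DataFrame({"City": ["Cerritos", "Burbank", "Hypothetical City"]})
census = pd.DataFrame({
    "City": ["Cerritos", "Burbank", "Avalon"],
    "Total Population": [49041, 103340, 3728],
})

# An outer merge keeps every row; the _merge column records where each came from.
check = pd.merge(city_halls, census, how="outer", on="City", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["City", "_merge"]]
print(unmatched)
```

Rows marked `left_only` exist only in the city-hall data, and rows marked `right_only` exist only in the census table.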

In [25]:
neighborhoods = result[["City","ZipCode","Latitude","Longitude"]]
neighborhoods.head()

Unnamed: 0,City,ZipCode,Latitude,Longitude
0,Manhattan Beach,90266,33.887281,-118.410607
1,Cerritos,90703,33.867227,-118.063873
2,Claremont,91711,34.095726,-117.716532
3,Burbank,91502,34.18182,-118.30789
4,Agoura Hills,91301,34.144303,-118.777612


Let's see what the geospatial data for Rosemead looks like:

In [27]:
neighborhoods.loc[neighborhoods.City=='Rosemead']

Unnamed: 0,City,ZipCode,Latitude,Longitude
59,Rosemead,91770,34.080568,-118.076757


### Foursquare Data

Now, we'll use the Foursquare API to get information on restaurants in each city. We are interested in venues in the 'food' category, and we will focus on venues that are close to the city center, i.e. the location of the city hall that we obtained earlier. Let's first obtain the dataset for Rosemead; we will obtain the venue data for the remaining cities after we narrow them down to those that are similar to Rosemead, CA in terms of demographics.

Foursquare credentials are defined in the hidden cell below.

In [26]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
ACCESS_TOKEN = 'YOUR_FOURSQUARE_ACCESS_TOKEN' # your Foursquare Access Token
VERSION = '20201231'
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET:YOUR_FOURSQUARE_CLIENT_SECRET


Let's first get the Foursquare data for the city of Rosemead. Radius is set to 1 mile from the city hall coordinates (i.e. about 1600 meters).
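The request URL in the next cells is assembled with string formatting. As a small sketch of an alternative, the same query string can be built with `urllib.parse.urlencode`, which escapes each value. The helper name `build_explore_url` is hypothetical, the credentials are placeholders, and no request is sent here.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lon, category_id,
                      radius=1600, limit=200, version="20201231"):
    """Return the venues/explore URL used in this notebook with escaped parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lon}",       # latitude,longitude of the city hall
        "categoryId": category_id,  # root category for food venues
        "radius": radius,           # metres; 1600 is roughly one mile
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials; coordinates are the Rosemead city hall values above.
url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET",
                        34.080568, -118.076757, "4d4b7105d754a06374d81259")
print(url)
```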

In [28]:
latitudes = neighborhoods.loc[59,'Latitude']
longitudes = neighborhoods.loc[59,'Longitude']

In [40]:
food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues
radius=1600
limit=200

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, VERSION, latitudes, longitudes, food_category, radius, limit)
results = requests.get(url).json()
venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues)


In [42]:
nearby_venues[0:2]

Unnamed: 0,referralId,reasons.count,reasons.items,venue.id,venue.name,venue.location.address,venue.location.crossStreet,venue.location.lat,venue.location.lng,venue.location.labeledLatLngs,...,venue.photos.count,venue.photos.groups,venue.location.neighborhood,venue.venuePage.id,venue.delivery.id,venue.delivery.url,venue.delivery.provider.name,venue.delivery.provider.icon.prefix,venue.delivery.provider.icon.sizes,venue.delivery.provider.icon.name
0,e-0-4b6b2a9bf964a52003f72be3-0,0,"[{'summary': 'This spot is popular', 'type': '...",4b6b2a9bf964a52003f72be3,In-N-Out Burger,4242 Rosemead Blvd,at Mission Dr,34.083733,-118.073195,"[{'label': 'display', 'lat': 34.08373322493041...",...,0,[],,,,,,,,
1,e-0-4b76f48df964a520366e2ee3-1,0,"[{'summary': 'This spot is popular', 'type': '...",4b76f48df964a520366e2ee3,Jim's Famous Quarterpound Burger,8749 Valley Blvd,btwn Bartlett & Muscatel Ave,34.081002,-118.07901,"[{'label': 'display', 'lat': 34.08100180303224...",...,0,[],,,,,,,,


In [41]:
nearby_venues.shape

(100, 28)

We see that the query returned 100 food venues within a mile of the city center of Rosemead, CA.

Now that we have gathered the basic data required, we are ready to perform some analysis. 

<a id="item3"></a>

## Methodology and Analysis

In this project, we will identify cities in Southern California that may be suitable for a second location for the Taiwanese eatery with its original location in Rosemead, CA. 

We will first narrow our selections to cities that are similar to Rosemead, CA from a demographics perspective, using k-means clustering to group cities in LA county into clusters. The data used will be the latest census results. 

Next, we will obtain Foursquare venues to measure the concentration of restaurants within a mile of each city center. We will use heatmaps to compare restaurant concentration and rank the cities.

In [43]:
!pip install folium
import folium # map rendering library



First, we create a map of Los Angeles with each city shown as a marker.

In [262]:
# create map of LA using latitude and longitude values
latitude=34.0522
longitude=-118.2437
map_la = folium.Map(location=[latitude,longitude],zoom_start=10)

# add markers to map
for lat, lng, label in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['City']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_la)  
    
map_la

### Clustering by Demographics

We want to explore which neighborhoods are most similar to Rosemead from a demographics perspective, such that introducing a Taiwanese restaurant would not be too unfamiliar.

In [45]:
la_demo_grouped_clustering = result[["White","African American","Native American","Asian","Pacific Islander","Other","Two or More","Hispanic"]]
la_demo_grouped_clustering.head()

Unnamed: 0,White,African American,Native American,Asian,Pacific Islander,Other,Two or More,Hispanic
0,0.844912,0.008254,0.001679,0.08604,0.001395,0.011641,0.046079,0.069446
1,0.231255,0.069085,0.002671,0.619135,0.002814,0.037153,0.037887,0.119961
2,0.706236,0.047271,0.004925,0.130676,0.001088,0.057693,0.05211,0.198105
3,0.727376,0.02516,0.004703,0.116189,0.000861,0.077405,0.048307,0.24492
4,0.843433,0.013133,0.002509,0.074816,0.001181,0.029021,0.035908,0.095229


In [46]:
# Run k-means to cluster the neighborhoods into 5 clusters
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(la_demo_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([4, 3, 0, 0, 4, 0, 4, 0, 2, 4], dtype=int32)
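The number of clusters is fixed at 5 above without a stated justification. As a hedged sketch of one common check, the silhouette score can be compared across several values of k. The example below is only illustrative: it uses nine cities and three of the eight demographic columns (White, Asian, Hispanic), with values copied from the tables above, so its result should not be read as the right k for the full dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# (White, Asian, Hispanic) fractions copied from the tables above.
X = np.array([
    [0.844912, 0.086040, 0.069446],  # Manhattan Beach
    [0.231255, 0.619135, 0.119961],  # Cerritos
    [0.706236, 0.130676, 0.198105],  # Claremont
    [0.727376, 0.116189, 0.244920],  # Burbank
    [0.843433, 0.074816, 0.095229],  # Agoura Hills
    [0.283082, 0.529035, 0.343993],  # Alhambra
    [0.322741, 0.591743, 0.120627],  # Arcadia
    [0.390146, 0.371081, 0.357705],  # Artesia
    [0.620440, 0.013144, 0.557672],  # Avalon
])

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

A higher silhouette score indicates tighter, better-separated clusters; on the full 89-city dataframe the same loop could be run on `la_demo_grouped_clustering`.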

In [47]:
# create a new dataframe that includes the cluster as well as latitude and longitude data
neighborhoods.insert(neighborhoods.shape[1],'Cluster Labels', kmeans.labels_)

In [48]:
neighborhoods.head()

Unnamed: 0,City,ZipCode,Latitude,Longitude,Cluster Labels
0,Manhattan Beach,90266,33.887281,-118.410607,4
1,Cerritos,90703,33.867227,-118.063873,3
2,Claremont,91711,34.095726,-117.716532,0
3,Burbank,91502,34.18182,-118.30789,0
4,Agoura Hills,91301,34.144303,-118.777612,4


In [49]:
neighborhoods.loc[neighborhoods.City=='Rosemead']

Unnamed: 0,City,ZipCode,Latitude,Longitude,Cluster Labels
59,Rosemead,91770,34.080568,-118.076757,3


We can see that Rosemead, the city of the cafe's first location, belongs to cluster 3. 

We then output all cities that belong to cluster 3:

In [50]:
cluster3 = neighborhoods.loc[neighborhoods['Cluster Labels'] ==3,]
cluster3

Unnamed: 0,City,ZipCode,Latitude,Longitude,Cluster Labels
1,Cerritos,90703,33.867227,-118.063873,3
48,Monterey Park,91754,34.059339,-118.11885,3
59,Rosemead,91770,34.080568,-118.076757,3
62,San Gabriel,91776,34.109092,-118.111736,3
63,San Marino,91108,34.121223,-118.105728,3
69,Temple City,91780,34.10769,-118.057863,3
72,Walnut,91789,34.026361,-117.842324,3
76,Alhambra,91801,34.092557,-118.127123,3
77,Arcadia,91007,34.137636,-118.03865,3
78,Artesia,90701,33.860057,-118.07995,3
86,Diamond Bar,91765,33.999321,-117.830237,3


These 10 cities (excluding Rosemead) are the candidates for a second location once we filter on demographic similarity. Now we visualize the clustering results by marking each cluster with a different color.

In [263]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)
folium.Marker([34.080568,-118.076757], popup='Rosemead, CA').add_to(map_clusters)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, city, cluster in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['City'], neighborhoods['Cluster Labels']):
    label = folium.Popup(str(city) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Cluster 3, the group of cities most similar to Rosemead, CA in demographics, is shown in green. Its cities appear to fall into three geographic sub-groups, and we will create heatmaps for each.

### Obtaining Venue Data for Cluster 3

From the map above, we can further divide the candidates in Cluster 3 into 3 sub-groups. Let's first look at the cities closest to Rosemead, CA. 

In [64]:
candidates1 = cluster3.drop([1,59,72,78,86])
candidates1

Unnamed: 0,City,ZipCode,Latitude,Longitude,Cluster Labels
48,Monterey Park,91754,34.059339,-118.11885,3
62,San Gabriel,91776,34.109092,-118.111736,3
63,San Marino,91108,34.121223,-118.105728,3
69,Temple City,91780,34.10769,-118.057863,3
76,Alhambra,91801,34.092557,-118.127123,3
77,Arcadia,91007,34.137636,-118.03865,3


In [158]:
def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def get_venues_near_location(lat, lon, category, CLIENT_ID, CLIENT_SECRET, radius, limit):
    VERSION = '20201231'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        CLIENT_ID, CLIENT_SECRET, VERSION, lat, lon, category, radius, limit)
    try:
        results1 = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'], 
                   item['venue']['name'], 
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng'])) for item in results1]
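    except (KeyError, IndexError, ValueError, requests.RequestException):
        # failed request or unexpected response shape: treat as no venues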
        venues = []
    return venues

In [164]:
def get_restaurants(lats, lons):
    restaurants = {}
    location_restaurants = []
    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        venues = get_venues_near_location(lat, lon, food_category, CLIENT_ID, CLIENT_SECRET, radius=1600, limit=200)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1])
            area_restaurants.append(restaurant)
            restaurants[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, location_restaurants

Now we can run the functions for the first group of candidates.

In [143]:
latitudes1 = candidates1['Latitude']
longitudes1 = candidates1['Longitude']

In [165]:
location_restaurants1 = []
restaurants1, location_restaurants1 = get_restaurants(latitudes1, longitudes1)

Obtaining venues around candidate locations: . . . . . . done.


We'll visualize the locations of the obtained venues using Folium maps.

In [264]:
map1 = folium.Map(location=[34.080568,-118.076757],zoom_start = 13) 
folium.Marker([34.080568,-118.076757], popup='Rosemead, CA').add_to(map1)
for res in restaurants1.values():
    lat = res[2]; lon = res[3]
    color = 'red'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map1)
map1

Now, let's create a heatmap to better visualize the concentration of restaurants. 

In [170]:
from folium import plugins
from folium.plugins import HeatMap

In [265]:
restaurant_latlons1 = [[res[2], res[3]] for res in restaurants1.values()]

map1heat = folium.Map(location=[34.107690,-118.057863],zoom_start = 12) # start the map centered at Temple City; Rosemead, CA is marked below
folium.Marker([34.080568,-118.076757], popup='Rosemead, CA').add_to(map1heat)
HeatMap(restaurant_latlons1).add_to(map1heat)
map1heat

It looks like Temple City and Monterey Park already have a high concentration of restaurants, but parts of Alhambra and San Marino could still have potential for the cafe's second location.

We still have four other cities in our cluster 3 from the demographics analysis to obtain restaurant information for.

In [185]:
candidates2 = cluster3.drop([48,59,62,63,69,72,76,77,86])
candidates3 = cluster3.drop([1,48,59,62,63,69,76,77,78])

In [186]:
candidates2

Unnamed: 0,City,ZipCode,Latitude,Longitude,Cluster Labels
1,Cerritos,90703,33.867227,-118.063873,3
78,Artesia,90701,33.860057,-118.07995,3


In [192]:
candidates3

Unnamed: 0,City,ZipCode,Latitude,Longitude,Cluster Labels
72,Walnut,91789,34.026361,-117.842324,3
86,Diamond Bar,91765,33.999321,-117.830237,3


In [184]:
latitudes2 = candidates2['Latitude']
longitudes2 = candidates2['Longitude']
latitudes3 = candidates3['Latitude']
longitudes3 = candidates3['Longitude']
restaurants2, location_restaurants2 = get_restaurants(latitudes2, longitudes2)
restaurants3, location_restaurants3 = get_restaurants(latitudes3, longitudes3)

Obtaining venues around candidate locations: . . done.
Obtaining venues around candidate locations: . . done.


We will create maps for the obtained venues.

In [242]:
# create a map centered at Artesia  
map2 = folium.Map(location=[33.860057,-118.079950],zoom_start = 14) 
for res in restaurants2.values():
    lat = res[2]; lon = res[3]
    color = 'red'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map2)

In [254]:
# create a map centered in Walnut
map3 = folium.Map(location=[34.026361,-117.842324],zoom_start = 13) 
for res in restaurants3.values():
    lat = res[2]; lon = res[3]
    color = 'red'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map3)

In [266]:
map2

In [267]:
map3

We will also create the heatmaps.

In [268]:
restaurant_latlons2 = [[res[2], res[3]] for res in restaurants2.values()]

map2heat = folium.Map(location=[33.860057,-118.079950],zoom_start = 14) 
HeatMap(restaurant_latlons2).add_to(map2heat)
map2heat

In [269]:
restaurant_latlons3 = [[res[2], res[3]] for res in restaurants3.values()]

map3heat = folium.Map(location=[34.026361,-117.842324],zoom_start = 13) 
HeatMap(restaurant_latlons3).add_to(map3heat)
map3heat

Let's also create a heatmap with all 10 candidates.

In [215]:
all_restaurant_latlons = []
all_restaurant_latlons = restaurant_latlons1 + restaurant_latlons2 + restaurant_latlons3

In [270]:
heatmap = folium.Map(location=[33.9792,-118.0328],zoom_start = 11) 
folium.Marker([34.080568,-118.076757], popup='Rosemead, CA').add_to(heatmap)
HeatMap(all_restaurant_latlons).add_to(heatmap)
heatmap

It seems that the Diamond Bar and Walnut sub-cluster has a slightly lower concentration of restaurants.
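The comparison above is read off the heatmaps by eye. A simple numeric companion, sketched below, is the number of unique venues returned per city-hall query in each sub-group. The helper `venues_per_city` and the toy dictionaries are hypothetical stand-ins with invented venue ids and names; in the notebook the inputs would be `restaurants2`, `restaurants3`, and the matching candidate city lists.

```python
def venues_per_city(restaurants_by_group, cities_by_group):
    """Average number of unique venues per city-hall query for each sub-group."""
    return {
        group: len(venues) / len(cities_by_group[group])
        for group, venues in restaurants_by_group.items()
    }

# Hypothetical toy data shaped like the dictionaries returned by get_restaurants:
# venue id -> (venue id, name, latitude, longitude).
toy_restaurants = {
    "Cerritos/Artesia": {
        "v1": ("v1", "Venue 1", 33.86, -118.07),
        "v2": ("v2", "Venue 2", 33.87, -118.06),
        "v3": ("v3", "Venue 3", 33.86, -118.08),
        "v4": ("v4", "Venue 4", 33.87, -118.08),
    },
    "Walnut/Diamond Bar": {
        "v5": ("v5", "Venue 5", 34.03, -117.84),
        "v6": ("v6", "Venue 6", 34.00, -117.83),
    },
}
toy_cities = {
    "Cerritos/Artesia": ["Cerritos", "Artesia"],
    "Walnut/Diamond Bar": ["Walnut", "Diamond Bar"],
}

density = venues_per_city(toy_restaurants, toy_cities)
print(density)
```

Because each query returns a limited number of venues, such counts may saturate in dense areas, so they complement the heatmaps rather than replace them.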

<a id="item4"></a>

## Results and Discussion

In our analysis, we first narrowed our area of interest to cities that are similar to Rosemead, CA from a demographics perspective. We did this because certain Taiwanese dishes, most famously stinky tofu, are not widely popular among Americans. We would want to establish a second location in a neighborhood that is open to such unfamiliar foods.

Our analysis shows that several cities in LA county are very similar to Rosemead, CA from a demographics perspective. These cities represent a safer choice for a second cafe location, as similar demographics suggest that many residents are already familiar with Taiwanese cuisine and would be willing to visit a Taiwanese cafe.

For these candidates, we created heatmaps of existing restaurants to get a sense of the competition. From the results, the cities of Walnut and Diamond Bar appear to have the lowest concentration, which makes them the best choices in that respect. After those two cities, parts of Alhambra and San Marino could still have potential for the cafe's second location. The sub-cluster of Artesia and Cerritos appears to have the highest concentration of existing restaurants and would therefore be placed last.

These inferences, of course, do not imply that these cities are actually the optimal locations. The purpose of this analysis was to identify a few potential choices based on the stated assumptions and criteria (i.e. demographics and concentration of restaurants). It is entirely possible that some cities, despite not having demographics similar to Rosemead's, would welcome a Taiwanese eatery because of its uniqueness, but the stakeholders would have to bear that risk in mind. The recommended cities therefore serve only as a starting point for a more detailed analysis that takes more criteria into account.

<a id="item5"></a>

## Conclusion

The purpose of this project was to identify cities in LA county that could offer good opportunities for the owners of Cafe ABC to open a second location. Because the original location in Rosemead, CA has flourished over the past few years, we clustered the cities in LA county to find those with demographics similar to Rosemead's. Using k-means clustering, we found 10 cities to analyze further. We then used the Foursquare API to generate heatmaps showing the concentration of restaurants in these cities. From these, we identified Walnut and Diamond Bar as the most promising cities, since they are similar to Rosemead in demographics and have a relatively low concentration of restaurants.

The final decision on the optimal restaurant location will be made by the stakeholders, taking into consideration additional factors such as the attractiveness of each location (proximity to a park or water), noise levels and proximity to major roads, real estate availability, prices, etc.