#### Introduction

## Definition of the problem

A business owner wants to open a new coffee shop in San Francisco. We use Foursquare location data to choose the most suitable neighborhood in the city for it.

The goal is to recommend the San Francisco neighborhood that is most suitable for opening a new coffee shop. The right location is key to the shop's success.
The approach is relevant to any business owner who operates coffee shops or other venues such as restaurants, bars, and pubs and is looking for a location for a new one.

The proposed solution is to gather data about the city's neighborhoods and about the coffee shops in them. We then cluster the neighborhoods and choose the best one based on a criterion.


#### Data

Web page: http://www.healthysf.org/bdi/outcomes/zipmap.htm, which lists the neighborhoods in San Francisco.
Foursquare.com: location data from Foursquare about venues, and coffee shops specifically.

For this project we scrape the neighborhood data for San Francisco from the web page and then add the geographic coordinates of each neighborhood.
After that, we retrieve Foursquare location data for each neighborhood in order to cluster the neighborhoods.


#### Methodology

First we scrape the data about the neighborhoods in San Francisco from the web and load it into a dataframe.
We add the corresponding geographic coordinates to each neighborhood.
For each neighborhood, we gather the Foursquare location data for nearby venues and add it to the dataset.

We then keep only the records with Venue Category "Coffee Shop", creating a new dataset that contains only the coffee shops returned by Foursquare and the data about them.
We group the new dataset into 5 clusters and add the cluster label to each row of the coffee shop table.

The algorithm used for the project is k-means, which groups records by the similarity of their values.
It assigns each neighborhood to a cluster, which helps us determine which cluster to choose a neighborhood from.
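
To make the clustering step more concrete, here is a minimal sketch of the idea behind k-means (Lloyd's algorithm) in plain Python. It is not the scikit-learn implementation used later in the notebook, and the function name `kmeans`, its parameters, and the sample points are assumptions chosen for illustration only.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids

# Illustrative (latitude, longitude) pairs, not the project's data.
points = [(37.78, -122.42), (37.78, -122.41), (37.79, -122.42),
          (37.72, -122.44), (37.73, -122.46), (37.72, -122.45)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The label numbers themselves carry no meaning; only which points share a label matters.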



First, install and import the needed libraries and packages.

In [1]:
import pandas as pd
import requests
import folium
import random
!conda install -c conda-forge beautifulsoup4 lxml --yes 
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge geocoder --yes
from bs4 import BeautifulSoup
import json
import numpy as np
import urllib
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import geocoder 
print ("Libraries imported")


We get the data about the neighborhoods in San Francisco from a web page using BeautifulSoup. We then read the data into a pandas data frame and look at its first five records.

In [2]:
url = 'http://www.healthysf.org/bdi/outcomes/zipmap.htm'

# Download the page once and parse it
soup = BeautifulSoup(requests.get(url).text, 'lxml')
table = soup.find_all('table')
df = pd.read_html(str(table))
df = pd.DataFrame(df[4])

df.columns = df.iloc[0]
df = df.iloc[1:-1, :-1]
sf_data = df
sf_data.head()

Unnamed: 0,Zip Code,Neighborhood
1,94102,Hayes Valley/Tenderloin/North of Market
2,94103,South of Market
3,94107,Potrero Hill
4,94108,Chinatown
5,94109,Polk/Russian Hill (Nob Hill)
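
The scraping step boils down to reading table rows and promoting the first row to column names. The sketch below shows that idea offline, swapping in the standard library's `html.parser` for BeautifulSoup and `pd.read_html`. The class name `TableRows` and the small HTML string are hypothetical; the markup only imitates the shape of the table shown above.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical markup shaped like the table shown above.
html = """<table>
<tr><td>Zip Code</td><td>Neighborhood</td></tr>
<tr><td>94102</td><td>Hayes Valley/Tenderloin/North of Market</td></tr>
<tr><td>94103</td><td>South of Market</td></tr>
</table>"""

parser = TableRows()
parser.feed(html)

# Promote the first row to a header, as df.columns = df.iloc[0] does.
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records[0])
```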


We retrieve the geographic coordinates for each neighborhood based on its zip code and check the new dataset.

In [3]:
!pip install uszipcode
from uszipcode import SearchEngine

search = SearchEngine(simple_zipcode=True)

latitude = []
longitude = []

for index, row in sf_data.iterrows():
    zipcode = search.by_zipcode(row["Zip Code"]).to_dict()
    latitude.append(zipcode.get("lat"))
    longitude.append(zipcode.get("lng"))

sf_data["Latitude"] = latitude
sf_data["Longitude"] = longitude

sf_data.head()


Unnamed: 0,Zip Code,Neighborhood,Latitude,Longitude
1,94102,Hayes Valley/Tenderloin/North of Market,37.78,-122.42
2,94103,South of Market,37.78,-122.41
3,94107,Potrero Hill,37.77,-122.39
4,94108,Chinatown,37.791,-122.409
5,94109,Polk/Russian Hill (Nob Hill),37.79,-122.42


## Foursquare data
We set our credentials for the Foursquare API and use San Francisco as the address to geocode.


In [4]:
from pandas.io.json import json_normalize
CLIENT_ID = 'ZNIGMWEGB3B2PRX1KKCUDR52MK2HBXMHHHXZUKK2D1KJDOIS' # your Foursquare ID
CLIENT_SECRET = 'FKC5BPFX5A3DLZJNDBC413YNIWXGWJI53FFEE3CFXO4MHENT' # your Foursquare Secret
VERSION = '20190531'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

address = 'San Francisco, SF'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

Your credentials:
CLIENT_ID: ZNIGMWEGB3B2PRX1KKCUDR52MK2HBXMHHHXZUKK2D1KJDOIS
CLIENT_SECRET:FKC5BPFX5A3DLZJNDBC413YNIWXGWJI53FFEE3CFXO4MHENT
37.7792808 -122.4192363
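
As a side note on handling API credentials, one common alternative to writing them into a cell is to read them from environment variables and to mask them when printing. The sketch below assumes hypothetical variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; the helper functions and the demo values are illustrative and not part of the Foursquare API.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) or raise if either is missing."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Set the environment variable {missing} first") from None

def mask(secret, visible=4):
    """Show only the first few characters when logging a credential."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Placeholder values for demonstration only.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id-123",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret-456",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(client_id))
```

Passing a plain dictionary as `env` keeps the example runnable without touching the real environment.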


To retrieve the venues around each neighborhood, we create a function that gets the top venues for each neighborhood from Foursquare and adds them to a new dataset called sf_coffe_shops.


In [5]:
def getNearbyBars(names, latitudes, longitudes, radius=6000):
    bars_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        bars_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    # flatten the per-neighborhood lists into one list of rows
    nearby_bars = pd.DataFrame([item for venue_list in bars_list for item in venue_list])
    nearby_bars.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return nearby_bars


sf_coffe_shops = getNearbyBars(names = sf_data['Neighborhood'],
                                   latitudes = sf_data['Latitude'],
                                   longitudes = sf_data['Longitude']
                                  )                                  
sf_coffe_shops.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hayes Valley/Tenderloin/North of Market,37.78,-122.42,Louise M. Davies Symphony Hall,37.777976,-122.420157,Concert Hall
1,Hayes Valley/Tenderloin/North of Market,37.78,-122.42,Herbst Theater,37.779548,-122.420953,Concert Hall
2,Hayes Valley/Tenderloin/North of Market,37.78,-122.42,Asian Art Museum,37.780178,-122.416505,Art Museum
3,Hayes Valley/Tenderloin/North of Market,37.78,-122.42,War Memorial Opera House,37.778601,-122.420816,Opera House
4,Hayes Valley/Tenderloin/North of Market,37.78,-122.42,SFJazz Center,37.77635,-122.421539,Jazz Club
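
The function above indexes straight into the response, so a venue without a category or a response without groups would raise an error. The following sketch shows one way to flatten the same fields more defensively. The helper name `venues_to_rows`, the "Unknown" fallback, and the sample payload are assumptions; the payload only imitates the fields the function above reads and is not a real API response.

```python
def venues_to_rows(name, lat, lng, payload):
    """Flatten one explore response into rows; tolerate missing fields."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]  # empty list -> placeholder
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"),
                     categories[0].get("name", "Unknown")))
    return rows

# Hypothetical payload with the same nesting the function above relies on.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Blue Bottle Coffee",
               "location": {"lat": 37.776286, "lng": -122.416867},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Venue Without Category",
               "location": {"lat": 37.78, "lng": -122.42},
               "categories": []}},
]}]}}

rows = venues_to_rows("South of Market", 37.78, -122.41, sample)
for row in rows:
    print(row)
```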


We keep only the venues that Foursquare returned in the category Coffee Shop and check how many there are.


In [8]:
# .copy() gives an independent frame, so the later column insert does not act on a slice
sf_coffe_shops = sf_coffe_shops.loc[sf_coffe_shops['Venue Category'] == 'Coffee Shop'].copy()
sf_coffe_shops.shape

(27, 7)

Foursquare returned 27 coffee shop records. From here on we cluster only the neighborhoods in this dataset, that is, those that have at least one coffee shop. The map below shows how these coffee shops are distributed.


In [9]:

map_sf = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, venue, neighborhood in zip(sf_coffe_shops['Venue Latitude'], sf_coffe_shops['Venue Longitude'], sf_coffe_shops['Venue'], sf_coffe_shops['Neighborhood']):
    label = '{}, {}'.format(neighborhood, venue)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_sf)
    
map_sf


We use the k-means algorithm to cluster the neighborhoods. After dropping the name, venue, and category columns, the only features left are the neighborhood latitude and longitude. We set k to 5 to obtain 5 clusters, then add a new column to the dataset with the cluster label of each neighborhood.


In [10]:
k = 5
# Keep only the numeric neighborhood coordinates as clustering features
sf_clstering = sf_coffe_shops.drop(['Neighborhood', 'Venue Latitude', 'Venue Longitude', 'Venue', 'Venue Category'], axis=1)
kmeans = KMeans(n_clusters=k, random_state=0).fit(sf_clstering)
# Attach the cluster label of each row as the first column
sf_coffe_shops.insert(0, 'Cluster Labels', kmeans.labels_)
sf_coffe_shops

Unnamed: 0,Cluster Labels,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
8,1,Hayes Valley/Tenderloin/North of Market,37.78,-122.42,Blue Bottle Coffee,37.776286,-122.416867,Coffee Shop
9,1,Hayes Valley/Tenderloin/North of Market,37.78,-122.42,Blue Bottle Coffee,37.77643,-122.423224,Coffee Shop
10,1,Hayes Valley/Tenderloin/North of Market,37.78,-122.42,Ritual Coffee Roasters,37.776476,-122.424281,Coffee Shop
29,1,Hayes Valley/Tenderloin/North of Market,37.78,-122.42,Sightglass Coffee,37.777001,-122.408519,Coffee Shop
34,1,South of Market,37.78,-122.41,Sightglass Coffee,37.777001,-122.408519,Coffee Shop
40,1,South of Market,37.78,-122.41,Blue Bottle Coffee,37.776286,-122.416867,Coffee Shop
79,1,Potrero Hill,37.77,-122.39,Blue Bottle Coffee,37.782497,-122.392982,Coffee Shop
85,1,Potrero Hill,37.77,-122.39,Sightglass Coffee,37.777001,-122.408519,Coffee Shop
95,1,Chinatown,37.791,-122.409,The Coffee Movement,37.794687,-122.410299,Coffee Shop
96,1,Chinatown,37.791,-122.409,Blue Bottle Coffee,37.792771,-122.404833,Coffee Shop
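
The table above lists some venues under several neighborhoods; Sightglass Coffee at (37.777001, -122.408519), for example, appears under three of them. A likely reason is that the default search radius of 6000 m in getNearbyBars is larger than the distance between neighboring centers. The sketch below checks this for two centers from the table using the haversine formula; the function name `haversine_m` and the Earth radius of 6,371 km are assumptions of this example.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_m * asin(sqrt(a))

# Neighborhood centers taken from the table above.
hayes_valley = (37.78, -122.42)
south_of_market = (37.78, -122.41)
d = haversine_m(*hayes_valley, *south_of_market)
print(round(d), "m apart; search radius is 6000 m")
```

The two centers are well under 1 km apart, so their search circles overlap almost completely, and the per-cluster counts later in the notebook count rows rather than distinct coffee shops.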



We plot the neighborhoods on a map, with each cluster shown in a different color.


In [11]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sf_coffe_shops['Neighborhood Latitude'], sf_coffe_shops['Neighborhood Longitude'], sf_coffe_shops['Neighborhood'], sf_coffe_shops['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
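
A note on the color lookup in the cell above: it uses `rainbow[cluster-1]`, so label 0 reads index -1, which Python resolves to the last list element. Each label still gets its own color. The sketch below illustrates this with a small palette built from the standard library's `colorsys` instead of matplotlib's rainbow colormap; the function name `cluster_colors` and the saturation and brightness values are assumptions.

```python
import colorsys

def cluster_colors(k):
    """Return k evenly spaced hues as hex strings, one per cluster label."""
    colors_hex = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.9, 0.9)
        colors_hex.append(
            "#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255))
        )
    return colors_hex

palette = cluster_colors(5)
for label in range(5):
    # palette[label - 1] mirrors the lookup in the cell above;
    # for label 0 it wraps around to the last entry.
    print(label, palette[label - 1], palette[label])
```

Indexing with `palette[label]` directly gives the same one-color-per-cluster result without relying on the wrap-around.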

## Results

We check how many coffee shops fall into each cluster and choose the cluster with the second or third largest number of coffee shops. Neighborhoods in the cluster with the most coffee shops face strong competition, whereas neighborhoods in the cluster with the second or third largest number already have coffee shops, but not as many.

The distribution of the coffee shops across the clusters is shown below.


In [12]:
sf_coffe_shops['Cluster Labels'].value_counts()

1    12
3     5
2     4
4     3
0     3
Name: Cluster Labels, dtype: int64
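
The selection rule described above can be written down as a short ranking. The sketch below copies the counts from the output above into a dictionary (the variable names are illustrative) and picks the busiest, second, and third clusters.

```python
# Counts copied from the value_counts() output above: {label: rows}.
counts = {1: 12, 3: 5, 2: 4, 4: 3, 0: 3}

# Rank labels from most to fewest coffee shop rows.
ranked = sorted(counts, key=counts.get, reverse=True)
busiest, runner_up, third = ranked[0], ranked[1], ranked[2]
print("busiest:", busiest, "second:", runner_up, "third:", third)
```

With these counts the second largest cluster is label 3 and the third largest is label 2.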


## Conclusion

We choose cluster number 3 and inspect the neighborhoods in it.

The proposal is to open the new coffee shop in the St. Francis Wood neighborhood.

Note that the outcome of the above algorithm and methodology depends heavily on the data returned by Foursquare. The project demonstrates a data science approach to solving a problem using Foursquare location data.


In [13]:
final_result = sf_coffe_shops.loc[sf_coffe_shops['Cluster Labels'] == 3]
final_result

Unnamed: 0,Cluster Labels,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
189,3,Ingelside-Excelsior/Crocker-Amazon,37.72,-122.44,Pinhole Coffee,37.739591,-122.418991,Coffee Shop
206,3,Ingelside-Excelsior/Crocker-Amazon,37.72,-122.44,Four Barrel Coffee,37.728967,-122.40384,Coffee Shop
207,3,Ingelside-Excelsior/Crocker-Amazon,37.72,-122.44,Philz Coffee,37.751143,-122.438361,Coffee Shop
464,3,St. Francis Wood/Miraloma/West Portal,37.73,-122.46,Philz Coffee,37.751143,-122.438361,Coffee Shop
479,3,St. Francis Wood/Miraloma/West Portal,37.73,-122.46,Pinhole Coffee,37.739591,-122.418991,Coffee Shop
