# IBM Applied Data Science Capstone Course by Coursera (CCalamera)
## Week 5 Final Report

### Opening a New Music Venue in Staten Island, New York

Staten Island, New York, was home to many fantastic music venues when I was younger and playing in a fun cover band.  The borough no longer has the musical presence it once had, so it would be interesting to see whether a music venue could open and survive on Staten Island today.

I believe it would be best to look for areas with colleges, coffee shops and restaurants that could attract music lovers to a small coffeehouse-type venue on weekday nights and weekends.

Some thoughts on putting together this type of project:

1. Build a dataframe of neighborhoods in Staten Island, New York by web scraping the data from the Wikipedia page
2. Get the geographical coordinates of the neighborhoods
3. Obtain the venue data for the neighborhoods from the Foursquare API
4. Explore and cluster the neighborhoods
5. Select the best cluster/neighborhoods to open a new coffeehouse-type music venue!

In [40]:
!pip install beautifulsoup4
!pip install lxml
!pip install geocoder
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import geocoder # to get coordinates

# libraries for displaying images and HTML
from IPython.display import Image, display_html
from IPython.core.display import HTML

# transforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Folium installed')
print('Libraries imported.')

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Folium installed
Libraries imported.


## Let's scrape the Wikipedia page for Staten Island into a dataframe format for our work

In [41]:
# send the GET request
source = requests.get('https://en.wikipedia.org/wiki/List_of_Staten_Island_neighborhoods').text

In [42]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(source, 'html.parser')

In [43]:
# create a list to store neighborhood data
neighborhoodList = []

In [44]:
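# append the text of every list item in the page body to the list
# (this also picks up list items that are not neighborhoods; they are dropped later)
for row in soup.find_all("div", class_="mw-parser-output")[0].find_all("li"):
    neighborhoodList.append(row.text)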

In [45]:
# create a new DataFrame from the list
si_df = pd.DataFrame({"Neighborhood": neighborhoodList})

si_df.head()

| | Neighborhood |
|---|---|
| 0 | Annadale |
| 1 | Arden Heights |
| 2 | Arlington |
| 3 | Arrochar |
| 4 | Bay Terrace |


In [46]:
# print the number of rows of the dataframe
si_df.shape

(151, 1)

## Let's get the geographical coordinates for our neighborhoods

In [47]:
# define a function to get coordinates:
def get_latlng(neighborhood, max_attempts=5):
    # retry a bounded number of times so a failing lookup cannot loop forever
    for _ in range(max_attempts):
        g = geocoder.arcgis('{}, Staten Island, New York'.format(neighborhood))
        if g.latlng is not None:
            return g.latlng
    # no result: return empty values so the coordinate DataFrame still gets two columns
    return [None, None]

In [48]:
# call the function to get the coordinates and store to a new list
coords = [ get_latlng(neighborhood) for neighborhood in si_df["Neighborhood"].tolist() ]

Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Read timed out. (read timeout=5.0)
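
The read timeout reported above is the kind of transient failure that a bounded retry with a pause between attempts can absorb. The following is a minimal sketch under stated assumptions: `geocode_with_retry`, `max_attempts` and `base_delay` are hypothetical names introduced here, and `lookup` stands in for a real call such as `lambda q: geocoder.arcgis(q).latlng`.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=4, base_delay=1.0):
    """Call lookup(query) until it yields coordinates or attempts run out.

    lookup is any callable returning [lat, lng] or None; it is a stand-in
    for a real geocoder call and is an assumption of this sketch.
    """
    for attempt in range(max_attempts):
        try:
            coords = lookup(query)
        except Exception:
            # treat network errors (e.g. timeouts) like an empty result
            coords = None
        if coords is not None:
            return coords
        if attempt < max_attempts - 1:
            # wait 1x, 2x, 4x ... the base delay before the next attempt
            time.sleep(base_delay * 2 ** attempt)
    return None


# usage with a fake lookup that fails twice, then succeeds
calls = {"n": 0}


def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [40.5492, -74.1747]


result = geocode_with_retry(flaky_lookup, "Annadale, Staten Island, New York", base_delay=0.0)
print(result, "after", calls["n"], "calls")
```

Passing the lookup in as a callable keeps the retry logic testable without network access.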


In [49]:
coords

[[40.54920585567783, -74.17471027206285],
 [40.55988912771604, -74.1987876596362],
 [40.63718816631124, -74.16746062705108],
 [40.64242000000007, -74.07526999999999],
 [40.55452574359662, -74.13585209231593],
 [40.61059356999272, -74.17965179803514],
 [40.504025650927595, -74.2432936296758],
 [40.64242000000007, -74.07526999999999],
 [40.62122100000001, -74.12915204536496],
 [40.54941463169402, -74.21683765550642],
 [40.50430584927211, -74.24434442608066],
 [40.61238824039109, -74.0720961742667],
 [40.59861039506885, -74.10096738887606],
 [40.58799912088246, -74.10066030168086],
 [40.64242000000007, -74.07526999999999],
 [40.61501987643392, -74.10035265223087],
 [40.55725144521767, -74.16715049999999],
 [40.60403256961371, -74.1003419567669],
 [40.60659000000004, -74.06063999999998],
 [40.62531994777578, -74.15070002943034],
 [40.64242000000007, -74.07526999999999],
 [40.60223746946969, -74.08402151178691],
 [40.54790743994775, -74.14344329329849],
 [40.62931200000001, -74.11077],
 [40

In [50]:
# create a temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [51]:
# merge the coordinates into the original dataframe
si_df['Latitude'] = df_coords['Latitude']
si_df['Longitude'] = df_coords['Longitude']

In [52]:
# check out the neighborhoods and the coordinates
pd.set_option('display.max_rows', None)
print(si_df.shape)
si_df

(151, 3)


| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Annadale | 40.549206 | -74.17471 |
| 1 | Arden Heights | 40.559889 | -74.198788 |
| 2 | Arlington | 40.637188 | -74.167461 |
| 3 | Arrochar | 40.64242 | -74.07527 |
| 4 | Bay Terrace | 40.554526 | -74.135852 |
| 5 | Bloomfield | 40.610594 | -74.179652 |
| 6 | Brighton Heights | 40.504026 | -74.243294 |
| 7 | Bulls Head | 40.64242 | -74.07527 |
| 8 | Castleton Corners | 40.621221 | -74.129152 |
| 9 | Charleston | 40.549415 | -74.216838 |
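
In the output above, Arrochar and Bulls Head come back with identical coordinates, which may indicate that the geocoder fell back to a generic location for at least one of them. A quick way to surface such cases is to group neighborhoods by coordinate pair; this is a sketch using the ten rows shown, and `shared_coordinates` is a hypothetical helper name.

```python
from collections import defaultdict

# the ten rows displayed in the output
rows = [
    ("Annadale", 40.549206, -74.17471),
    ("Arden Heights", 40.559889, -74.198788),
    ("Arlington", 40.637188, -74.167461),
    ("Arrochar", 40.64242, -74.07527),
    ("Bay Terrace", 40.554526, -74.135852),
    ("Bloomfield", 40.610594, -74.179652),
    ("Brighton Heights", 40.504026, -74.243294),
    ("Bulls Head", 40.64242, -74.07527),
    ("Castleton Corners", 40.621221, -74.129152),
    ("Charleston", 40.549415, -74.216838),
]


def shared_coordinates(rows, ndigits=5):
    """Return {(lat, lng): [names]} for coordinate pairs used by more than one row."""
    groups = defaultdict(list)
    for name, lat, lng in rows:
        groups[(round(lat, ndigits), round(lng, ndigits))].append(name)
    return {coords: names for coords, names in groups.items() if len(names) > 1}


suspect = shared_coordinates(rows)
print(suspect)
```

Neighborhoods flagged this way are worth checking by hand before their venue data is trusted.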


## Some rows need to be removed, as the scrape picked up errant data

In [53]:
# keep the first 67 rows (the neighborhoods) and drop the errant rows 67 through 150
si_df = si_df.drop(si_df.index[67:151])

In [54]:
si_df

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Annadale | 40.549206 | -74.17471 |
| 1 | Arden Heights | 40.559889 | -74.198788 |
| 2 | Arlington | 40.637188 | -74.167461 |
| 3 | Arrochar | 40.64242 | -74.07527 |
| 4 | Bay Terrace | 40.554526 | -74.135852 |
| 5 | Bloomfield | 40.610594 | -74.179652 |
| 6 | Brighton Heights | 40.504026 | -74.243294 |
| 7 | Bulls Head | 40.64242 | -74.07527 |
| 8 | Castleton Corners | 40.621221 | -74.129152 |
| 9 | Charleston | 40.549415 | -74.216838 |


In [55]:
# check out the neighborhoods and the coordinates
print(si_df.shape)
si_df

(67, 3)


| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Annadale | 40.549206 | -74.17471 |
| 1 | Arden Heights | 40.559889 | -74.198788 |
| 2 | Arlington | 40.637188 | -74.167461 |
| 3 | Arrochar | 40.64242 | -74.07527 |
| 4 | Bay Terrace | 40.554526 | -74.135852 |
| 5 | Bloomfield | 40.610594 | -74.179652 |
| 6 | Brighton Heights | 40.504026 | -74.243294 |
| 7 | Bulls Head | 40.64242 | -74.07527 |
| 8 | Castleton Corners | 40.621221 | -74.129152 |
| 9 | Charleston | 40.549415 | -74.216838 |


In [56]:
# save the DataFrame as CSV file
si_df.to_csv("si_df.csv", index=False)

## Create a map of Staten Island with neighborhoods superimposed on top

In [57]:
# get the coordinates of Staten Island
address = 'Staten Island, New York'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Staten Island, New York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Staten Island, New York are 40.5834557, -74.1496048.


In [58]:
# create map of Staten Island using latitude and longitude values
map_si = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(si_df['Latitude'], si_df['Longitude'], si_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_si)  
    
map_si

In [59]:
# save the map as HTML file
map_si.save('map_si.html')

## Let's use the Foursquare API to start exploring the neighborhoods in Staten Island, NY

In [60]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (redacted; never publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (redacted)
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


In [61]:
si_df.loc[0, 'Neighborhood']

'Annadale'

In [62]:
neighborhood_latitude = si_df.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = si_df.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = si_df.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Annadale are 40.54920585567783, -74.17471027206285.


## Let's look at some of the venues in the Staten Island, NY area

In [63]:
address = 'Staten Island, NY'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

40.5834557 -74.1496048


In [64]:
radius = 5000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(si_df['Latitude'], si_df['Longitude'], si_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
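
The loop above indexes straight into the response (`["response"]['groups'][0]['items']` and `['categories'][0]`), so an error payload or a venue without categories would raise an exception partway through. Below is a hedged sketch of a more defensive parser that assumes the same response shape the loop relies on. `extract_venues` is a hypothetical helper, and "Uncategorized Place" in the sample payload is made up for illustration.

```python
def extract_venues(payload, neighborhood, lat, lng):
    """Flatten an explore-style payload into tuples, tolerating missing keys."""
    groups = payload.get("response", {}).get("groups") or []
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # a venue without categories gets None instead of raising IndexError
        category = categories[0].get("name") if categories else None
        rows.append((
            neighborhood,
            lat,
            lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            category,
        ))
    return rows


# sample payload shaped like the one the loop expects
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Annadale Diner",
               "location": {"lat": 40.542079, "lng": -74.177325},
               "categories": [{"name": "Diner"}]}},
    {"venue": {"name": "Uncategorized Place",  # hypothetical venue
               "location": {"lat": 40.5, "lng": -74.1},
               "categories": []}},
]}]}}

parsed = extract_venues(sample, "Annadale", 40.549206, -74.17471)
print(parsed)
```

In the notebook this would be called as `venues.extend(extract_venues(requests.get(url).json(), neighborhood, lat, long))`.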

In [65]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(6700, 7)


| | Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|---|
| 0 | Annadale | 40.549206 | -74.17471 | Pastosa Ravioli | 40.54531 | -74.165364 | Gourmet Shop |
| 1 | Annadale | 40.549206 | -74.17471 | Campania Coal Fired Pizza | 40.543206 | -74.164033 | Pizza Place |
| 2 | Annadale | 40.549206 | -74.17471 | Ralph's Ices | 40.559805 | -74.169273 | Ice Cream Shop |
| 3 | Annadale | 40.549206 | -74.17471 | Annadale Diner | 40.542079 | -74.177325 | Diner |
| 4 | Annadale | 40.549206 | -74.17471 | Holiday Beverage | 40.542539 | -74.165401 | Liquor Store |


In [66]:
venues_df.groupby(["Neighborhood"]).count()

| Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|
| Annadale | 100 | 100 | 100 | 100 | 100 | 100 |
| Arden Heights | 100 | 100 | 100 | 100 | 100 | 100 |
| Arlington | 100 | 100 | 100 | 100 | 100 | 100 |
| Arrochar | 100 | 100 | 100 | 100 | 100 | 100 |
| Bay Terrace | 100 | 100 | 100 | 100 | 100 | 100 |
| Bloomfield | 100 | 100 | 100 | 100 | 100 | 100 |
| Brighton Heights | 100 | 100 | 100 | 100 | 100 | 100 |
| Bulls Head | 100 | 100 | 100 | 100 | 100 | 100 |
| Castleton Corners | 100 | 100 | 100 | 100 | 100 | 100 |
| Charleston | 100 | 100 | 100 | 100 | 100 | 100 |
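
Every neighborhood shown returns exactly 100 venues, which matches the `LIMIT` cap, so the 5000 m search radius is probably wide enough that neighboring searches overlap. One way to check is to compare the distance between two neighborhood centers with the radius. The sketch below uses the haversine formula and the Annadale and Arden Heights coordinates from the notebook; `haversine_km` is a hypothetical helper and the Earth radius of 6371 km is the usual approximation.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


annadale = (40.549206, -74.17471)
arden_heights = (40.559889, -74.198788)

distance_km = haversine_km(*annadale, *arden_heights)
radius_km = 5.0  # the notebook's radius of 5000 metres

# two search circles overlap when their centres are closer than twice the radius
circles_overlap = distance_km < 2 * radius_km
print(round(distance_km, 2), circles_overlap)
```

When the circles overlap this much, adjacent neighborhoods share many of the same venues, so their category frequencies will look alike. A smaller radius would make the neighborhoods easier to tell apart.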


In [67]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

# print out the first 50 categories
venues_df['VenueCategory'].unique()[:50]

There are 169 unique categories.


array(['Gourmet Shop', 'Pizza Place', 'Ice Cream Shop', 'Diner',
       'Liquor Store', 'Gym / Fitness Center', 'Sushi Restaurant',
       'Wine Shop', 'Italian Restaurant', 'Restaurant', 'Pharmacy',
       'Bakery', 'Coffee Shop', 'Sports Bar', 'Bagel Shop', 'Gym',
       'Japanese Restaurant', 'Gastropub', 'Clothing Store', 'Park',
       'Toy / Game Store', 'Beach', 'Spa', 'Bar', 'Cosmetics Shop',
       'Fruit & Vegetable Store', 'Furniture / Home Store',
       'Warehouse Store', 'Grocery Store', 'American Restaurant',
       'Burger Joint', 'Seafood Restaurant', 'Electronics Store',
       'Food Service', 'Bookstore', 'History Museum', 'Big Box Store',
       'Trail', 'Steakhouse', 'Department Store', 'Shoe Store',
       'Spanish Restaurant', 'Donut Shop', 'Gift Shop',
       'Mexican Restaurant', 'Frozen Yogurt Shop', 'Discount Store',
       'Health & Beauty Service', 'Movie Theater', 'Golf Course'],
      dtype=object)
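
The introduction mentions colleges, coffee shops and restaurants, while the analysis that follows looks only at the `Coffee Shop` category. If one wanted to widen the net, a simple keyword match over the category names is a possible starting point. This is only a sketch: the keyword list is an assumption, the sample categories are a subset of those printed above, and `categories_of_interest` is a hypothetical helper.

```python
# a subset of the categories printed above
sample_categories = [
    "Gourmet Shop", "Pizza Place", "Coffee Shop", "Sushi Restaurant",
    "Italian Restaurant", "Restaurant", "Bar", "Bookstore", "Movie Theater",
]


def categories_of_interest(categories, keywords=("coffee", "restaurant", "college", "music")):
    """Return the categories whose name contains any keyword (case-insensitive)."""
    return [
        category for category in categories
        if any(keyword in category.lower() for keyword in keywords)
    ]


selected = categories_of_interest(sample_categories)
print(selected)
```

The selected names could then be used to pick columns from the one-hot encoded dataframe.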

## Analyze the neighborhoods for coffee shops

In [68]:
# one hot encoding
si_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
si_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [si_onehot.columns[-1]] + list(si_onehot.columns[:-1])
si_onehot = si_onehot[fixed_columns]

print(si_onehot.shape)
si_onehot.head()

(6700, 170)


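| | Neighborhoods | Accessories Store | American Restaurant | Arcade | Art Gallery | Art Museum | Arts & Crafts Store | Asian Restaurant | Athletics & Sports | Automotive Shop | ... | Turkish Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Zoo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Annadale | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Annadale | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Annadale | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Annadale | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Annadale | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |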


In [69]:
si_grouped = si_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(si_grouped.shape)
si_grouped

(67, 170)


| | Neighborhoods | Accessories Store | American Restaurant | Arcade | Art Gallery | Art Museum | Arts & Crafts Store | Asian Restaurant | Athletics & Sports | Automotive Shop | ... | Turkish Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Zoo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Annadale | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 |
| 1 | Arden Heights | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 |
| 2 | Arlington | 0.02 | 0.02 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.01 | 0.01 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | 0.01 | 0.01 |
| 3 | Arrochar | 0.0 | 0.03 | 0.0 | 0.01 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 | 0.0 | 0.01 |
| 4 | Bay Terrace | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | ... | 0.0 | 0.01 | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 |
| 5 | Bloomfield | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.01 | 0.01 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.01 | 0.01 | 0.01 | 0.0 | 0.0 | 0.01 | 0.01 | 0.0 |
| 6 | Brighton Heights | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | ... | 0.01 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | Bulls Head | 0.0 | 0.03 | 0.0 | 0.01 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 | 0.0 | 0.01 |
| 8 | Castleton Corners | 0.0 | 0.02 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.01 | 0.01 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.01 |
| 9 | Charleston | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 |


In [70]:
len(si_grouped[si_grouped["Coffee Shop"] > 0])

66

In [71]:
si_coffeeshop = si_grouped[["Neighborhoods","Coffee Shop"]]
si_coffeeshop.head()

| | Neighborhoods | Coffee Shop |
|---|---|---|
| 0 | Annadale | 0.04 |
| 1 | Arden Heights | 0.04 |
| 2 | Arlington | 0.03 |
| 3 | Arrochar | 0.05 |
| 4 | Bay Terrace | 0.05 |
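
The `Coffee Shop` value is the share of a neighborhood's returned venues that fall in that category: one-hot encoding followed by a grouped mean is just a per-neighborhood relative frequency. Since every neighborhood shown has 100 venues, 0.04 corresponds to 4 coffee shops. The plain-Python sketch below reproduces the calculation on a few made-up venue rows; `category_frequencies` and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def category_frequencies(pairs):
    """Map neighborhood -> {category: share of that neighborhood's venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {
            category: count / sum(counter.values())
            for category, count in counter.items()
        }
        for neighborhood, counter in counts.items()
    }


# made-up venue rows: (neighborhood, venue category)
sample_pairs = [
    ("Annadale", "Coffee Shop"),
    ("Annadale", "Diner"),
    ("Annadale", "Pizza Place"),
    ("Annadale", "Pizza Place"),
    ("Arlington", "Coffee Shop"),
    ("Arlington", "Coffee Shop"),
]

frequencies = category_frequencies(sample_pairs)
# categories a neighborhood lacks are simply absent; pandas would show 0 instead
annadale_coffee = frequencies["Annadale"].get("Coffee Shop", 0.0)
print(annadale_coffee)
```

Because the value is a share rather than a count, it depends on how many venues the API returned for each neighborhood.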


## Cluster neighborhoods

In [72]:
# set number of clusters
kclusters = 4

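# keep only the numeric feature; use the axis keyword, since newer pandas versions reject a positional axis
si_clustering = si_coffeeshop.drop(["Neighborhoods"], axis=1)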

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(si_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([2, 2, 0, 1, 1, 0, 3, 1, 2, 2], dtype=int32)
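
With a single feature, k-means amounts to cutting the coffee shop frequency axis into `kclusters` intervals, so neighborhoods with the same frequency always share a label. The sketch below is a minimal plain-Python version of Lloyd's algorithm on scalars, meant only to illustrate this. It is not scikit-learn's implementation: `kmeans_1d` is a hypothetical helper, and starting the centers at evenly spaced positions in the sorted data is a simplifying assumption.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalars; returns (labels, centers)."""
    ordered = sorted(values)
    # assumption: spread the starting centers evenly across the sorted data
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # assignment step: each value joins its nearest center
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # update step: each center moves to the mean of its members
        new_centers = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# coffee shop frequencies of the first five neighborhoods in the notebook
coffee = [0.04, 0.04, 0.03, 0.05, 0.05]
labels, centers = kmeans_1d(coffee, k=3)
print(labels, centers)
```

The label numbers are arbitrary and will not match scikit-learn's. What carries over is the grouping, where equal frequencies land in the same cluster.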

In [73]:
# create a new dataframe that holds each neighborhood's coffee shop frequency and its cluster label
si_merged = si_coffeeshop.copy()

# add clustering labels
si_merged["Cluster Labels"] = kmeans.labels_

In [74]:
si_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
si_merged.head()

| | Neighborhood | Coffee Shop | Cluster Labels |
|---|---|---|---|
| 0 | Annadale | 0.04 | 2 |
| 1 | Arden Heights | 0.04 | 2 |
| 2 | Arlington | 0.03 | 0 |
| 3 | Arrochar | 0.05 | 1 |
| 4 | Bay Terrace | 0.05 | 1 |


In [75]:
# join si_merged with si_df to add latitude/longitude for each neighborhood
si_merged = si_merged.join(si_df.set_index("Neighborhood"), on="Neighborhood")

print(si_merged.shape)
si_merged.head() # check the last columns!

(67, 5)


| | Neighborhood | Coffee Shop | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Annadale | 0.04 | 2 | 40.549206 | -74.17471 |
| 1 | Arden Heights | 0.04 | 2 | 40.559889 | -74.198788 |
| 2 | Arlington | 0.03 | 0 | 40.637188 | -74.167461 |
| 3 | Arrochar | 0.05 | 1 | 40.64242 | -74.07527 |
| 4 | Bay Terrace | 0.05 | 1 | 40.554526 | -74.135852 |


In [76]:
# sort the results by Cluster Labels
print(si_merged.shape)
si_merged.sort_values(["Cluster Labels"], inplace=True)
si_merged

(67, 5)


| | Neighborhood | Coffee Shop | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 33 | Midland Beach | 0.03 | 0 | 40.56956 | -74.09046 |
| 40 | Pleasant Plains | 0.03 | 0 | 40.523708 | -74.219401 |
| 44 | Randall Manor | 0.03 | 0 | 40.632763 | -74.099126 |
| 38 | Old Place | 0.03 | 0 | 40.600733 | -74.105503 |
| 37 | Oakwood | 0.03 | 0 | 40.629749 | -74.101259 |
| 36 | New Springville | 0.03 | 0 | 40.637317 | -74.10355 |
| 35 | New Dorp | 0.03 | 0 | 40.569361 | -74.107686 |
| 34 | New Brighton | 0.03 | 0 | 40.644738 | -74.088965 |
| 65 | Willowbrook | 0.03 | 0 | 40.603204 | -74.141255 |
| 45 | Richmond Valley | 0.03 | 0 | 40.520265 | -74.229972 |
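
To pick a cluster, it helps to summarize each one by its average coffee shop frequency and list its members. The sketch below does this in plain Python on the five rows shown earlier in the notebook. `summarize_clusters` is a hypothetical helper, and treating the highest average frequency as "best" is an assumption: one could equally argue for a cluster with fewer coffee shops and less competition.

```python
from collections import defaultdict
from statistics import mean

# (neighborhood, coffee shop frequency, cluster label) from the notebook's first five rows
cluster_rows = [
    ("Annadale", 0.04, 2),
    ("Arden Heights", 0.04, 2),
    ("Arlington", 0.03, 0),
    ("Arrochar", 0.05, 1),
    ("Bay Terrace", 0.05, 1),
]


def summarize_clusters(rows):
    """Map cluster label -> mean frequency and member neighborhoods."""
    members_by_label = defaultdict(list)
    for name, frequency, label in rows:
        members_by_label[label].append((name, frequency))
    return {
        label: {
            "mean_frequency": mean(frequency for _, frequency in members),
            "neighborhoods": [name for name, _ in members],
        }
        for label, members in members_by_label.items()
    }


summary = summarize_clusters(cluster_rows)
# assumption: the cluster with the highest average frequency is the candidate
best_label = max(summary, key=lambda label: summary[label]["mean_frequency"])
print(best_label, summary[best_label]["neighborhoods"])
```

On the full dataframe, `si_merged.groupby("Cluster Labels")["Coffee Shop"].mean()` gives the same per-cluster averages.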


In [77]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label
for lat, lon, poi, cluster in zip(si_merged['Latitude'], si_merged['Longitude'], si_merged['Neighborhood'], si_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [78]:
# save the map as HTML file
map_clusters.save('map_clusters.html')

## Conclusion

The goal of this exercise was to find areas on Staten Island that sit in a cluster with coffee shops nearby.
This gives us an opportunity to explore opening a coffeehouse-style music venue and to draw on nearby businesses to help drive traffic to it.

Areas near Silver Lake and Meier's Corners represent potential landing spots for this business venture.