<h1> Applied Data Science Capstone | The Battle of Neighborhoods | Finding a Good Place in Boston, MA </h1>

<h2> 1. Installing Python libraries that we will need. </h2>

In [1]:
!pip install folium
!pip install geocoder
!conda install -c conda-forge geopy --yes

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1
Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen sol

<h2> 2. Importing Python Libraries that we will be using. </h2>

In [45]:
# Libraries for making requests and scraping the results
from bs4 import BeautifulSoup
import requests

# import k-means from clustering stage
from sklearn.cluster import KMeans

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 
import geocoder

# library for data analysis
import pandas as pd 

# map rendering library
import folium

# library to handle data in a vectorized manner
import numpy as np  

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

print("Libraries Imported")

Libraries Imported


<h2> 3. Data Extraction </h2>

In [46]:
# This URL leads to a website that has Boston Communities (Neighborhoods) and Zip Codes.
# We will make a GET request and use Beautiful Soup to scrape the results
url = "http://archive.boston.com/news/local/articles/2007/04/15/sixfigurezipcodes_city"
extracting_data = requests.get(url).text
soup = BeautifulSoup(extracting_data, 'html.parser')

In [47]:
table_contents = []
tables = soup.find_all('table')

# The table at index 6 (the seventh table on the page) has the zipcode information.
for row in tables[6].find_all('tr'):
    cols = row.find_all('td')
    # Keep only rows with at least three cells. This matches the original loop,
    # which appended a row only once it reached the row's third td.
    if len(cols) >= 3:
        table_contents.append({
            'Zip': cols[0].text,           # the first td is the zip code
            'Neighborhood': cols[1].text,  # the second td is the neighborhood
        })
        
#After extracting the data from the table on the webpage, we compile the data into a DataFrame.        
df=pd.DataFrame(table_contents)
df.head()

Unnamed: 0,Zip,Neighborhood
0,2101,Downtown Boston
1,2108,Beacon Hill
2,2109,Markets / Inner Harbor
3,2110,Financial District / Wharves
4,2111,Chinatown / Tufts-New England Medical Center
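<p> Boston zip codes start with a zero, but the table above shows them as four digits (2101 rather than 02101). The sketch below is one way to restore the leading zero, assuming the values may arrive as integers or as strings with stray whitespace. The helper name <code>normalize_zip</code> is hypothetical and not part of the original notebook. </p>

```python
def normalize_zip(value):
    """Return a 5-character zip code string, restoring dropped leading zeros."""
    text = str(value).strip()
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a 5-digit zip code: {value!r}")
    return text.zfill(5)


# Sample values in the shapes the scraped column might take
zips = [2101, "2108", " 2109 "]
print([normalize_zip(z) for z in zips])
```

<p> With a DataFrame, the same helper could be applied as <code>df['Zip'] = df['Zip'].map(normalize_zip)</code> before geocoding. </p>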


<h2> 4. Adding Latitude and Longitude to the Data Frame </h2>

In [53]:
def get_latlong(zip_code):
    
    # initialize your variable to None
    lat_lng_coords = None
    
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Boston, Massachusetts'.format(zip_code))
        lat_lng_coords = g.latlng
        
    return lat_lng_coords

zipCodes = df['Zip']

coords = [ get_latlong(zipCode) for zipCode in zipCodes.tolist() ]

df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
df['Latitude'] = df_coords['Latitude']
df['Longitude'] = df_coords['Longitude']

df.head()

Unnamed: 0,Zip,Neighborhood,Latitude,Longitude
0,2101,Downtown Boston,42.34724,-71.064563
1,2108,Beacon Hill,42.359005,-71.059746
2,2109,Markets / Inner Harbor,42.36032,-71.054845
3,2110,Financial District / Wharves,42.356035,-71.05481
4,2111,Chinatown / Tufts-New England Medical Center,42.350375,-71.06056
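<p> The <code>while</code> loop in <code>get_latlong</code> never exits if the geocoder keeps returning <code>None</code>. A minimal sketch of a bounded retry follows. The function name, the <code>max_attempts</code> and <code>delay_seconds</code> parameters, and the injected <code>lookup</code> callable are all assumptions made for illustration. In the notebook, <code>lookup</code> could be something like <code>lambda q: geocoder.arcgis(q).latlng</code>. </p>

```python
import time


def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is any callable that returns [lat, lng] or None.
    Returns the coordinates, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        # Pause between attempts, but not after the final one
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None


# Stand-in lookup that fails once, then returns the first row's coordinates
responses = iter([None, [42.34724, -71.064563]])
print(geocode_with_retry(lambda q: next(responses),
                         "2101, Boston, Massachusetts",
                         delay_seconds=0))
```

<p> Callers then need to decide what to do with a <code>None</code> result, for example dropping that zip code before building the coordinates DataFrame. </p>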


<h2> 5. Use the FourSquare API to get venues around Neighborhoods </h2>

In [54]:
# Setting up FourSquare API credentials
# Replace the placeholders with your own keys; real credentials should not be published in a notebook.
CLIENT_ID = 'YOUR_CLIENT_ID'
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'
VERSION = '20210324'

In [56]:
venues = []

#Make the FourSquare API call for the neighborhoods
for lat, long, zipCode, neighborhood in zip(df['Latitude'], df['Longitude'], df['Zip'], df['Neighborhood']):
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        500, #Radius
        100) #Limit
    
    parsedResults = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in parsedResults:
        venues.append((
            zipCode,neighborhood,lat,long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
    
# Organize the API response information into a dataframe
venuesDf = pd.DataFrame(venues)

venuesDf.columns = ['ZipCode', 'Neighborhood', 'BoroughLatitude', 'BoroughLongitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

venuesDf.head()

Unnamed: 0,ZipCode,Neighborhood,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,2101,Downtown Boston,42.34724,-71.064563,Whole Foods Market,42.345304,-71.063061,Grocery Store
1,2101,Downtown Boston,42.34724,-71.064563,Turnstyle Cycle,42.345806,-71.063228,Cycle Studio
2,2101,Downtown Boston,42.34724,-71.064563,Shore Leave,42.345279,-71.06387,Tiki Bar
3,2101,Downtown Boston,42.34724,-71.064563,Tatte Bakery & Cafe,42.344815,-71.063969,Bakery
4,2101,Downtown Boston,42.34724,-71.064563,Mike & Patty's,42.348604,-71.067913,Sandwich Place
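<p> The loop above indexes straight into <code>["response"]['groups'][0]['items']</code> and <code>['categories'][0]</code>, so an error response or a venue without categories would raise an exception. The sketch below shows a more defensive way to flatten the same response shape. The function name <code>extract_venues</code> and the <code>"Uncategorized"</code> fallback are assumptions, and the second sample item is made up to show the fallback. </p>

```python
def extract_venues(payload):
    """Flatten an explore-style response into (name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # Fall back to a single empty category when the list is missing or empty
        categories = venue.get("categories") or [{}]
        rows.append((
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name", "Uncategorized"),
        ))
    return rows


sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Whole Foods Market",
               "location": {"lat": 42.345304, "lng": -71.063061},
               "categories": [{"name": "Grocery Store"}]}},
    # Hypothetical venue with no categories, to exercise the fallback
    {"venue": {"name": "Example Venue",
               "location": {"lat": 42.0, "lng": -71.0},
               "categories": []}},
]}]}}

for row in extract_venues(sample):
    print(row)
print(extract_venues({"meta": {"code": 400}}))
```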


<h2> 6. Analyze Data </h2>

In [57]:
venuesDf.groupby('Neighborhood').count()

Unnamed: 0_level_0,ZipCode,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Allston,17,17,17,17,17,17,17
Back Bay,100,100,100,100,100,100,100
Beacon Hill,100,100,100,100,100,100,100
Brighton,48,48,48,48,48,48,48
Brookline,107,107,107,107,107,107,107
Brookline Village,13,13,13,13,13,13,13
Cambridge,84,84,84,84,84,84,84
Charlestown,30,30,30,30,30,30,30
Chinatown / Tufts-New England Medical Center,100,100,100,100,100,100,100
Dorchester / Codman Square,9,9,9,9,9,9,9


<h2> 7. Map Boston </h2>

In [58]:
address = 'Boston'

geolocator = Nominatim(user_agent="final")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Boston are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Boston are 42.3602534, -71.0582912.


In [59]:
map_boston = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_boston)  
    
map_boston

<h2> 8. Data Analysis and Preparation for Clustering </h2>

In [60]:
# one hot encoding
boston_onehot = pd.get_dummies(venuesDf[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
boston_onehot['Neighborhood'] = venuesDf['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [boston_onehot.columns[-1]] + list(boston_onehot.columns[:-1])
boston_onehot = boston_onehot[fixed_columns]

boston_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,African Restaurant,American Restaurant,Antique Shop,Aquarium,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [61]:
boston_grouped = boston_onehot.groupby('Neighborhood').mean().reset_index()
boston_onehot.shape

(2326, 245)
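<p> Taking the mean of the one-hot columns per neighborhood turns each row of <code>boston_grouped</code> into the share of that neighborhood's venues in each category. The toy example below illustrates this on made-up data. The neighborhood names "A" and "B" and the four venues are assumptions for illustration only. </p>

```python
import pandas as pd

# Toy stand-in for venuesDf: neighborhood A has three venues, B has one
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "VenueCategory": ["Bakery", "Bakery", "Gym", "Gym"],
})

# One indicator column per category, then the neighborhood label in front
onehot = pd.get_dummies(toy["VenueCategory"], dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# The mean of the indicators is the fraction of venues in each category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

<p> Each row's category shares sum to 1, which is why the frequencies printed in the next cell are small decimals. </p>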

In [62]:
# Print the most common venue categories for each neighborhood
num_top_venues = 5

for hood in boston_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = boston_grouped[boston_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Allston----
          venue  freq
0           Gym  0.12
1   Rugby Pitch  0.06
2        Bakery  0.06
3   Gas Station  0.06
4  Squash Court  0.06


----Back Bay----
            venue  freq
0             Spa  0.05
1           Hotel  0.05
2             Gym  0.04
3  Cosmetics Shop  0.04
4   Women's Store  0.03


----Beacon Hill----
                 venue  freq
0          Coffee Shop  0.06
1        Historic Site  0.06
2   Seafood Restaurant  0.05
3       Sandwich Place  0.05
4  American Restaurant  0.04


----Brighton----
         venue  freq
0  Bus Station  0.08
1         Bank  0.06
2       Bakery  0.06
3  Pizza Place  0.06
4          Pub  0.04


----Brookline----
              venue  freq
0       Pizza Place  0.07
1              Café  0.05
2               Bar  0.03
3  Sushi Restaurant  0.03
4         Gift Shop  0.03


----Brookline Village----
               venue  freq
0          Gift Shop  0.15
1  Indian Restaurant  0.08
2        Fabric Shop  0.08
3      Grocery Store  0.08
4        

In [63]:
#Sorting the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
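<p> When a neighborhood has fewer distinct categories than <code>num_top_venues</code>, the sort above fills the remaining slots with categories whose frequency is zero. The variant below is a sketch that keeps only categories that actually occur and pads the rest with <code>None</code>. The name <code>top_categories</code> and the sample frequencies are assumptions. </p>

```python
import pandas as pd


def top_categories(freqs, n):
    """Return up to n category names with non-zero frequency, highest first.

    freqs is a Series of category -> frequency. The result always has
    length n; unused slots are filled with None.
    """
    nonzero = freqs[freqs > 0].sort_values(ascending=False)
    names = list(nonzero.index[:n])
    return names + [None] * (n - len(names))


# Made-up frequencies for a neighborhood with only two categories present
freqs = pd.Series({"Yoga Studio": 0.0, "Lake": 0.75, "Gym": 0.25, "Opera House": 0.0})
print(top_categories(freqs, 4))
```

<p> In the notebook this could be called on <code>row.iloc[1:].astype(float)</code> in place of the plain sort. </p>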

In [64]:
#Creating data frame for top ten venues for each neighborhood

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = boston_grouped['Neighborhood']

for ind in np.arange(boston_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(boston_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Allston,Gym,Rugby Pitch,Bakery,Gas Station,Squash Court,Grocery Store,Soccer Field,Tennis Court,Beer Garden,Coffee Shop
1,Back Bay,Spa,Hotel,Gym,Cosmetics Shop,Women's Store,Gym / Fitness Center,Sandwich Place,Coffee Shop,Clothing Store,Seafood Restaurant
2,Beacon Hill,Coffee Shop,Historic Site,Seafood Restaurant,Sandwich Place,American Restaurant,Pub,Park,Hotel,Plaza,Gastropub
3,Brighton,Bus Station,Bank,Bakery,Pizza Place,Pub,Coffee Shop,Chinese Restaurant,Café,Smoke Shop,Tanning Salon
4,Brookline,Pizza Place,Café,Bar,Sushi Restaurant,Gift Shop,Park,Coffee Shop,Gym,Donut Shop,Falafel Restaurant
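<p> The column labels above rely on a three-item suffix list plus an exception for everything else, which works for ten columns but would produce "11st" style errors only by luck of the range. A small sketch of a general ordinal helper follows. The function name <code>ordinal</code> is hypothetical. </p>

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns[:4])
```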


<h2> 9. Clustering Neighborhoods </h2>

In [69]:
# set number of clusters
kclusters = 5

boston_grouped_clustering = boston_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(boston_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
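<p> The first ten labels all fall in cluster 0, which suggests that five clusters may not split this data evenly. One common way to compare candidate values of <code>kclusters</code> is the silhouette score. The sketch below is not part of the original analysis. It runs on synthetic data standing in for <code>boston_grouped_clustering</code>, and the helper name <code>silhouette_by_k</code> is an assumption. </p>

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, k_values, random_state=0):
    """Return {k: silhouette score} for each candidate cluster count."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores


# Synthetic stand-in data: three tight, well-separated blobs of 20 points
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

scores = silhouette_by_k(features, range(2, 6))
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

<p> A higher score means tighter, better separated clusters, so the candidate with the highest score is a reasonable starting point rather than a definitive answer. </p>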

In [70]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

boston_merged = df

# join neighborhoods_venues_sorted onto df so each row keeps its latitude/longitude
boston_merged = boston_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

boston_merged.head() # check the last columns!

Unnamed: 0,Zip,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,2101,Downtown Boston,42.34724,-71.064563,0,Coffee Shop,Italian Restaurant,Hotel,Sandwich Place,Seafood Restaurant,Bakery,American Restaurant,Historic Site,Gym / Fitness Center,Clothing Store
1,2108,Beacon Hill,42.359005,-71.059746,0,Coffee Shop,Historic Site,Seafood Restaurant,Sandwich Place,American Restaurant,Pub,Park,Hotel,Plaza,Gastropub
2,2109,Markets / Inner Harbor,42.36032,-71.054845,0,Italian Restaurant,Seafood Restaurant,Park,Historic Site,Bakery,Pub,American Restaurant,Sandwich Place,Hotel,Salad Place
3,2110,Financial District / Wharves,42.356035,-71.05481,0,Hotel,Seafood Restaurant,Historic Site,Sandwich Place,Park,Boat or Ferry,Salad Place,Café,Clothing Store,Coffee Shop
4,2111,Chinatown / Tufts-New England Medical Center,42.350375,-71.06056,0,Chinese Restaurant,Bakery,Asian Restaurant,Coffee Shop,Sushi Restaurant,Theater,Bubble Tea Shop,Performing Arts Venue,Café,Dessert Shop


<h2> 10. Mapping the Clusters </h2>

In [71]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boston_merged['Latitude'], boston_merged['Longitude'], boston_merged['Neighborhood'], boston_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
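<p> The k-means labels are 0-based, so a palette with one entry per cluster can be indexed by the label directly. The sketch below builds such a palette with the same matplotlib colormap and adds a fallback color for rows that have no label, for example a neighborhood with no venues after the join. The helper names and the grey fallback are assumptions. </p>

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_palette(n_clusters):
    """Return n_clusters hex colors spread evenly across the rainbow colormap."""
    return [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, n_clusters))]


def color_for(label, palette, fallback="#808080"):
    """Pick the palette entry for a 0-based label; NaN, None or out-of-range gets fallback."""
    try:
        index = int(label)
    except (TypeError, ValueError):
        return fallback
    return palette[index] if 0 <= index < len(palette) else fallback


palette = cluster_palette(5)
print(palette)
print(color_for(0, palette), color_for(float("nan"), palette))
```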

<h2> 11. Examine Clusters </h2>

In [72]:
#cluster 1
boston_merged.loc[boston_merged['Cluster Labels'] == 0, boston_merged.columns[[1] + list(range(5, boston_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Boston,Coffee Shop,Italian Restaurant,Hotel,Sandwich Place,Seafood Restaurant,Bakery,American Restaurant,Historic Site,Gym / Fitness Center,Clothing Store
1,Beacon Hill,Coffee Shop,Historic Site,Seafood Restaurant,Sandwich Place,American Restaurant,Pub,Park,Hotel,Plaza,Gastropub
2,Markets / Inner Harbor,Italian Restaurant,Seafood Restaurant,Park,Historic Site,Bakery,Pub,American Restaurant,Sandwich Place,Hotel,Salad Place
3,Financial District / Wharves,Hotel,Seafood Restaurant,Historic Site,Sandwich Place,Park,Boat or Ferry,Salad Place,Café,Clothing Store,Coffee Shop
4,Chinatown / Tufts-New England Medical Center,Chinese Restaurant,Bakery,Asian Restaurant,Coffee Shop,Sushi Restaurant,Theater,Bubble Tea Shop,Performing Arts Venue,Café,Dessert Shop
5,Downtown Boston,Coffee Shop,Italian Restaurant,Hotel,Sandwich Place,Seafood Restaurant,Bakery,American Restaurant,Historic Site,Gym / Fitness Center,Clothing Store
6,North End,Italian Restaurant,Seafood Restaurant,Pizza Place,Coffee Shop,Park,Bakery,Pub,Café,Sandwich Place,Skating Rink
7,West End / Back of the Hill,Pizza Place,Hotel,Café,Hotel Bar,Italian Restaurant,Bar,Sandwich Place,Donut Shop,Food Truck,Coffee Shop
8,Fenway / East Fens / Longwood,Concert Hall,Coffee Shop,Café,Bakery,Shoe Store,Grocery Store,Middle Eastern Restaurant,Hotel Bar,Burrito Place,Plaza
9,Back Bay,Spa,Hotel,Gym,Cosmetics Shop,Women's Store,Gym / Fitness Center,Sandwich Place,Coffee Shop,Clothing Store,Seafood Restaurant


In [73]:
#cluster 2
boston_merged.loc[boston_merged['Cluster Labels'] == 1, boston_merged.columns[[1] + list(range(5, boston_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
24,Roslindale,Lake,Yoga Studio,Performing Arts Venue,Music Venue,Nail Salon,National Park,New American Restaurant,Nightclub,Noodle House,Opera House
25,West Roxbury,Lake,Yoga Studio,Performing Arts Venue,Music Venue,Nail Salon,National Park,New American Restaurant,Nightclub,Noodle House,Opera House


In [42]:
#cluster 3
boston_merged.loc[boston_merged['Cluster Labels'] == 2, boston_merged.columns[[1] + list(range(5, boston_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Roxbury / Grove Hall,Food,Garden,Discount Store,Fish & Chips Shop,Pedestrian Plaza,Nail Salon,National Park,New American Restaurant,Nightclub,Noodle House


In [74]:
#cluster 4
boston_merged.loc[boston_merged['Cluster Labels'] == 3, boston_merged.columns[[1] + list(range(5, boston_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,Jamaica Plain,Home Service,Yoga Studio,Performing Arts Venue,Music Venue,Nail Salon,National Park,New American Restaurant,Nightclub,Noodle House,Opera House


In [76]:
#cluster 5
boston_merged.loc[boston_merged['Cluster Labels'] == 4, boston_merged.columns[[1] + list(range(5, boston_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
39,North Brighton / Cambridge,Park,Residential Building (Apartment / Condo),Pool,College Hockey Rink,College Stadium,Gym,Noodle House,Opera House,Performing Arts Venue,Outdoor Sculpture


<h2> 12. Conclusion </h2>

<p> In conclusion, cluster 1 (label 0) contains far more neighborhoods and venues than any other cluster, which makes it the most promising area for a new venue. </p>
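<p> The claim about cluster sizes can be checked by counting how many neighborhoods received each label. The sketch below uses a short made-up list of labels. In the notebook the input would be <code>kmeans.labels_</code> or the <code>Cluster Labels</code> column. </p>

```python
from collections import Counter

# Hypothetical labels for illustration only
labels = [0, 0, 0, 1, 1, 2, 0, 3, 4, 0]

sizes = Counter(labels)
largest_label, largest_count = sizes.most_common(1)[0]
print(f"Largest cluster: label {largest_label} "
      f"with {largest_count} of {len(labels)} neighborhoods")
```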