# Introduction

In this project I explore, segment, and cluster the neighborhoods in the city of Boston based upon the number of MBTA stops, methadone clinics, and halfway houses within each neighborhood. This project utilizes web scraping, Foursquare.com's API, and k-means clustering.

The Boston neighborhood data comes from a Massachusetts ZIP code listing ("https://www.zip-codes.com/state/ma.asp"), which has the information needed to explore and cluster the neighborhoods in Boston. First, I scrape the webpage, clean and wrangle the data, and read it into a pandas dataframe so that it is in a structured format.

With the data in a structured format, I conduct the analysis to explore and cluster the neighborhoods in Boston.

### Import necessary libraries and extract from the website

In [365]:
import requests
website_url = requests.get('https://www.zip-codes.com/state/ma.asp').text

from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
 <head>
  <title>
   Listing of all Zip Codes in the state of Massachusetts
  </title>
  <meta content="List of all Zip Codes for the state of Massachusetts, MA. Includes all counties and cities in Massachusetts." name="description"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="en-us" http-equiv="content-language"/>
  <meta content="index,follow" name="robots"/>
  <link href="https://www.zip-codes.com/state/ma.asp" rel="canonical"/>
  <link href="https://www.zip-codes.com/m/state/ma.asp" media="only screen and (max-width: 640px)" rel="alternate"/>
  <script async="" src="https://www.zip-codes.com/m/theme/ga/local-analytics.js">
  </script>
  <script>
   window.dataLayer = window.dataLayer || [];function gtag(){dataLayer.push(arguments);}gtag('js', new Date());gtag('config', 'UA-23873959-1');
  </script>
  <script async="async" src="https://www.googletagservices.com/tag/js/

### Inspecting the output shows that the data is available in a "table" element with class="statTable"

In [366]:
Table1 = soup.find('table',{'class':'statTable'})
Table1

<table border="0" cellpadding="0" cellspacing="0" class="statTable" id="tblZIP" title="All Massachusetts ZIP Codes, City, County, Classification, and Area Codes." width="99%">
<tr>
<td class="label" title="All ZIP Codes for Massachusetts"><strong>ZIP Code</strong></td>
<td class="info" title="The official city name as designated by the USPS."><strong>City</strong></td>
<td class="info" title="The primary county or parish this ZIP Code serves."><strong>County</strong></td>
<td class="info" title="The classification type of this ZIP Code."><strong>Type</strong></td>
</tr><tr><td><a href="/zip-code/01001/zip-code-01001.asp" title="ZIP Code 01001">ZIP Code 01001</a></td><td><a href="/city/ma-agawam.asp" title="Agawam, MA">Agawam</a></td><td><a href="/county/ma-hampden.asp">Hampden</a></td><td>Standard</td></tr><tr><td><a href="/zip-code/01002/zip-code-01002.asp" title="ZIP Code 01002">ZIP Code 01002</a></td><td><a href="/city/ma-amherst.asp" title="Amherst, MA">Amherst</a></td><td><a href=

In [367]:
print(Table1.tr.text)


ZIP Code
City
County
Type



In [368]:
headers="ZIP Code,City,County,Type"

In [369]:
Table2=""
for tr in Table1.find_all('tr'):
    row=""
    for tds in tr.find_all('td'):
        row=row+","+tds.text
    Table2=Table2+row[1:]+"\n"
print(Table2)

ZIP Code,City,County,Type
ZIP Code 01001,Agawam,Hampden,Standard
ZIP Code 01002,Amherst,Hampshire,Standard
ZIP Code 01003,Amherst,Hampshire,Standard
ZIP Code 01004,Amherst,Hampshire,P.O. Box
ZIP Code 01005,Barre,Worcester,Standard
ZIP Code 01007,Belchertown,Hampshire,Standard
ZIP Code 01008,Blandford,Hampden,Standard
ZIP Code 01009,Bondsville,Hampden,P.O. Box
ZIP Code 01010,Brimfield,Hampden,Standard
ZIP Code 01011,Chester,Hampden,Standard
ZIP Code 01012,Chesterfield,Hampshire,Standard
ZIP Code 01013,Chicopee,Hampden,Standard
ZIP Code 01014,Chicopee,Hampden,P.O. Box
ZIP Code 01020,Chicopee,Hampden,Standard
ZIP Code 01021,Chicopee,Hampden,P.O. Box
ZIP Code 01022,Chicopee,Hampden,Standard
ZIP Code 01026,Cummington,Hampshire,Standard
ZIP Code 01027,Easthampton,Hampshire,Standard
ZIP Code 01028,East Longmeadow,Hampden,Standard
ZIP Code 01029,East Otis,Berkshire,P.O. Box
ZIP Code 01030,Feeding Hills,Hampden,Standard
ZIP Code 01031,Gilbertville,Worcester,Standard
ZIP Code 01032,Goshen,Hampsh

In [370]:
file_Boston=open("Boston.csv","wb")
#file_Boston.write(bytes(headers,encoding="ascii",errors="ignore"))
file_Boston.write(bytes(Table2,encoding="ascii",errors="ignore"))

29402
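
The file handle above is never closed explicitly, so the data may not be fully flushed to disk before it is read back. A minimal sketch of the same write using a `with` block and the standard `csv` module, which also quotes any field that happens to contain a comma. The sample rows, the file name, and the `write_rows` helper are illustrative assumptions, not part of the original notebook.

```python
import csv
import os
import tempfile

def write_rows(path, rows):
    """Write rows (lists of strings) to a CSV file and close it on exit."""
    # newline='' lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="ascii", errors="ignore") as f:
        csv.writer(f).writerows(rows)

# hypothetical sample rows in the same shape as the scraped table
sample_rows = [
    ["ZIP Code", "City", "County", "Type"],
    ["ZIP Code 02108", "Boston", "Suffolk", "Standard"],
]
sample_path = os.path.join(tempfile.gettempdir(), "Boston_sample.csv")
write_rows(sample_path, sample_rows)
```

Because the `with` block closes the file when it ends, a later `pd.read_csv` call sees the complete contents.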

In [371]:
import pandas as pd
Bos_Suffolk = pd.read_csv('Boston.csv',header=None)
Bos_Suffolk.columns=["ZIP Code","City","County","Type"]

Bos_Suffolk.head(10)

Unnamed: 0,ZIP Code,City,County,Type
0,ZIP Code,City,County,Type
1,ZIP Code 01001,Agawam,Hampden,Standard
2,ZIP Code 01002,Amherst,Hampshire,Standard
3,ZIP Code 01003,Amherst,Hampshire,Standard
4,ZIP Code 01004,Amherst,Hampshire,P.O. Box
5,ZIP Code 01005,Barre,Worcester,Standard
6,ZIP Code 01007,Belchertown,Hampshire,Standard
7,ZIP Code 01008,Blandford,Hampden,Standard
8,ZIP Code 01009,Bondsville,Hampden,P.O. Box
9,ZIP Code 01010,Brimfield,Hampden,Standard
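
Row 0 in the output above is the header text itself: the file already starts with a header line, and `header=None` tells pandas to treat that line as data. The later filter on `County` removes the row, so the results are unaffected. A small sketch of the alternative, assuming the same file layout; it reads from an in-memory sample rather than `Boston.csv`.

```python
import io
import pandas as pd

# in-memory stand-in for Boston.csv (sample rows only)
sample_csv = io.StringIO(
    "ZIP Code,City,County,Type\n"
    "ZIP Code 01001,Agawam,Hampden,Standard\n"
    "ZIP Code 02108,Boston,Suffolk,Standard\n"
)

# header=0 uses the first line as column names instead of as a data row
df = pd.read_csv(sample_csv, header=0)
print(df.head())
```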


In [372]:
del Bos_Suffolk['Type']
Bos_Suffolk.head(10)

Unnamed: 0,ZIP Code,City,County
0,ZIP Code,City,County
1,ZIP Code 01001,Agawam,Hampden
2,ZIP Code 01002,Amherst,Hampshire
3,ZIP Code 01003,Amherst,Hampshire
4,ZIP Code 01004,Amherst,Hampshire
5,ZIP Code 01005,Barre,Worcester
6,ZIP Code 01007,Belchertown,Hampshire
7,ZIP Code 01008,Blandford,Hampden
8,ZIP Code 01009,Bondsville,Hampden
9,ZIP Code 01010,Brimfield,Hampden


In [373]:
Suffolk = Bos_Suffolk[Bos_Suffolk['County'] == 'Suffolk'].reset_index(drop=True)
Suffolk.head(10)

Unnamed: 0,ZIP Code,City,County
0,ZIP Code 02108,Boston,Suffolk
1,ZIP Code 02109,Boston,Suffolk
2,ZIP Code 02110,Boston,Suffolk
3,ZIP Code 02111,Boston,Suffolk
4,ZIP Code 02112,Boston,Suffolk
5,ZIP Code 02113,Boston,Suffolk
6,ZIP Code 02114,Boston,Suffolk
7,ZIP Code 02115,Boston,Suffolk
8,ZIP Code 02116,Boston,Suffolk
9,ZIP Code 02117,Boston,Suffolk


In [374]:
# remove the literal 'ZIP Code ' label, keeping only the five-digit code
Suffolk['ZIP Code'] = Suffolk['ZIP Code'].str.replace('ZIP Code ', '', regex=False)
Suffolk.head(10)

Unnamed: 0,ZIP Code,City,County
0,2108,Boston,Suffolk
1,2109,Boston,Suffolk
2,2110,Boston,Suffolk
3,2111,Boston,Suffolk
4,2112,Boston,Suffolk
5,2113,Boston,Suffolk
6,2114,Boston,Suffolk
7,2115,Boston,Suffolk
8,2116,Boston,Suffolk
9,2117,Boston,Suffolk
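
When removing the `ZIP Code ` label, keep in mind that `strip` treats its argument as a set of characters to trim from both ends, not as a literal prefix. It is safe here only because the ZIP codes themselves are digits. A short sketch with plain Python strings (the pandas `.str` methods apply the same rule to each element); the second input string is a made-up value that shows the difference, and `drop_prefix` is a hypothetical helper.

```python
label = "ZIP Code "

# strip() trims any of the characters Z, I, P, space, C, o, d, e from both ends
assert "ZIP Code 02108".strip("ZIP Code") == "02108"

# a made-up value that ends in some of those characters loses them as well
assert "ZIP Code 02108 Pod".strip("ZIP Code") == "02108"

def drop_prefix(text, prefix):
    """Remove prefix only if text starts with it; leave the rest untouched."""
    return text[len(prefix):] if text.startswith(prefix) else text

assert drop_prefix("ZIP Code 02108 Pod", label) == "02108 Pod"
```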


In [375]:
Suffolk = Suffolk.rename(columns={"ZIP Code": "ZIP"})
Suffolk.head(5)

Unnamed: 0,ZIP,City,County
0,2108,Boston,Suffolk
1,2109,Boston,Suffolk
2,2110,Boston,Suffolk
3,2111,Boston,Suffolk
4,2112,Boston,Suffolk


In [376]:
del Suffolk['County']
Suffolk.head(10)

Unnamed: 0,ZIP,City
0,2108,Boston
1,2109,Boston
2,2110,Boston
3,2111,Boston
4,2112,Boston
5,2113,Boston
6,2114,Boston
7,2115,Boston
8,2116,Boston
9,2117,Boston


In [377]:
Suffolk.dtypes

ZIP     object
City    object
dtype: object

In [378]:
Suffolk.head()

Unnamed: 0,ZIP,City
0,2108,Boston
1,2109,Boston
2,2110,Boston
3,2111,Boston
4,2112,Boston


### A public GitHub gist provides a .csv list of ZIP codes with coordinates. I use it to get the latitude and longitude for each ZIP code.

In [379]:
coords=pd.read_csv('https://gist.githubusercontent.com/erichurst/7882666/raw/5bdc46db47d9515269ab12ed6fb2850377fd869e/US%2520Zip%2520Codes%2520from%25202013%2520Government%2520Data', dtype={'ZIP': str})
coords.head(5)

Unnamed: 0,ZIP,LAT,LNG
0,601,18.180555,-66.749961
1,602,18.361945,-67.175597
2,603,18.455183,-67.119887
3,606,18.158345,-66.932911
4,610,18.295366,-67.125135


In [380]:
coords.dtypes

ZIP     object
LAT    float64
LNG    float64
dtype: object
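
Reading `ZIP` as a string matters because Massachusetts ZIP codes start with 0. A numeric column would drop that leading zero and would no longer match string ZIP codes in a merge. A minimal sketch using an in-memory sample in the same `ZIP,LAT,LNG` layout:

```python
import io
import pandas as pd

sample = "ZIP,LAT,LNG\n02108,42.357768,-71.064858\n"

as_text = pd.read_csv(io.StringIO(sample), dtype={"ZIP": str})
as_number = pd.read_csv(io.StringIO(sample))

print(as_text["ZIP"][0])    # keeps the leading zero
print(as_number["ZIP"][0])  # parsed as an integer, so the zero is lost
```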

In [381]:
Boston = pd.merge(Suffolk, coords, on='ZIP')
Boston

Unnamed: 0,ZIP,City,LAT,LNG
0,2108,Boston,42.357768,-71.064858
1,2109,Boston,42.367032,-71.050493
2,2110,Boston,42.361962,-71.047846
3,2111,Boston,42.350518,-71.059077
4,2113,Boston,42.365331,-71.055233
5,2114,Boston,42.363174,-71.068646
6,2115,Boston,42.337105,-71.105696
7,2116,Boston,42.350579,-71.076397
8,2118,Boston,42.337582,-71.070482
9,2119,Roxbury,42.324029,-71.085017


### Install and import necessary libraries

In [382]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


In [383]:
address = 'Boston, MA'

geolocator = Nominatim(user_agent="Boston_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Boston are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Boston are 42.3602534, -71.0582912.


In [384]:
# Create map of Boston using latitude and longitude values
Boston_map = folium.Map(location=[latitude, longitude], zoom_start=11)

# Add markers to map
for ZIP, City, LAT, LNG in zip(Boston['ZIP'], Boston['City'], Boston['LAT'], Boston['LNG']):
    label = '{}, {}'.format(ZIP, City)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [LAT, LNG],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Boston_map)  
    
Boston_map

### Define Foursquare Credentials and Version

In [385]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET: YOUR_CLIENT_SECRET


In [485]:
def getNearbyVenues(names, latitudes, longitudes, radius=10000, LIMIT=1000, categoryIds=''):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)

        # optionally restrict the search to the given category ID(s)
        if categoryIds != '':
            url = url + '&categoryId={}'.format(categoryIds)

        # make the GET request; raise on HTTP errors instead of hiding them
        response = requests.get(url)
        response.raise_for_status()
        results = response.json()['response']['venues']

        # keep only the relevant fields, skipping venues without a category
        for v in results:
            if not v.get('categories'):
                continue
            venues_list.append((
                name,
                lat,
                lng,
                v['name'],
                v['location']['lat'],
                v['location']['lng'],
                v['categories'][0]['name']
            ))

    # one row per (ZIP, venue) pair
    nearby_venues = pd.DataFrame(venues_list, columns=[
        'ZIP',
        'LAT',
        'LNG',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'])

    return(nearby_venues)
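
The request URL can also be assembled with `urllib.parse.urlencode`, which escapes each value and avoids hand-built query strings. A minimal sketch, assuming the same `venues/search` endpoint and parameter names used in the function above; `build_search_url` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import urlencode

def build_search_url(lat, lng, client_id, client_secret, version,
                     radius=10000, limit=1000, category_id=""):
    """Return a Foursquare venue search URL for one point (sketch only)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # only add the category filter when one was requested
    if category_id:
        params["categoryId"] = category_id
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

url = build_search_url(42.357768, -71.064858, "YOUR_ID", "YOUR_SECRET",
                       "20180605", category_id="52e81612bcbc57f1066b7a39")
print(url)
```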

In [487]:
#https://developer.foursquare.com/docs/resources/categories
# Mental Health Office = 52e81612bcbc57f1066b7a39

Boston_venues1 = getNearbyVenues(names=Boston['ZIP'], latitudes=Boston['LAT'], longitudes=Boston['LNG'], 
                                categoryIds='52e81612bcbc57f1066b7a39')
Boston_venues1

Unnamed: 0,ZIP,LAT,LNG,Venue,Venue Latitude,Venue Longitude,Venue Category
0,2108,42.357768,-71.064858,Commonwealth Psychology Associates,42.355988,-71.057493,Mental Health Office
1,2108,42.357768,-71.064858,ARTrelief,42.349788,-71.068452,Art Studio
2,2108,42.357768,-71.064858,"Robert Stern, MD",42.348881,-71.147549,Mental Health Office
3,2108,42.357768,-71.064858,Dr. Solomon Carter Fuller Mental Health Center,42.336363,-71.070939,Mental Health Office
4,2108,42.357768,-71.064858,Freedom Trail Clinic,42.363636,-71.062321,Mental Health Office
5,2108,42.357768,-71.064858,Psychological Care Associates,42.417725,-71.158112,Mental Health Office
6,2108,42.357768,-71.064858,"Charlene C. Chow, Lmhc, Mph",42.330116,-71.191246,Mental Health Office
7,2108,42.357768,-71.064858,Square Psychiatry,42.36263,-71.191553,Mental Health Office
8,2108,42.357768,-71.064858,Emerson College - Counseling Center,42.352099,-71.065478,Mental Health Office
9,2108,42.357768,-71.064858,Erich Lindemann Mental Health Center,42.363127,-71.06235,Mental Health Office


In [488]:
Boston_venues1 = Boston_venues1.rename(columns={"Venue Category": "v_category", "Venue Latitude": "v_lat", "Venue Longitude": "v_lng"})
Boston_venues1.head(5)

Unnamed: 0,ZIP,LAT,LNG,Venue,v_lat,v_lng,v_category
0,2108,42.357768,-71.064858,Commonwealth Psychology Associates,42.355988,-71.057493,Mental Health Office
1,2108,42.357768,-71.064858,ARTrelief,42.349788,-71.068452,Art Studio
2,2108,42.357768,-71.064858,"Robert Stern, MD",42.348881,-71.147549,Mental Health Office
3,2108,42.357768,-71.064858,Dr. Solomon Carter Fuller Mental Health Center,42.336363,-71.070939,Mental Health Office
4,2108,42.357768,-71.064858,Freedom Trail Clinic,42.363636,-71.062321,Mental Health Office


In [489]:
Boston_venues1.v_category.unique()

array(['Mental Health Office', 'Art Studio', 'Hospital Ward',
       'Medical Center', "Doctor's Office", 'Hospital', 'Emergency Room'],
      dtype=object)

In [490]:
Boston_mh = Boston_venues1[Boston_venues1['v_category'] == 'Mental Health Office'].reset_index(drop=True)
Boston_mh.head(10)

Unnamed: 0,ZIP,LAT,LNG,Venue,v_lat,v_lng,v_category
0,2108,42.357768,-71.064858,Commonwealth Psychology Associates,42.355988,-71.057493,Mental Health Office
1,2108,42.357768,-71.064858,"Robert Stern, MD",42.348881,-71.147549,Mental Health Office
2,2108,42.357768,-71.064858,Dr. Solomon Carter Fuller Mental Health Center,42.336363,-71.070939,Mental Health Office
3,2108,42.357768,-71.064858,Freedom Trail Clinic,42.363636,-71.062321,Mental Health Office
4,2108,42.357768,-71.064858,Psychological Care Associates,42.417725,-71.158112,Mental Health Office
5,2108,42.357768,-71.064858,"Charlene C. Chow, Lmhc, Mph",42.330116,-71.191246,Mental Health Office
6,2108,42.357768,-71.064858,Square Psychiatry,42.36263,-71.191553,Mental Health Office
7,2108,42.357768,-71.064858,Emerson College - Counseling Center,42.352099,-71.065478,Mental Health Office
8,2108,42.357768,-71.064858,Erich Lindemann Mental Health Center,42.363127,-71.06235,Mental Health Office
9,2108,42.357768,-71.064858,DCS Mental Health,42.409763,-71.105776,Mental Health Office


In [491]:
Boston_mh.sort_values("Venue", axis=0, ascending = True, inplace = True)
Boston_mh

Unnamed: 0,ZIP,LAT,LNG,Venue,v_lat,v_lng,v_category
109,2111,42.350518,-71.059077,305 Newbury - Psychotherapy Offices,42.348764,-71.085515,Mental Health Office
731,2151,42.41829,-71.001251,305 Newbury - Psychotherapy Offices,42.348764,-71.085515,Mental Health Office
54,2109,42.367032,-71.050493,305 Newbury - Psychotherapy Offices,42.348764,-71.085515,Mental Health Office
699,2150,42.396824,-71.031348,305 Newbury - Psychotherapy Offices,42.348764,-71.085515,Mental Health Office
678,2136,42.255083,-71.12922,305 Newbury - Psychotherapy Offices,42.348764,-71.085515,Mental Health Office
660,2135,42.349688,-71.153964,305 Newbury - Psychotherapy Offices,42.348764,-71.085515,Mental Health Office
80,2110,42.361962,-71.047846,305 Newbury - Psychotherapy Offices,42.348764,-71.085515,Mental Health Office
634,2134,42.358016,-71.128608,305 Newbury - Psychotherapy Offices,42.348764,-71.085515,Mental Health Office
592,2132,42.280455,-71.162017,305 Newbury - Psychotherapy Offices,42.348764,-71.085515,Mental Health Office
563,2131,42.284333,-71.126228,305 Newbury - Psychotherapy Offices,42.348764,-71.085515,Mental Health Office


In [492]:
print(Boston_mh.shape)

(913, 7)


In [493]:
Boston_mh.groupby('ZIP').count()

ZIP,LAT,LNG,Venue,v_lat,v_lng,v_category
2108,28,28,28,28,28,28
2109,29,29,29,29,29,29
2110,28,28,28,28,28,28
2111,27,27,27,27,27,27
2113,28,28,28,28,28,28
2114,29,29,29,29,29,29
2115,30,30,30,30,30,30
2116,29,29,29,29,29,29
2118,27,27,27,27,27,27
2119,28,28,28,28,28,28


In [494]:
print('There are {} unique categories.'.format(len(Boston_mh['v_category'].unique())))

There are 1 unique categories.


In [495]:
# one hot encoding
Boston_onehot = pd.get_dummies(Boston_mh[['v_category']], prefix="", prefix_sep="")

# add ZIP column back to dataframe
Boston_onehot['ZIP'] = Boston_mh['ZIP']

# move ZIP column to the first column
fixed_columns = [Boston_onehot.columns[-1]] + list(Boston_onehot.columns[:-1])
Boston_onehot = Boston_onehot[fixed_columns]

Boston_onehot.head(20)

Unnamed: 0,ZIP,Mental Health Office
109,2111,1
731,2151,1
54,2109,1
699,2150,1
678,2136,1
660,2135,1
80,2110,1
634,2134,1
592,2132,1
563,2131,1


In [496]:
Boston_onehot.shape

(913, 2)

In [497]:
Boston_grouped = Boston_onehot.groupby('ZIP').mean().reset_index()
Boston_grouped

Unnamed: 0,ZIP,Mental Health Office
0,2108,1
1,2109,1
2,2110,1
3,2111,1
4,2113,1
5,2114,1
6,2115,1
7,2116,1
8,2118,1
9,2119,1


In [498]:
Boston_grouped.shape

(33, 2)
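
Because only the Mental Health Office category remains after filtering, the mean of its one-hot column is 1 for every ZIP code, so the grouped values do not distinguish one ZIP code from another. A hedged sketch of aggregating with a sum instead, which gives the number of venues per ZIP code; the small frame stands in for `Boston_onehot` and its row counts are made up.

```python
import pandas as pd

# stand-in for Boston_onehot: one row per (ZIP, venue) pair
onehot_sample = pd.DataFrame({
    "ZIP": ["02108", "02108", "02108", "02109", "02109"],
    "Mental Health Office": [1, 1, 1, 1, 1],
})

means = onehot_sample.groupby("ZIP").mean().reset_index()
counts = onehot_sample.groupby("ZIP").sum().reset_index()

print(means)   # the mean is the same for every ZIP code
print(counts)  # the venue counts differ by ZIP code
```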