<b>Exploring San Francisco Neighborhoods</b>

<p><b>Business Problem</b></p>
      <p> <b>Which San Francisco neighborhood should I choose to live in?</b></p>


<b>1. Problem Description and Background</b>

<p>San Francisco is the fourth most populous city in California, located on the tip of a peninsula surrounded by the Pacific Ocean and San Francisco Bay. It is known for its iconic Golden Gate Bridge, cable cars, colorful Victorian houses, and year-round fog.</p>

<p>San Francisco is a city of cinematic, ethnic, and historic neighborhoods with a plethora of galleries, boutiques, alluring hiking trails and parks, and stylish local restaurants with a bustling nightlife. Each neighborhood carries its own charm and attracts young urban professionals and families from ethnically diverse backgrounds, including Irish, Russian, Hispanic, Italian, and Chinese roots. Today, San Francisco is a melting pot that includes families with babies and dogs, young urban professionals, tech workers, blue-collar workers, retirees, affluent residents, artists, hipsters, surfers, students, and homeless people.</p>

<p>How do people from such varied backgrounds choose a neighborhood to live in? One answer is lifestyle, i.e., their activities and interests: people tend to choose to live close to the venues they care about. For this reason, exploring San Francisco neighborhoods to find the venues in each one helps in making a better decision about where to live.</p>



<p><b>2. Data Description and Extraction</b></p>

<p>For the San Francisco neighborhood data, a Wikipedia page lists all the neighborhoods in tabular form. First, the page is scraped using Beautiful Soup; the result is then wrangled, cleaned, and read into a pandas data frame.</p>

<p>Next, the geospatial coordinates of each neighborhood are located using a geocoder. The neighborhood data is then merged with the geospatial data.</p>


<p><b>2.1 Import libraries</b></p>

In [2]:
#import libraries
import pandas as pd
import numpy as np

#Install beautifulsoup and html parser
!easy_install beautifulsoup4
!easy_install html5lib

# transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#import folium to render map
!conda install -c conda-forge folium=0.5.0 --yes
import folium

# import k-means from clustering stage
from sklearn.cluster import KMeans

print("Imported!")


<p><b>2.2 Scrape Wikipedia web page</b></p>

In [3]:
#Get page contents from the given URL
import requests
page = requests.get('https://en.wikipedia.org/wiki/List_of_neighborhoods_in_San_Francisco')

<p><b>2.3 Wrangle, Clean, and Read Data</b></p>

In [4]:
#Using BeautifulSoup to extract the page
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [5]:
#Extract the neighborhood table from the page
sf_table = soup.find_all('table')[2]

#get borough list
list_th = sf_table.find_all("th")
list_th.pop(0) #pop out the title of the table from header list

#get neighborhood names list
list_td = sf_table.find_all("td")
list_td = list_td[:-1]   #delete extra data from the end of the list

records = []
i = 0   #index of the borough header that matches the current cell
for td in list_td:
    text = td.get_text().rstrip()   #avoid shadowing the built-in str
    if text != '':
        borough_name = list_th[i].get_text().rstrip()
        #each non-empty cell holds one neighborhood per line
        for item in text.split('\n'):
            if item != '':
                records.append([borough_name, item])
        i = i + 1

#Create dataframe with table contents
header = ['Borough', 'Neighborhood']
sf_df = pd.DataFrame(data=records, columns=header)
print("There are 5 boroughs and {} neighborhoods in San Francisco".format(sf_df['Neighborhood'].count()))
print(sf_df.shape)
print(sf_df)



There are 5 boroughs and 60 neighborhoods in San Francisco
(60, 2)
              Borough           Neighborhood
0            Downtown              Chinatown
1            Downtown           Civic Center
2            Downtown     Financial District
3            Downtown         French Quarter
4            Downtown             Mid-Market
5            Downtown               Nob Hill
6            Downtown            North Beach
7            Downtown            Mission Bay
8            Downtown        South of Market
9            Downtown         Telegraph Hill
10           Downtown             Tenderloin
11           Downtown           Union Square
12  North of Downtown             Cow Hollow
13  North of Downtown      Fisherman's Wharf
14  North of Downtown        Marina District
15  North of Downtown        Pacific Heights
16  North of Downtown               Presidio
17  North of Downtown           Russian Hill
18  North of Downtown        Treasure Island
19  North of Downtown     Yerba B

<p><b>2.3 Locate Geospatial Coordinates for the Neighborhoods</b></p>

In [6]:
#import Nominatim
from geopy.geocoders import Nominatim

#create the geocoder once and reuse it for every lookup
geolocator = Nominatim(user_agent="sf_explorer")

latitude_list = []
longitude_list = []

for item in sf_df['Neighborhood']:
    #append city name to each neighborhood name
    address = item + ', San Francisco'
    location = geolocator.geocode(address)
    if location:
        latitude_list.append(location.latitude)
        longitude_list.append(location.longitude)
    else:
        #mark neighborhoods the geocoder could not find
        latitude_list.append('NA')
        longitude_list.append('NA')

#Make a copy of the sf_df dataframe and add the coordinates
sf_df_geo = sf_df.copy(deep=True)
sf_df_geo['Latitude'] = latitude_list
sf_df_geo['Longitude'] = longitude_list
print(sf_df_geo)

              Borough           Neighborhood Latitude Longitude
0            Downtown              Chinatown  52.3752   4.90094
1            Downtown           Civic Center  37.7796  -122.417
2            Downtown     Financial District  37.7936  -122.399
3            Downtown         French Quarter       NA        NA
4            Downtown             Mid-Market       NA        NA
5            Downtown               Nob Hill  37.7933  -122.415
6            Downtown            North Beach  37.8012  -122.409
7            Downtown            Mission Bay  37.7708  -122.391
8            Downtown        South of Market  37.7809  -122.401
9            Downtown         Telegraph Hill  37.8027  -122.406
10           Downtown             Tenderloin  37.7842  -122.414
11           Downtown           Union Square  37.7879  -122.408
12  North of Downtown             Cow Hollow  37.7973  -122.436
13  North of Downtown      Fisherman's Wharf  37.8092  -122.417
14  North of Downtown        Marina Dist
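
<p>The lookup loop sends one request per neighborhood. The sketch below shows one possible way to space out the requests and to record failed lookups explicitly. The helper name <code>geocode_neighborhoods</code>, the <code>delay_seconds</code> parameter, the "San Francisco, CA" suffix, and the offline stand-in geocoder are all assumptions made for illustration; in the notebook, the callable passed in would be <code>geolocator.geocode</code>.</p>

```python
import time


def geocode_neighborhoods(names, geocode, city="San Francisco, CA", delay_seconds=1.0):
    """Return {name: (lat, lon)}, with (None, None) for failed lookups.

    `geocode` is any callable that takes an address string and returns an
    object with `.latitude` and `.longitude`, or None when nothing is found.
    """
    coordinates = {}
    for index, name in enumerate(names):
        if index > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        location = geocode("{}, {}".format(name, city))
        if location is None:
            coordinates[name] = (None, None)
        else:
            coordinates[name] = (location.latitude, location.longitude)
    return coordinates


# Stand-in geocoder so the sketch runs offline (hypothetical, not a real service).
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    known = {"Nob Hill, San Francisco, CA": FakeLocation(37.7933, -122.415)}
    return known.get(address)


result = geocode_neighborhoods(["Nob Hill", "French Quarter"], fake_geocode, delay_seconds=0)
print(result)
```

<p>A one-second pause matches the public Nominatim usage policy of at most one request per second. Adding the state to the query may help the geocoder disambiguate common names, but that is untested here.</p>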

<p><b>Observation: </b></p>
<p>A few neighborhoods have missing or wrong coordinates. For example, French Quarter and Mid-Market were not found, and Chinatown resolved to a location outside San Francisco.</p>
<p><b>Solution: </b></p>
<p>Export the sf_df_geo data frame to a CSV file, correct the missing and wrong coordinates manually, and read the corrected CSV file back into a data frame.</p>
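
<p>Before editing the file by hand, it can help to list which rows need attention. The sketch below flags coordinates that are missing or that fall outside a rough bounding box around San Francisco. The box limits and the helper name <code>needs_manual_fix</code> are assumptions for illustration, not values from the notebook; the sample rows are taken from the geocoding output.</p>

```python
# Approximate bounding box for San Francisco (assumed values, not from the source).
SF_LAT_RANGE = (37.70, 37.84)
SF_LON_RANGE = (-122.53, -122.35)


def needs_manual_fix(latitude, longitude):
    """Return True when a coordinate pair is missing or falls outside the box."""
    try:
        lat, lon = float(latitude), float(longitude)
    except (TypeError, ValueError):
        return True  # 'NA', None, or any other non-numeric placeholder
    in_lat = SF_LAT_RANGE[0] <= lat <= SF_LAT_RANGE[1]
    in_lon = SF_LON_RANGE[0] <= lon <= SF_LON_RANGE[1]
    return not (in_lat and in_lon)


rows = [
    ("Chinatown", 52.3752, 4.90094),
    ("Civic Center", 37.7796, -122.417),
    ("French Quarter", "NA", "NA"),
]
flagged = [name for name, lat, lon in rows if needs_manual_fix(lat, lon)]
print(flagged)
```

<p>A check like this only catches gross errors; a coordinate that lands in the wrong part of the city would still need to be verified by eye.</p>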

In [30]:
# The code was removed by Watson Studio for sharing.

This cell is hidden because it contains credentials for IBM Cloud Object Storage.
It reads the updated CSV file into the sf_df_geocodes dataframe.


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2
0,Alamo Square,37.776357,-122.434694
1,Anza Vista,37.780836,-122.443149
2,Bayview,37.728889,-122.3925
3,Bernal Heights,37.741001,-122.414214
4,Castro,37.7609,-122.435


In [31]:
sf_df_geocodes.columns = ['Neighborhood', 'Latitude', 'Longitude']
sf_df_geocodes.head(5)

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Alamo Square,37.776357,-122.434694
1,Anza Vista,37.780836,-122.443149
2,Bayview,37.728889,-122.3925
3,Bernal Heights,37.741001,-122.414214
4,Castro,37.7609,-122.435


<p><b>2.4 Add geocodes to sf_df dataframe</b></p>

In [32]:
#inner merge on the shared 'Neighborhood' column; rows without a match are dropped
sf_df = sf_df.merge(sf_df_geocodes, on='Neighborhood', how='inner')
print(sf_df)

              Borough        Neighborhood   Latitude   Longitude
0            Downtown           Chinatown  37.794100 -122.407800
1            Downtown        Civic Center  37.779594 -122.416794
2            Downtown  Financial District  37.793647 -122.398938
3            Downtown          Mid-Market  37.780500 -122.412500
4            Downtown            Nob Hill  37.793262 -122.415249
5            Downtown         North Beach  37.801175 -122.409002
6            Downtown         Mission Bay  37.770774 -122.391171
7            Downtown     South of Market  37.780893 -122.400952
8            Downtown      Telegraph Hill  37.802730 -122.405851
9            Downtown          Tenderloin  37.784249 -122.413993
10           Downtown        Union Square  37.787936 -122.407517
11  North of Downtown          Cow Hollow  37.797262 -122.436248
12  North of Downtown   Fisherman's Wharf  37.809167 -122.416599
13  North of Downtown     Marina District  37.802984 -122.437472
14  North of Downtown    
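
<p>An inner merge drops any neighborhood that has no matching row in the geocode table, and French Quarter appears to be absent from the merged output for this reason. The sketch below shows one way to list such rows, using a left merge with the <code>indicator</code> option of <code>DataFrame.merge</code>. The two small frames are illustrative stand-ins for <code>sf_df</code> and <code>sf_df_geocodes</code>.</p>

```python
import pandas as pd

# Small stand-ins for sf_df and sf_df_geocodes (illustrative subset).
neighborhoods = pd.DataFrame({
    "Borough": ["Downtown", "Downtown", "Downtown"],
    "Neighborhood": ["Chinatown", "French Quarter", "Mid-Market"],
})
geocodes = pd.DataFrame({
    "Neighborhood": ["Chinatown", "Mid-Market"],
    "Latitude": [37.7941, 37.7805],
    "Longitude": [-122.4078, -122.4125],
})

# A left merge keeps every neighborhood; `_merge` records where each row matched.
checked = neighborhoods.merge(geocodes, on="Neighborhood", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Neighborhood"].tolist()
print(unmatched)
```

<p>Listing the unmatched names makes it easier to decide whether a neighborhood was left out on purpose or is missing from the corrected file because of a spelling difference.</p>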

<p><b>2.5 Use the Foursquare API to get the list of venues in each San Francisco neighborhood</b></p>

In [33]:
# The code was removed by Watson Studio for sharing.

Initialized Foursquare credentials: client_id, client_secret, and version in this cell.


In [34]:
def getNearbyVenues(names, latitudes, longitudes, radius=250, LIMIT=100):
    """Return a dataframe of Foursquare venues within `radius` meters of each neighborhood."""
    url = 'https://api.foursquare.com/v2/venues/explore'
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):

        # build the query parameters for the API request
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }

        # make the GET request and keep the list of recommended items
        results = requests.get(url, params=params).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
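
<p>The list comprehension in <code>getNearbyVenues</code> assumes every venue carries at least one category. The sketch below shows a more defensive way to flatten the response items. The helper name <code>flatten_venue_items</code> is hypothetical, and the sample payload is hand-written: it contains only the fields the notebook reads, and its second entry is made up to exercise the missing-category case.</p>

```python
def flatten_venue_items(name, lat, lng, items):
    """Turn raw `items` from the explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Fall back to None when a venue carries no category
        category = categories[0].get("name") if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hand-written payload; the second venue is a made-up edge case.
sample_items = [
    {"venue": {"name": "Eastern Bakery",
               "location": {"lat": 37.793776, "lng": -122.406178},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 37.7940, "lng": -122.4070},
               "categories": []}},
]
rows = flatten_venue_items("Chinatown", 37.7941, -122.4078, sample_items)
print(rows)
```

<p>Keeping the parsing in a separate function also makes it possible to test it without calling the API.</p>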

In [35]:
#Get venues in each neighborhood of San Francisco
sf_venues = getNearbyVenues(names=sf_df['Neighborhood'],
                                   latitudes=sf_df['Latitude'],
                                   longitudes=sf_df['Longitude']
                                  )

print(sf_venues.shape)
print(sf_venues.head(5))

(1217, 7)
  Neighborhood  Neighborhood Latitude  Neighborhood Longitude  \
0    Chinatown                37.7941               -122.4078   
1    Chinatown                37.7941               -122.4078   
2    Chinatown                37.7941               -122.4078   
3    Chinatown                37.7941               -122.4078   
4    Chinatown                37.7941               -122.4078   

                               Venue  Venue Latitude  Venue Longitude  \
0            Red Blossom Tea Company       37.794643      -122.406379   
1                       Mister Jiu's       37.793790      -122.406615   
2                      STEAP TEA BAR       37.793359      -122.406573   
3  Golden Star Vietnamese Restaurant       37.794526      -122.405603   
4                     Eastern Bakery       37.793776      -122.406178   

          Venue Category  
0               Tea Room  
1     Chinese Restaurant  
2        Bubble Tea Shop  
3  Vietnamese Restaurant  
4                 Bakery 

<p><b>Count of venues in each neighborhood of San Francisco</b></p>

In [36]:
sf_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Alamo Square,5,5,5,5,5,5
Anza Vista,4,4,4,4,4,4
Bernal Heights,17,17,17,17,17,17
Castro,69,69,69,69,69,69
Chinatown,27,27,27,27,27,27
Civic Center,27,27,27,27,27,27
Cole Valley,11,11,11,11,11,11
Corona Heights,9,9,9,9,9,9
Cow Hollow,41,41,41,41,41,41
Crocker-Amazon,2,2,2,2,2,2
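
<p>Because <code>count()</code> repeats the same number in every column, a single count per neighborhood can be easier to read. The sketch below uses <code>groupby(...).size()</code> on a small stand-in for <code>sf_venues</code>. The Chinatown rows come from the output shown earlier, while the Castro rows use placeholder venue names and categories.</p>

```python
import pandas as pd

# Small stand-in for sf_venues; "Venue A" and "Venue B" are placeholders.
venues = pd.DataFrame({
    "Neighborhood": ["Chinatown", "Chinatown", "Chinatown", "Castro", "Castro"],
    "Venue": ["Red Blossom Tea Company", "Mister Jiu's", "Eastern Bakery", "Venue A", "Venue B"],
    "Venue Category": ["Tea Room", "Chinese Restaurant", "Bakery", "Bar", "Bar"],
})

# One row per neighborhood, sorted from most to fewest venues
venue_counts = (
    venues.groupby("Neighborhood")
    .size()
    .sort_values(ascending=False)
    .rename("Venue Count")
)
print(venue_counts)
```

<p>Sorting the counts makes it easy to spot neighborhoods with very few venues, which may be too sparse to characterize reliably.</p>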


In [37]:
print("Number of unique venue categories identified across San Francisco neighborhoods: {}".format(len(sf_venues['Venue Category'].unique())))

Number of unique venue categories identified across San Francisco neighborhoods: 243
