# Capstone Project - The Battle of Neighborhoods (Week 1 & 2)

## A description of the problem and a discussion of the background

In a hypothetical scenario, I'm a businessman who travels a lot. On every trip I spend around one month in the city, so it is very important for me to stay in a neighborhood with a wide diversity of venues.

<font color='blue'>My next trip is to San Francisco. In which neighborhood should I stay?</font>

**Notes:**
- I spend all day out of the hotel, so the hotel itself is not a big issue;
- As I said, I'm travelling on business, so the hotel cost is not a concern either;
- What I need is to be in an area that maximizes the chances of finding anything I might need.

## A description of the data and how it will be used to solve the problem

Data required:
1. List of neighborhoods of San Francisco
2. List of venues of each neighborhood

Approach:
- Step 1:
After collecting the data, understand how the pieces relate to each other. The expected hierarchy is: San Francisco has neighborhoods > neighborhoods have venues.
- Step 2:
Create an analytical dataset that connects the data as described in Step 1.
- Step 3:
Do an exploratory analysis of all the data.
- Step 4:
If needed, cluster the neighborhoods based on their venue categories.
- Step 5:
Analyze the diversity of venues.
- Step 6:
Plot the 5 best neighborhoods on a map.
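The hierarchy from Step 1 and the flat dataset from Step 2 can be pictured with a very small sketch. The nested dictionary and the `flatten` helper below are illustrative assumptions, not part of the notebook's pipeline, and the venue categories are sample values only.

```python
# Hypothetical miniature of the data model: one city, its neighborhoods,
# and the venue categories found in each neighborhood (sample values).
city = {
    "San Francisco": {
        "Alamo Square": ["Park", "Dog Run", "Historic Site"],
        "Hayes Valley": ["Coffee Shop", "Park"],
    }
}

def flatten(city_data):
    """Turn the nested city > neighborhood > venue structure into flat rows."""
    rows = []
    for city_name, neighborhoods in city_data.items():
        for neighborhood, categories in neighborhoods.items():
            for category in categories:
                rows.append((city_name, neighborhood, category))
    return rows

rows = flatten(city)
for row in rows:
    print(row)
```

Each row carries its neighborhood with it, which is the shape that a later group-by-neighborhood analysis needs.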

# Libraries

In [1]:
!pip install bs4
!pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [47]:
import pandas as pd
import plotly.express as px

from bs4 import BeautifulSoup # this module helps with web scraping
import requests  # this module helps us to download a web page

# Collecting Data

## Web scraping (San Francisco's Neighborhoods)

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_neighborhoods_in_San_Francisco'
data  = requests.get(url).text
soup = BeautifulSoup(data,"html5lib")


San Francisco's neighborhood list:

In [4]:
lis = soup.find_all("span", {"class": "toctext"})
lis = [x.text for x in lis[:-4]]
lis

['Alamo Square',
 'Anza Vista',
 'Ashbury Heights',
 'Balboa Hollow',
 'Balboa Terrace',
 'The Bayview',
 'Belden Place',
 'Bernal Heights',
 'Buena Vista',
 'Butchertown (Old and New)',
 'The Castro',
 'Cathedral Hill',
 'Cayuga Terrace',
 'China Basin',
 'Chinatown',
 'Civic Center',
 'Clarendon Heights',
 'Cole Valley',
 'Corona Heights',
 'Cow Hollow',
 'Crocker-Amazon',
 'Design District',
 'Diamond Heights',
 'Dogpatch',
 'Dolores Heights',
 'Duboce Triangle',
 'The Embarcadero',
 'Eureka Valley',
 'The Excelsior',
 'The Fillmore',
 'The Financial District',
 'The Financial District South',
 "Fisherman's Wharf",
 'Forest Hill',
 'Forest Knolls',
 'Glen Park',
 'Golden Gate Heights',
 'The Haight',
 'Hayes Valley',
 'Hunters Point',
 'India Basin',
 'Ingleside',
 'Ingleside Terraces',
 'The Inner Sunset',
 'Irish Hill',
 'Islais Creek',
 'Jackson Square',
 'Japantown',
 'Jordan Park',
 'Laguna Honda',
 'Lake Street',
 'Lakeside',
 'Lakeshore',
 'Laurel Heights',
 'Lincoln Manor',


## Populate latlng for each San Francisco neighborhood

Create the dataframe for the neighborhoods

In [5]:
df = pd.DataFrame({'Neighborhood': lis})
print(df.shape)
df.head()

(119, 1)


| | Neighborhood |
|---|---|
| 0 | Alamo Square |
| 1 | Anza Vista |
| 2 | Ashbury Heights |
| 3 | Balboa Hollow |
| 4 | Balboa Terrace |


The geocoding API doesn't work properly for 'Sunnyside' and 'Vista del Mar', so I dropped them.

In [6]:
df.drop(df[df['Neighborhood'].isin(['Sunnyside', 'Vista del Mar'])].index, inplace=True)
print(df.shape)
df.head()

(117, 1)


| | Neighborhood |
|---|---|
| 0 | Alamo Square |
| 1 | Anza Vista |
| 2 | Ashbury Heights |
| 3 | Balboa Hollow |
| 4 | Balboa Terrace |


In [7]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# create the geocoder once instead of once per row
geolocator = Nominatim(user_agent="ny_explorer")

def calc_latlng(row):
  neighbor = row['Neighborhood'].split(',')[0]
  address = '{}, San Francisco, CA'.format(neighbor)

  location = geolocator.geocode(address)
  if location:
    print('The geographical coordinates of {} are {}, {}.'.format(neighbor, location.latitude, location.longitude))
    return [location.latitude, location.longitude]
  else:
    print('error -> ', neighbor, '\n\n')
    # [0.0, 0.0] marks a failed lookup; these rows are dropped later
    return [0.0, 0.0]

In [8]:
df['latlng'] = df.apply(calc_latlng, axis=1)

df.head()

The geographical coordinates of Alamo Square are 37.7763599, -122.43470002366266.
The geographical coordinates of Anza Vista are 37.7808364, -122.4431489.
error ->  Ashbury Heights 


The geographical coordinates of Balboa Hollow are 37.798793700000004, -122.43609848645681.
The geographical coordinates of Balboa Terrace are 32.809471, -117.208557.
The geographical coordinates of The Bayview are 37.7288889, -122.3925.
The geographical coordinates of Belden Place are 37.791744, -122.4038861.
The geographical coordinates of Bernal Heights are 37.7429861, -122.4158042.
The geographical coordinates of Buena Vista are 37.8065321, -122.4206485.
error ->  Butchertown (Old and New) 


The geographical coordinates of The Castro are 37.7608561, -122.434957.
error ->  Cathedral Hill 


The geographical coordinates of Cayuga Terrace are 37.7302967, -122.4329293473373.
The geographical coordinates of China Basin are 37.7771799, -122.3866825.
The geographical coordinates of Chinatown are 37.7943011, -122.4063757.
The geogra

| | Neighborhood | latlng |
|---|---|---|
| 0 | Alamo Square | [37.7763599, -122.43470002366266] |
| 1 | Anza Vista | [37.7808364, -122.4431489] |
| 2 | Ashbury Heights | [0.0, 0.0] |
| 3 | Balboa Hollow | [37.798793700000004, -122.43609848645681] |
| 4 | Balboa Terrace | [32.809471, -117.208557] |


In [9]:
df['Latitude'] = df['latlng'].apply(lambda x: x[0])
df['Longitude'] = df['latlng'].apply(lambda x: x[1])
df.drop(columns=['latlng'], inplace=True)
df.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Alamo Square | 37.77636 | -122.4347 |
| 1 | Anza Vista | 37.780836 | -122.443149 |
| 2 | Ashbury Heights | 0.0 | 0.0 |
| 3 | Balboa Hollow | 37.798794 | -122.436098 |
| 4 | Balboa Terrace | 32.809471 | -117.208557 |


Remove the rows for which the geocoder did not find coordinates:

In [10]:
df.drop(df[df['Latitude'] == 0.0].index, inplace=True)
print(df.shape)
df.head()

(99, 3)


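| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Alamo Square | 37.77636 | -122.4347 |
| 1 | Anza Vista | 37.780836 | -122.443149 |
| 3 | Balboa Hollow | 37.798794 | -122.436098 |
| 4 | Balboa Terrace | 32.809471 | -117.208557 |
| 5 | The Bayview | 37.728889 | -122.3925 |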
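Dropping the `[0.0, 0.0]` rows removes failed lookups, but it does not catch lookups that succeeded with the wrong place. In the output above, Balboa Terrace is placed at latitude 32.8, roughly five degrees south of the other neighborhoods. A possible extra check, sketched below, is to flag coordinates that fall outside a rough bounding box around the city. The box limits are assumed, illustrative values, not an official boundary.

```python
# Assumed rough bounding box for San Francisco (illustrative values only).
SF_BOUNDS = {"lat_min": 37.70, "lat_max": 37.84, "lng_min": -122.52, "lng_max": -122.35}

def in_san_francisco(lat, lng, bounds=SF_BOUNDS):
    """Return True when the point lies inside the assumed bounding box."""
    return (bounds["lat_min"] <= lat <= bounds["lat_max"]
            and bounds["lng_min"] <= lng <= bounds["lng_max"])

# Coordinates copied from the dataframe output shown above.
geocoded = [
    ("Alamo Square", 37.77636, -122.4347),
    ("Anza Vista", 37.780836, -122.443149),
    ("Balboa Hollow", 37.798794, -122.436098),
    ("Balboa Terrace", 32.809471, -117.208557),
    ("The Bayview", 37.728889, -122.3925),
]

suspicious = [name for name, lat, lng in geocoded if not in_san_francisco(lat, lng)]
print(suspicious)  # ['Balboa Terrace']
```

A point inside the box can still be the wrong place within the city, so this only catches the grossest errors.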
5,The Bayview,37.728889,-122.3925


## Venues data from Foursquare (venues location and rating)

Define Foursquare Credentials and Version

In [11]:
import os

# Read the credentials from environment variables instead of hardcoding them in the notebook.
# The variable names below are an assumed convention; adapt them to your setup.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
ACCESS_TOKEN = os.environ.get('FOURSQUARE_ACCESS_TOKEN') # your Foursquare Access Token
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value


Find venue locations

In [21]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'],
            v['venue']['id'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Id', 
                  'Venue Latitude', 
                  'Venue Longitude',
                  'Venue Category']
    
    return nearby_venues
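The function above indexes `v['venue']['categories'][0]` directly, which raises an `IndexError` if a venue comes back with an empty category list. Whether the API actually returns such venues is an assumption here. The sketch below shows one way the row extraction could be guarded. `extract_venue_row` is a hypothetical helper, and the sample items only mimic the fields that the function above reads.

```python
def extract_venue_row(neighborhood, lat, lng, item):
    """Build one output row from a single response item.

    Returns None when a field that the analysis needs is missing.
    """
    venue = item.get("venue", {})
    location = venue.get("location", {})
    categories = venue.get("categories") or []
    if not categories or "lat" not in location or "lng" not in location:
        return None
    return (neighborhood, lat, lng, venue.get("name"), venue.get("id"),
            location["lat"], location["lng"], categories[0]["name"])

# Sample item using values from the venues output shown in this notebook.
complete = {"venue": {"name": "Painted Ladies", "id": "4b9afa7ef964a520c1e835e3",
                      "location": {"lat": 37.77612, "lng": -122.433389},
                      "categories": [{"name": "Historic Site"}]}}
# Hypothetical item with no category, to exercise the guard.
uncategorized = {"venue": {"name": "Unnamed spot", "id": "hypothetical-id",
                           "location": {"lat": 37.0, "lng": -122.0},
                           "categories": []}}

items = [complete, uncategorized]
rows = [extract_venue_row("Alamo Square", 37.77636, -122.4347, item) for item in items]
rows = [row for row in rows if row is not None]
print(rows)
```

Skipping incomplete items keeps one bad record from aborting the whole loop over neighborhoods, at the cost of silently undercounting venues.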

In [22]:
venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

In [23]:
print(venues.shape)
venues.head()

(4506, 8)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Id | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | Alamo Square | 37.77636 | -122.4347 | Alamo Square | 4460d38bf964a5200a331fe3 | 37.776032 | -122.433992 | Park |
| 1 | Alamo Square | 37.77636 | -122.4347 | Alamo Square Dog Park | 4c2f7b013896e21e7efee390 | 37.775878 | -122.43574 | Dog Run |
| 2 | Alamo Square | 37.77636 | -122.4347 | Painted Ladies | 4b9afa7ef964a520c1e835e3 | 37.77612 | -122.433389 | Historic Site |
| 3 | Alamo Square | 37.77636 | -122.4347 | Lucinda’s Deli | 5ed16b62fadd0100088b2ee3 | 37.774757 | -122.436239 | Sandwich Place |
| 4 | Alamo Square | 37.77636 | -122.4347 | The Independent | 4249ec00f964a5208f201fe3 | 37.775573 | -122.437835 | Rock Club |


# Exploratory analysis

## Plotting San Francisco's neighborhoods on a map

In [92]:
address = 'San Francisco, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Francisco City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of San Francisco City are 37.7790262, -122.419906.


In [93]:
import folium # map rendering library

# create map of San Francisco using latitude and longitude values
map_sf = folium.Map(location=[latitude, longitude], zoom_start=12)

# add one marker per neighborhood to the map
for lat, lng, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sf)  
    
map_sf

## Diversity of venue categories per neighborhood

In [33]:
venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Id | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | Alamo Square | 37.77636 | -122.4347 | Alamo Square | 4460d38bf964a5200a331fe3 | 37.776032 | -122.433992 | Park |
| 1 | Alamo Square | 37.77636 | -122.4347 | Alamo Square Dog Park | 4c2f7b013896e21e7efee390 | 37.775878 | -122.43574 | Dog Run |
| 2 | Alamo Square | 37.77636 | -122.4347 | Painted Ladies | 4b9afa7ef964a520c1e835e3 | 37.77612 | -122.433389 | Historic Site |
| 3 | Alamo Square | 37.77636 | -122.4347 | Lucinda’s Deli | 5ed16b62fadd0100088b2ee3 | 37.774757 | -122.436239 | Sandwich Place |
| 4 | Alamo Square | 37.77636 | -122.4347 | The Independent | 4249ec00f964a5208f201fe3 | 37.775573 | -122.437835 | Rock Club |


In [68]:
# count distinct venues and distinct venue categories per neighborhood
div_df = (venues[['Neighborhood', 'Venue', 'Venue Category']]
          .groupby(by=['Neighborhood']).nunique().reset_index()
          .sort_values(by=['Venue', 'Venue Category'], ascending=False))
div_df.rename(columns={'Neighborhood': 'neighborhood', 'Venue':'count_venue', 'Venue Category': 'count_category'}, inplace=True)

print(div_df.shape)
div_df.head()

(99, 3)


| | neighborhood | count_venue | count_category |
|---|---|---|---|
| 2 | Balboa Hollow | 100 | 67 |
| 24 | Hayes Valley | 100 | 67 |
| 14 | Cow Hollow | 100 | 66 |
| 83 | The Marina | 100 | 64 |
| 94 | Union Square | 100 | 61 |
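The two counts in `div_df` are distinct counts per neighborhood: how many different venue names and how many different venue categories were returned. The following plain-Python sketch reproduces that idea on a handful of rows. The Alamo Square rows come from the venues table above, while "Example Heights" and its venues are hypothetical and only there to show that repeated categories are counted once.

```python
from collections import defaultdict

# (neighborhood, venue, category) rows.
rows = [
    ("Alamo Square", "Alamo Square", "Park"),
    ("Alamo Square", "Alamo Square Dog Park", "Dog Run"),
    ("Alamo Square", "Painted Ladies", "Historic Site"),
    ("Alamo Square", "Lucinda’s Deli", "Sandwich Place"),
    ("Alamo Square", "The Independent", "Rock Club"),
    ("Example Heights", "Cafe A", "Coffee Shop"),  # hypothetical
    ("Example Heights", "Cafe B", "Coffee Shop"),  # hypothetical
]

venues_by_hood = defaultdict(set)
categories_by_hood = defaultdict(set)
for hood, venue, category in rows:
    venues_by_hood[hood].add(venue)
    categories_by_hood[hood].add(category)

# (count_venue, count_category) per neighborhood
diversity = {hood: (len(venues_by_hood[hood]), len(categories_by_hood[hood]))
             for hood in venues_by_hood}
print(diversity)  # {'Alamo Square': (5, 5), 'Example Heights': (2, 1)}
```

A neighborhood with many venues of the same kind scores high on the first count and low on the second, which is why the analysis looks at both.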


In [76]:
fig_hist = px.histogram(div_df, x='count_venue', nbins=10,
                        title='Histogram venues',
                        width=400, height=300)
fig_hist.show()

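**First insight:** Only 8 neighborhoods reach 100 venues. Because each Foursquare request is capped at `LIMIT = 100` results, 100 is the maximum a neighborhood can show here, not necessarily its true number of venues.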

In [81]:
fig_hist = px.histogram(div_df, x='count_category', #nbins=10,
                         title='Histogram categories',
                         width=400, height=300)
fig_hist.show()

**Second insight:** Only 13 neighborhoods have more than 60 categories of venues.

In [84]:
div_df.head(13)

| | neighborhood | count_venue | count_category |
|---|---|---|---|
| 2 | Balboa Hollow | 100 | 67 |
| 24 | Hayes Valley | 100 | 67 |
| 14 | Cow Hollow | 100 | 66 |
| 83 | The Marina | 100 | 64 |
| 94 | Union Square | 100 | 61 |
| 72 | Telegraph Hill | 100 | 53 |
| 90 | The Tenderloin | 100 | 53 |
| 51 | North Beach | 100 | 49 |
| 67 | South Beach | 99 | 55 |
| 74 | The Castro | 98 | 69 |


In [98]:
top5_div_df = div_df[(div_df.count_venue >= 100) & (div_df.count_category >= 60)]
print(top5_div_df.shape)
top5_div_df

(5, 3)


| | neighborhood | count_venue | count_category |
|---|---|---|---|
| 2 | Balboa Hollow | 100 | 67 |
| 24 | Hayes Valley | 100 | 67 |
| 14 | Cow Hollow | 100 | 66 |
| 83 | The Marina | 100 | 64 |
| 94 | Union Square | 100 | 61 |


**Third insight:** Only the 5 neighborhoods above have at least 100 venues and at least 60 venue categories.
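The selection rule can be checked by hand against the ten `div_df` rows printed earlier. The sketch below applies the same two thresholds in plain Python. The constant names are illustrative, and the rows are copied from that output.

```python
# (neighborhood, count_venue, count_category), copied from the div_df rows shown earlier.
rows = [
    ("Balboa Hollow", 100, 67),
    ("Hayes Valley", 100, 67),
    ("Cow Hollow", 100, 66),
    ("The Marina", 100, 64),
    ("Union Square", 100, 61),
    ("Telegraph Hill", 100, 53),
    ("The Tenderloin", 100, 53),
    ("North Beach", 100, 49),
    ("South Beach", 99, 55),
    ("The Castro", 98, 69),
]

MIN_VENUES = 100      # same threshold as the notebook filter
MIN_CATEGORIES = 60   # same threshold as the notebook filter

top = [name for name, n_venues, n_categories in rows
       if n_venues >= MIN_VENUES and n_categories >= MIN_CATEGORIES]
print(top)
# ['Balboa Hollow', 'Hayes Valley', 'Cow Hollow', 'The Marina', 'Union Square']
```

Note that The Castro has the highest category count among these rows (69) but is excluded because it has 98 venues, so the result is sensitive to where the venue threshold is set.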

# Plot the best 5 neighborhoods

In [101]:
latlng_top5_div_df = df[df.Neighborhood.isin(top5_div_df.neighborhood)]

In [107]:
# create map of San Francisco using latitude and longitude values
map_top5 = folium.Map(location=[latitude, longitude], zoom_start=13)

# add one marker per top-5 neighborhood to the map
for lat, lng, neighborhood in zip(latlng_top5_div_df['Latitude'], latlng_top5_div_df['Longitude'], latlng_top5_div_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_top5)  
    
map_top5

# Result

As we can see on the map, 3 of our 5 best neighborhoods are very close to each other, so I decided that the best one for me is the neighborhood geographically in the middle of those 3.
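The "in the middle" choice was made by eye from the map. One way to make it reproducible, sketched here as an assumption rather than the method used in this notebook, is to pick the point with the smallest total great-circle distance to the others. The names `A`, `B`, `C` and their coordinates are hypothetical placeholders, not the geocoded neighborhoods.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def most_central(points):
    """Return the name whose summed distance to all other points is smallest."""
    def total_distance(name):
        lat, lng = points[name]
        return sum(haversine_km(lat, lng, *points[other])
                   for other in points if other != name)
    return min(points, key=total_distance)

# Hypothetical coordinates for three nearby places, laid out roughly in a line.
points = {
    "A": (37.800, -122.440),
    "B": (37.799, -122.436),
    "C": (37.798, -122.432),
}
print(most_central(points))  # B
```

With only three candidates the map is usually enough, but the same function scales to a longer shortlist.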

**On my trip to San Francisco I'll be staying in:** <font color='blue'>the 'Balboa Hollow' neighborhood.</font>