# Applied data science capston assignment 3 West Coast Vacation

Three cities are used in this mapping California, namely San Fracisco, Los Angeles, and San Diego to get the informaion og all its neighbors, and to run some clustering algos on it (with finding the best).

### Part 1: Initializing Coordinates

I will first import some libraries that I'll use for this assignment

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

Then I'll create a geolocator object for each city

In [81]:
address_sf = 'San Francisco, CA, US'

geolocator_sf = Nominatim(user_agent="sf_explorer")
location_sf = geolocator_sf.geocode(address_sf)
latitude_sf = location_sf.latitude
longitude_sf = location_sf.longitude
print('The geograpical coordinate of San Francisco are {}, {}.'.format(latitude_sf, longitude_sf))

The geograpical coordinate of San Francisco are 37.7792808, -122.4192363.


In [28]:
address_la = 'Los Angeles, CA'

geolocator_la = Nominatim(user_agent="la_explorer")
location_la = geolocator_la.geocode(address_la)
latitude_la = location_la.latitude
longitude_la = location_la.longitude
print('The geograpical coordinate of Los Angeles are {}, {}.'.format(latitude_la, longitude_la))

The geograpical coordinate of Los Angeles are 34.0536834, -118.2427669.


In [6]:
address_sd = 'San Diego, CA'

geolocator_sd = Nominatim(user_agent="sd_explorer")
location_sd = geolocator_sd.geocode(address_sd)
latitude_sd = location_sd.latitude
longitude_sd = location_sd.longitude
print('The geograpical coordinate of San Diego are {}, {}.'.format(latitude_sd, longitude_sd))

The geograpical coordinate of San Diego are 32.7174209, -117.1627714.


### Part 2: Scraping the Neighborhoods

I will now scrape wikipedia for the names of neighborhoods in each city

In [29]:
from bs4 import BeautifulSoup
import requests

In [30]:
source_sf = requests.get('https://en.wikipedia.org/wiki/List_of_neighborhoods_in_San_Francisco').text
source_la = requests.get('https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_of_Los_Angeles').text
source_sd = requests.get('https://en.wikipedia.org/wiki/List_of_communities_and_neighborhoods_of_San_Diego').text

These wikipedia pages are quite different from each other. So, I'll use different approaches in scraping them.

#### I'll start with San Francisco

In [44]:
soup_sf = BeautifulSoup(source_sf, 'lxml')

The most reliable way to get neighbourhood names in San Francisco is to use h2 header

In [48]:
sf_hood_list = soup_sf.find_all('h2')

In [49]:
tag_list =[]
for tag in sf_hood_list:
    tag_list.append(tag.text)

By looking at this tag list, we can see that we need to get rid of '[edit]' string at the end of each of them and we also need to get rid of first item on the list(word 'contents') and last four items on the list (references, links, etc.) that are not the names of neighbourhood

In [51]:
tag_list = tag_list[1:-4]

In [56]:
sf_list = []
for item in tag_list:
    sf_list.append(item[0:-6])

Now that we have a clean list of all San Francisco Neighbourhoods, we can find their coordinates and create a list of lists with neighborhood names, latitudes, and longitudes. We'll first convert a list to pandas dataframe and then create new columns for longitude and latitude

In [62]:
sf_coordinates = pd.DataFrame(sf_list, columns =['Neighbourhood'])

In [83]:
sf_coordinates['Latitude']=0
sf_coordinates['Longitude']=0

In [85]:
for item in sf_coordinates['Neighbourhood']:
    address = '{}, San Francisco, CA, US'.format(item)
    geolocator_sf = Nominatim(user_agent="sf_explorer")
    location_sf = geolocator_sf.geocode(address)
    if location_sf is not None:
        sf_coordinates['Latitude'].loc[sf_coordinates['Neighbourhood']==item] = location_sf.latitude
        sf_coordinates['Longitude'].loc[sf_coordinates['Neighbourhood']==item] = location_sf.longitude

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [87]:
sf_coordinates.head()

Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Alamo Square,37.776357,-122.434694
1,Anza Vista,37.780836,-122.443149
2,Ashbury Heights,0.0,0.0
3,Balboa Park,37.724949,-122.444805
4,Balboa Terrace,0.0,0.0


In [88]:
sf_coordinates.tail()

Unnamed: 0,Neighbourhood,Latitude,Longitude
114,West Portal,37.741141,-122.465634
115,Western Addition,37.779559,-122.42981
116,Westwood Highlands,0.0,0.0
117,Westwood Park,0.0,0.0
118,Yerba Buena,-5.735634,-79.043992


We can see now that some neighborhoods have 0s for the longitude and latitude values. We can also see that the last neighborhood is not located in San Francisco (probably due to name being the same as a town in Peru). I'll drop these entries for the simplicity of the task

In [89]:
sf_coordinates = sf_coordinates[sf_coordinates['Latitude']!=0].iloc[:-1, :]
sf_coordinates

Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Alamo Square,37.776357,-122.434694
1,Anza Vista,37.780836,-122.443149
3,Balboa Park,37.724949,-122.444805
5,Bayview,37.728889,-122.3925
6,Belden Place,37.791744,-122.403886
7,Bernal Heights,37.741001,-122.414214
8,Buena Vista,37.806532,-122.420648
10,Castro,37.760856,-122.434957
13,China Basin,37.77633,-122.391839
14,Chinatown,37.794301,-122.406376


And now we got it! Let's move on to Los Angeles

#### Los Angeles Neighbourhoods

I'll start by scraping the Wikipedia page first

In [115]:
soup_la = BeautifulSoup(source_la, 'lxml')

We can see that the names of neighborhoods are marked with li caption. They also have no class or id next to them. We will use it now to get a list of neighbourhoods

In [116]:
la_hood_list = soup_la.find_all('li',attrs= {'class':None, 'id': None})

In [117]:
tag_list = []
for hood in la_hood_list:
    tag_list.append(hood.text)

We still see other entries at the end of this list, I'll cut them out. I will also delete hyperlink residuals at the end of each neighbourhood name

In [118]:
tag_list = tag_list[:-40]

In [120]:
clean_la_list = []
for tag in tag_list:
    clean_la_list.append(tag.split('[')[0])

Now that we have a list of LA neighbourhoods we can put them in a dataframe and then find their coordinates like we did with San Francisco

In [122]:
la_coordinates = pd.DataFrame(clean_la_list, columns = ['Neighbourhood'])

In [123]:
la_coordinates['Latitude']=0
la_coordinates['Longitude']=0

In [125]:
for item in la_coordinates['Neighbourhood']:
    address = '{}, Los Angeles, CA, US'.format(item)
    geolocator_la = Nominatim(user_agent="la_explorer")
    location_la = geolocator_la.geocode(address)
    if location_la is not None:
        la_coordinates['Latitude'].loc[la_coordinates['Neighbourhood']==item] = location_la.latitude
        la_coordinates['Longitude'].loc[la_coordinates['Neighbourhood']==item] = location_la.longitude

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


And now I'll just cut all the neighborhoods, whose coordinates weren't found

In [127]:
la_coordinates = la_coordinates[la_coordinates['Latitude']!=0]
la_coordinates

Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Angelino Heights,34.070289,-118.254796
1,Arleta,34.241327,-118.432205
3,Arts District,34.041239,-118.23445
4,Atwater Village,34.116398,-118.256464
5,Baldwin Hills,34.007568,-118.350596
13,Beverly Glen,34.107785,-118.445636
16,Beverly Park,34.063769,-118.26469
17,Beverlywood,34.046633,-118.395038
18,Boyle Heights,34.033166,-118.204865
19,Brentwood,34.05214,-118.47407


We have it! Let's do the same for San Diego

#### Getting location data for San Diego

In [130]:
soup_sd = BeautifulSoup(source_sd, 'lxml')

Similar to LA list, San Diego neighbourhoods are marked by li caption and have no class or id. So, I'll just repeat procedure I did for LA

In [133]:
sd_hood_list = soup_sd.find_all('li', attrs = {'class':None, 'id':None})

In [138]:
tag_list = []
for hood in sd_hood_list:
    tag_list.append(hood.text)

We have a list, but it also contains some entries that are not neighbourhoods. It makes sense because this page had neighbourhoods and communities on it. Neighbourhoods are at the very bottom of the list, so I'll try to keep just them. I will also make the names look clean and sort out double entries

In [146]:
cleaner_list = tag_list[-139:-22]

In [151]:
sd_list = []
for entry in cleaner_list:
    if '(' in entry:
        sd_list.append(entry.split('(')[0])
    elif ')' in entry:
        sd_list.append(entry.split(')')[0])
    else:
        sd_list.append(entry)

We can now create the dataframe and get coordinates

In [148]:
sd_coordinates = pd.DataFrame(sd_list, columns = ['Neighbourhood'])

In [149]:
sd_coordinates['Latitude']=0
sd_coordinates['Longitude']=0

In [152]:
for item in sd_coordinates['Neighbourhood']:
    address = '{}, San Diego, CA, US'.format(item)
    geolocator_sd = Nominatim(user_agent="sd_explorer")
    location_sd = geolocator_sd.geocode(address)
    if location_sd is not None:
        sd_coordinates['Latitude'].loc[sd_coordinates['Neighbourhood']==item] = location_sd.latitude
        sd_coordinates['Longitude'].loc[sd_coordinates['Neighbourhood']==item] = location_sd.longitude

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [154]:
sd_coordinates = sd_coordinates[sd_coordinates['Latitude']!=0]
sd_coordinates

Unnamed: 0,Neighbourhood,Latitude,Longitude
1,Bay Park,32.784638,-117.202605
2,Carmel Valley,32.943434,-117.213979
3,Clairemont,32.819505,-117.18234
4,Del Mar Heights,32.948811,-117.250785
5,Del Mar Mesa,32.941434,-117.182535
6,La Jolla,32.83259,-117.271684
8,Mission Beach,32.782557,-117.252592
10,Pacific Beach,32.797827,-117.240318
11,Pacific Highlands Ranch,32.964098,-117.191977
12,Torrey Hills,32.913769,-117.225549


And we're done scrapping and searching!

### Part 3: Getting the Venues

In [155]:
sf_coordinates['City']='San Francisco'
la_coordinates['City']='Los Angeles'
sd_coordinates['City']='San Diego'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [165]:
all_cities = pd.concat([sf_coordinates, la_coordinates, sd_coordinates])

In [169]:
all_cities.reset_index(drop=True)

Unnamed: 0,Neighbourhood,Latitude,Longitude,City
0,Alamo Square,37.776357,-122.434694,San Francisco
1,Anza Vista,37.780836,-122.443149,San Francisco
2,Balboa Park,37.724949,-122.444805,San Francisco
3,Bayview,37.728889,-122.3925,San Francisco
4,Belden Place,37.791744,-122.403886,San Francisco
5,Bernal Heights,37.741001,-122.414214,San Francisco
6,Buena Vista,37.806532,-122.420648,San Francisco
7,Castro,37.760856,-122.434957,San Francisco
8,China Basin,37.77633,-122.391839,San Francisco
9,Chinatown,37.794301,-122.406376,San Francisco


##### Initialize Foursquare credentials

In [159]:
CLIENT_ID = 'U2NNANEUQUQ2NFUN2N25XHS1A4GLC0J15D2LIISWSIYLEKF0' # your Foursquare ID
CLIENT_SECRET = 'GY5UVEHHCPD5DUXCTHI0MXWVDO5ND4OWAO3SMMMWYX2NZNVT' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: U2NNANEUQUQ2NFUN2N25XHS1A4GLC0J15D2LIISWSIYLEKF0
CLIENT_SECRET:GY5UVEHHCPD5DUXCTHI0MXWVDO5ND4OWAO3SMMMWYX2NZNVT


In [171]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [172]:
west_coast_venues = getNearbyVenues(names=all_cities['Neighbourhood'],
                                   latitudes=all_cities['Latitude'],
                                   longitudes=all_cities['Longitude']
                                  )

Alamo Square
Anza Vista
Balboa Park
Bayview
Belden Place
Bernal Heights
Buena Vista
Castro
China Basin
Chinatown
Civic Center
Cole Valley
Corona Heights
Cow Hollow
Crocker-Amazon
Dogpatch
Dolores Heights
Duboce Triangle
Embarcadero
Excelsior
Fillmore
Financial District
Financial District South
Fisherman's Wharf
Forest Hill
Forest Knolls
Glen Park
Haight-Ashbury
Hayes Valley
Hunters Point
India Basin
Ingleside
Inner Sunset
Japantown
Jordan Park
Laguna Honda
Lake Street
Lakeside
Lakeshore
Little Saigon
Lone Mountain
Lower Haight
Lower Pacific Heights
Marina District
Mission Bay
Mission District
Mount Davidson
Nob Hill
Noe Valley
North Beach
North of Panhandle
Oceanview
Pacific Heights
Parkmerced
Parkside
Parnassus
Portola
Portola Place
Potrero Hill
Presidio
Presidio Heights
Richmond District
Rincon Hill
Russian Hill
Saint Francis Wood
Silver Terrace
South Beach
South of Market
South Park
Sunnydale
Sunnyside
Sunset District
Telegraph Hill
Tenderloin
Treasure Island
Twin Peaks
Union Square

In [175]:
print(west_coast_venues.shape)
west_coast_venues.head()

(9981, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Alamo Square,37.776357,-122.434694,Alamo Square,37.776062,-122.433622,Park
1,Alamo Square,37.776357,-122.434694,Painted Ladies,37.77612,-122.433389,Historic Site
2,Alamo Square,37.776357,-122.434694,Alamo Square Dog Park,37.775878,-122.43574,Dog Run
3,Alamo Square,37.776357,-122.434694,The Independent,37.775573,-122.437835,Rock Club
4,Alamo Square,37.776357,-122.434694,The Mill,37.776425,-122.43797,Bakery


In [178]:
print('There are {} uniques categories.'.format(len(west_coast_venues['Venue Category'].unique())))

There are 430 uniques categories.


In [179]:
west_coast_venues_cities = west_coast_venues.merge(all_cities, how='left', left_on = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude'], right_on=['Neighbourhood', 'Latitude', 'Longitude'])

In [185]:
west_coast_venues_cities= west_coast_venues_cities.drop(['Neighbourhood', 'Latitude', 'Longitude'], axis=1)

In [187]:
west_coast_venues_cities.shape

(9981, 8)

In [188]:
west_coast_venues_cities.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,City
0,Alamo Square,37.776357,-122.434694,Alamo Square,37.776062,-122.433622,Park,San Francisco
1,Alamo Square,37.776357,-122.434694,Painted Ladies,37.77612,-122.433389,Historic Site,San Francisco
2,Alamo Square,37.776357,-122.434694,Alamo Square Dog Park,37.775878,-122.43574,Dog Run,San Francisco
3,Alamo Square,37.776357,-122.434694,The Independent,37.775573,-122.437835,Rock Club,San Francisco
4,Alamo Square,37.776357,-122.434694,The Mill,37.776425,-122.43797,Bakery,San Francisco
