<h2>Coursera Applied Data Science Capstone Notebook For Sid Mitzlaff</h2>

This notebook contains Sid Mitzlaff's capstone project code for the IBM Applied Data Science specialization on Coursera.

The project aims to determine which Los Angeles neighborhoods would best suit a new retiree who loves jazz and wants to relocate to the city.

<h3>Setup</h3>

Import the needed libraries, installing any that are not already available.

In [1]:
import urllib.request
import re
import requests
import pandas as pd
import numpy as np

In [2]:
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if it is already installed
import folium

Solving environment: done

# All requested packages already installed.



<h3>Retrieve Neighborhood List</h3>

Several online sources list the Los Angeles neighborhoods; for this project the most convenient option was the Wikipedia page.

In [3]:
# regex is used to pull out the neighborhood names from the source html of the wiki page containing the list

matcher = re.compile(r'.*?<li><a href="/wiki/.*?" title=".*?">(.*?)</a>')

url = "https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_of_Los_Angeles"

contents = urllib.request.urlopen(url)

# these strings unfortunately also get pulled from the page so we filter them out here

throwout = [ 'Los Angeles Historic Preservation Overlay Zones',
    'Other cities and areas in Los Angeles County',
    'History',
    'Timeline',
    'Transportation',
    'Culture',
    'Landmarks',
    'Historic sites',
    'Skyscrapers',
    'Demographics',
    'Crime',
    'Sports',
    'Media',
    'Music',
    'Notable people',
    'Lists',
    'Flag',
    'Mayor',
    'City Council',
    'Other elected officials',
    'Airport',
    'DWP',
    'Fire Department',
    'Police',
    'Public schools',
    'Libraries',
    'Port',
    'Transportation',
    'Geography of Los Angeles'
]

good = []
for entry in contents:
    match = matcher.match(entry.decode('utf-8'))
    if match:
        good.append(match.group(1))

right = [name for name in good if name not in throwout]

# print out the neighborhood list so we can make sure it's good

for neighborhood in right:
    print(neighborhood)

Angelino Heights
Arleta
Arlington Heights
Arts District
Atwater Village
Baldwin Hills
Baldwin Hills/Crenshaw
Baldwin Village
Baldwin Vista
Beachwood Canyon
Bel Air, Bel-Air or Bel Air Estates
Benedict Canyon
Beverly Crest
Beverly Glen
Beverly Grove
Beverly Hills Post Office
Beverly Park
Beverlywood
Boyle Heights
Brentwood
Brentwood Circle
Brentwood Glen
Broadway-Manchester
Brookside
Bunker Hill
Cahuenga Pass
Canoga Park
Canterbury Knolls
Carthay
Castle Heights
Central-Alameda
Central City
Century City
Chatsworth
Chesterfield Square
Cheviot Hills
Chinatown
Civic Center
Crenshaw
Crestwood Hills
Cypress Park
Del Rey
Downtown
Eagle Rock
East Gate Bel Air
East Hollywood
Echo Park
Edendale
El Sereno
Elysian Heights
Elysian Park
Elysian Valley
Encino
Exposition Park
Faircrest Heights
Fairfax
Fashion District
Filipinotown, Historic
Financial District
Florence
Flower District
Franklin Hills
Gallery Row
Garvanza
Glassell Park
Gramercy Park
Granada Hills
Green Meadows
Griffith Park
Hancock Park
H
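
To see how the pattern above behaves, it can be run offline against a few list items. The sample lines below are made up for illustration and only mimic the general shape of the page's list markup; the live page may differ.

```python
import re

# Same pattern as the notebook: capture the link text of a list item pointing at /wiki/...
matcher = re.compile(r'.*?<li><a href="/wiki/.*?" title=".*?">(.*?)</a>')

# Hypothetical sample lines shaped like the page's list markup (not copied from it)
sample_lines = [
    '<li><a href="/wiki/Arleta,_Los_Angeles" title="Arleta, Los Angeles">Arleta</a></li>',
    '<li><a href="/wiki/History_of_Los_Angeles" title="History of Los Angeles">History</a></li>',
    '<p>This line is not a list item.</p>',
]
throwout = ['History']

names = []
for line in sample_lines:
    match = matcher.match(line)
    # Keep the captured link text unless it is one of the known non-neighborhood entries
    if match and match.group(1) not in throwout:
        names.append(match.group(1))

print(names)
```

Only the first sample line survives: the second matches the pattern but is filtered by the throwout list, and the third does not match at all.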

<h3>Geocode Neighborhood List</h3>

Get latitude and longitude coordinates for the Los Angeles neighborhoods from the Google Maps APIs.

In [4]:
KEY='AIzaSyCE1UVTp7MwQIP8ClUYACxdOKBes3RxzMY'
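
Keys written directly into a notebook are exposed to anyone who can read it. A common alternative, sketched here, is to read them from environment variables. The helper `load_api_key` and the variable name `GOOGLE_MAPS_API_KEY` are assumptions made for this sketch, not names used elsewhere in the notebook.

```python
import os

def load_api_key(var_name):
    """Return the API key stored in the named environment variable."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running this cell")
    return key

# Hypothetical usage; GOOGLE_MAPS_API_KEY is an assumed variable name
# KEY = load_api_key("GOOGLE_MAPS_API_KEY")
```

The same helper could serve the Foursquare credentials, each under its own variable name.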

In [5]:
# If a neighborhood name is found in California, add it to the found list;
# otherwise add it to a notfound list for reference and later research.

found = []
notfound = []

for h in right:
    # Passing the query via params lets requests URL-encode spaces and punctuation in the name
    url = "https://maps.googleapis.com/maps/api/place/findplacefromtext/json"
    params = {"input": h + " CA", "inputtype": "textquery", "fields": "name,geometry", "key": KEY}
    results = requests.get(url, params=params).json()['candidates']
    if len(results) == 0:
        notfound.append(h)
        continue
    found.append([results[0]['name'], results[0]['geometry']['location']['lat'], results[0]['geometry']['location']['lng']])
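
The loop above stores the name returned by the API rather than the name that was queried, and it assumes every response carries a `candidates` list. A slightly more defensive variant is sketched here and checked against hand-written payloads instead of live API calls. The helper name `first_candidate` and both sample payloads are illustrative assumptions.

```python
def first_candidate(query_name, payload):
    """Return [query_name, returned_name, lat, lng] for the first candidate, or None."""
    candidates = payload.get('candidates', [])
    if not candidates:
        return None
    top = candidates[0]
    location = top['geometry']['location']
    # Keeping the queried name next to the returned one makes mismatches easy to spot later
    return [query_name, top['name'], location['lat'], location['lng']]

# Hand-written payloads shaped like the responses parsed above (not real API output)
hit = {'candidates': [{'name': 'Arleta',
                       'geometry': {'location': {'lat': 34.250459, 'lng': -118.433835}}}]}
miss = {'candidates': []}

print(first_candidate('Arleta', hit))
print(first_candidate('Edendale', miss))
```

Carrying both names forward would make it possible to see which queries resolved to a differently named place.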

Print the neighborhoods that could not be geocoded. In practice, this is part of the data cleaning phase of data science: we would research whether the names are right and, if not, add a correction to the code that builds the neighborhood list.

In [6]:
print("NEIGHBORHOODS NOT FOUND")
print("-----------------------")
for x in notfound:
    print(x)

NEIGHBORHOODS NOT FOUND
-----------------------
Broadway-Manchester
Canterbury Knolls
Chesterfield Square
East Gate Bel Air
Edendale
Manchester Square
Spaulding Square
Vermont Knolls
Whitley Heights
Wholesale District
Yucca Corridor


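Convert the neighborhoods that were found to a dataframe and print it. Note that the Neighborhood column holds the place name returned by the API, which can differ from the name that was queried.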

In [7]:
mydf = pd.DataFrame(found, columns=['Neighborhood','Latitude','Longitude'])
print(mydf)

                                         Neighborhood   Latitude   Longitude
0                                    Angelino Heights  34.070289 -118.254796
1                                              Arleta  34.250459 -118.433835
2                                   Arlington Heights  34.042222 -118.318889
3                                       Arts District  34.041175 -118.238043
4                                     Atwater Village  34.117290 -118.261433
5                                       Baldwin Hills  34.006677 -118.350578
6                                            Crenshaw  34.018199 -118.340351
7                                     Baldwin Village  34.015091 -118.347656
8                                       Baldwin Vista  34.013456 -118.362737
9                                    Beachwood Canyon  34.119696 -118.321055
10                        Beverly Hillbillies Mansion  34.087064 -118.442167
11                              Benedict Canyon Drive  34.099571 -118.432332

<h3>Retrieve Venue Data From Foursquare</h3>

Use Foursquare to get the jazz clubs within 4,400 meters (about 2.7 miles) of the center of each neighborhood.

In [8]:
CID='5QUOLFGNYXOXR0VTYM5HVDIX2G1OEKP2WZXBWDPXZ5FSSNHH'
CSECRET='HKMJULKIUAUME1RBAOTYGXHUBZODTR3V0Y3VPT2A4NAE4L04'

In [9]:
def getNearbyVenues(names, latitudes, longitudes):
    radius=4400  # search radius in meters
    LIMIT=50  # maximum number of venues requested per neighborhood
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&intent=browse&v={}&ll={},{}&radius={}&limit={}&categoryId=4bf58dd8d48988d1e7931735'.format(
            CID, 
            CSECRET, 
            20180605, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]["venues"]
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],  
            ) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude']
    
    return nearby_venues

In [10]:
la_venues = getNearbyVenues(names=mydf['Neighborhood'], latitudes=mydf['Latitude'], longitudes=mydf['Longitude'])

In [11]:
la_venues

       Neighborhood  Neighborhood Latitude  Neighborhood Longitude                 Venue  Venue Latitude  Venue Longitude
0  Angelino Heights              34.070289             -118.254796        The Lindy Loft       34.045716      -118.249656
1  Angelino Heights              34.070289             -118.254796  Grand Star Jazz Club       34.065152      -118.237591
2  Angelino Heights              34.070289             -118.254796        Blue Whale Bar       34.049884      -118.242114
3  Angelino Heights              34.070289             -118.254796             AB Studio       34.041696      -118.233934
4  Angelino Heights              34.070289             -118.254796  Fedora At 1st & Hope       34.056396      -118.250769
5  Angelino Heights              34.070289             -118.254796          Sound Forest       34.040374      -118.253639
6  Angelino Heights              34.070289             -118.254796              Musicals       34.040283      -118.248375
7  Angelino Heights              34.070289             -118.254796       Gaspar Jewelers       34.047300      -118.253600
8  Angelino Heights              34.070289             -118.254796         Stones & Gold       34.045966      -118.253961
9  Angelino Heights              34.070289             -118.254796          The Mezz Bar       34.047533      -118.249749
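
A quick way to sanity-check the returned venues is to compute how far each one lies from its neighborhood center. The sketch below applies the haversine formula to the first row of the table above. `haversine_m` is a helper written for this sketch, and the Earth radius is the usual mean-radius approximation.

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two latitude/longitude points."""
    earth_radius_m = 6371000  # mean Earth radius, an approximation
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

# First row of the table above: Angelino Heights center and The Lindy Loft
distance = haversine_m(34.070289, -118.254796, 34.045716, -118.249656)
print(round(distance))
print(distance <= 4400)  # 4400 is the radius set in getNearbyVenues
```

Running the same check over every row would show how the venues are spread within the search radius.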


<h3>Get Neighborhood Jazz Club Counts</h3>

Now get the counts of jazz clubs for each neighborhood into a dataframe.

In [12]:
vendf = la_venues['Neighborhood'].value_counts().rename_axis('Neighborhood').to_frame('Counts')
vendf

                               Counts
Neighborhood
Civic Center                       27
Financial District                 26
Arts District                      24
Hollywood Hills                    20
University Park                    16
Pico-Union                         16
Historic Filipinotown              15
Larchmont Village                  15
Alandele Park at Park La Brea      15
Mt Olympus Dr                      14


<h3>Merge Data For Mapping</h3>

Merge the counts with the dataframe containing latitude and longitude.

In [13]:
newdf = mydf.copy(deep=True)
adf = pd.merge(newdf, vendf, on='Neighborhood', how='inner')
adf

        Neighborhood   Latitude    Longitude  Counts
0   Angelino Heights  34.070289  -118.254796      12
1             Arleta  34.250459  -118.433835       2
2  Arlington Heights  34.042222  -118.318889       9
3      Arts District  34.041175  -118.238043      24
4      Arts District  34.041175  -118.238043      24
5    Atwater Village  34.117290  -118.261433       4
6      Baldwin Hills  34.006677  -118.350578       4
7           Crenshaw  34.018199  -118.340351      14
8           Crenshaw  34.018199  -118.340351      14
9    Baldwin Village  34.015091  -118.347656       7
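
The merged output above repeats some rows (Arts District and Crenshaw each appear twice), which is what a merge produces when a key occurs more than once in the left frame. One possible fix, sketched here on a few hand-copied rows, is to drop duplicate neighborhood names before merging. Whether the repeats should be dropped or traced back to the originally queried names is a judgment call this sketch does not settle.

```python
import pandas as pd

# A few rows hand-copied from the tables above; Crenshaw appears twice on the left side
coords = pd.DataFrame(
    [['Arts District', 34.041175, -118.238043],
     ['Crenshaw', 34.018199, -118.340351],
     ['Crenshaw', 34.018199, -118.340351]],
    columns=['Neighborhood', 'Latitude', 'Longitude'])
counts = pd.DataFrame({'Neighborhood': ['Arts District', 'Crenshaw'], 'Counts': [24, 14]})

# Keep one row per neighborhood name, then merge as before
deduped = coords.drop_duplicates(subset='Neighborhood')
merged = pd.merge(deduped, counts, on='Neighborhood', how='inner')
print(merged)
```

With the duplicates removed, each neighborhood contributes a single point to the clustering step.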


<h3>Cluster</h3>

Produce three clusters using latitude, longitude, and counts as the features.

In [14]:
# set number of clusters
kclusters = 3

la_grouped_clustering = adf.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(la_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([2, 0, 2, 2, 2, 0, 0, 2, 2, 0, 2, 2, 0, 0, 0, 2, 0, 0, 2, 2, 0, 0,
       2, 2, 0, 2, 0, 0, 2, 2, 0, 0, 2, 2, 0, 0, 2, 0, 0, 2, 0, 0, 0, 2,
       2, 2, 2, 0, 2, 2, 2, 0, 0, 0, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 0, 0,
       0, 2, 0, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 0, 2,
       0, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 2, 1, 0, 0, 0,
       0, 0, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 2, 2, 0, 0],
      dtype=int32)
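
K-means relies on Euclidean distance, so features with larger numeric ranges carry more weight in the result. As an optional variation, not what the notebook does, the features could be standardized first. The sketch below uses a few rows copied from the merged table; `StandardScaler` and the explicit `n_init=10` are choices made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows copied from the merged table: latitude, longitude, jazz venue count
features = np.array([
    [34.070289, -118.254796, 12],
    [34.250459, -118.433835, 2],
    [34.042222, -118.318889, 9],
    [34.041175, -118.238043, 24],
    [34.117290, -118.261433, 4],
    [34.006677, -118.350578, 4],
])

# Rescale each column to zero mean and unit variance so no single feature dominates
scaled = StandardScaler().fit_transform(features)

# Cluster the standardized features with the same number of clusters as the notebook
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(scaled)
print(labels)
```

Whether scaling improves the grouping depends on how much weight location should have relative to the venue count.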

Create a new dataframe containing the counts and cluster labels.

In [15]:
la_merged = adf.copy()  # copy so that adding the labels column does not modify adf

# add clustering labels
la_merged['Cluster Labels'] = kmeans.labels_

la_merged

        Neighborhood   Latitude    Longitude  Counts  Cluster Labels
0   Angelino Heights  34.070289  -118.254796      12               2
1             Arleta  34.250459  -118.433835       2               0
2  Arlington Heights  34.042222  -118.318889       9               2
3      Arts District  34.041175  -118.238043      24               2
4      Arts District  34.041175  -118.238043      24               2
5    Atwater Village  34.117290  -118.261433       4               0
6      Baldwin Hills  34.006677  -118.350578       4               0
7           Crenshaw  34.018199  -118.340351      14               2
8           Crenshaw  34.018199  -118.340351      14               2
9    Baldwin Village  34.015091  -118.347656       7               0


<h3>Create Map</h3>

Map the neighborhood centers, color-coded by cluster.

In [16]:
map_clusters = folium.Map(location=[34.0522342, -118.2436849], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(la_merged['Latitude'], la_merged['Longitude'], la_merged['Neighborhood'], la_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<h3>Explore Clusters</h3>

Examine the clusters to determine which area would best suit a newly retired jazz lover.

<h4>Cluster 0</h4>

In [17]:
la_merged.loc[la_merged['Cluster Labels'] == 0, la_merged.columns[[0,1,2,3]]]

                    Neighborhood   Latitude    Longitude  Counts
1                         Arleta  34.250459  -118.433835       2
5                Atwater Village  34.117290  -118.261433       4
6                  Baldwin Hills  34.006677  -118.350578       4
9                Baldwin Village  34.015091  -118.347656       7
12         Benedict Canyon Drive  34.099571  -118.432332       1
13                 Beverly Hills  34.073620  -118.400356       8
14                  Beverly Glen  34.107716  -118.442596       1
16  United States Postal Service  34.073130  -118.394228       7
17            North Beverly Park  34.117792  -118.417771       2
20              Brentwood Circle  34.071783  -118.471068       2


<h4>Cluster 1</h4>

In [18]:
la_merged.loc[la_merged['Cluster Labels'] == 1, la_merged.columns[[0,1,2,3]]]

        Neighborhood   Latitude   Longitude  Counts
128  University Park  35.265152  -80.861303      16
134   Vermont Square  43.668359  -79.412681      13
140         Westdale  43.265136  -79.906333       1
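
The coordinates in this cluster sit far from Los Angeles (longitudes near -80 rather than -118), which suggests these names were geocoded to places elsewhere. A rough bounding-box check, sketched below, can flag such rows before clustering. The box limits are approximate values assumed for this sketch, not official city boundaries.

```python
# Approximate bounding box around Los Angeles (assumed values for illustration)
LAT_MIN, LAT_MAX = 33.6, 34.4
LNG_MIN, LNG_MAX = -118.7, -118.1

def in_la_box(lat, lng):
    """True when the point falls inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LNG_MIN <= lng <= LNG_MAX

# Rows copied from the tables above: (name, latitude, longitude)
rows = [
    ('Arleta', 34.250459, -118.433835),
    ('University Park', 35.265152, -80.861303),
    ('Vermont Square', 43.668359, -79.412681),
    ('Westdale', 43.265136, -79.906333),
]

kept = [name for name, lat, lng in rows if in_la_box(lat, lng)]
flagged = [name for name, lat, lng in rows if not in_la_box(lat, lng)]
print(kept)
print(flagged)
```

Flagged names could be moved to the not-found list for manual research instead of being clustered.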


<h4>Cluster 2</h4>

In [19]:
la_merged.loc[la_merged['Cluster Labels'] == 2, la_merged.columns[[0,1,2,3]]]

         Neighborhood   Latitude    Longitude  Counts
0    Angelino Heights  34.070289  -118.254796      12
2   Arlington Heights  34.042222  -118.318889       9
3       Arts District  34.041175  -118.238043      24
4       Arts District  34.041175  -118.238043      24
7            Crenshaw  34.018199  -118.340351      14
8            Crenshaw  34.018199  -118.340351      14
10      Baldwin Vista  34.013456  -118.362737       9
11   Beachwood Canyon  34.119696  -118.321055      11
15      Beverly Grove  34.073473  -118.376572      10
18        Beverlywood  34.049413  -118.395232      11
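
One way to compare the clusters numerically is to summarize the venue counts per cluster label. The sketch below uses a handful of rows copied from the cluster tables above, so the figures it prints describe only this sample, not the full dataset.

```python
import pandas as pd

# Sample rows copied from the cluster tables: (neighborhood, count, cluster label)
sample = pd.DataFrame(
    [('Arleta', 2, 0), ('Atwater Village', 4, 0), ('Baldwin Hills', 4, 0),
     ('University Park', 16, 1), ('Vermont Square', 13, 1),
     ('Angelino Heights', 12, 2), ('Arts District', 24, 2), ('Crenshaw', 14, 2)],
    columns=['Neighborhood', 'Counts', 'Cluster Labels'])

# Number of neighborhoods and average venue count for each cluster in the sample
summary = sample.groupby('Cluster Labels')['Counts'].agg(['count', 'mean'])
print(summary)
```

Applied to the full `la_merged` dataframe, the same grouping would give a per-cluster average to weigh alongside the map.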
