@author: Fernando Trevisan Donati

Coursera Applied Data Science Capstone

In [1]:
import requests
import lxml
import pandas as pd
import numpy as np
pd.options.display.max_rows = 999

# Part 1

### The following cells are for part 1 of the Week 3 Assignment

#### Getting data

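First, we get the data from the Wikipedia page using pandas' `read_html` function. It returns a list with one dataframe per table found on the page; the postal code table is the first one.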

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df = pd.read_html(url)[0]

Here's our unprocessed dataframe:

In [3]:
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


#### Processing data

Now we need to get rid of rows with "Not assigned" boroughs:

In [4]:
df = df[df["Borough"]!="Not assigned"]

And here's the result:

In [5]:
df

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
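Because the filtered dataframe is modified again in a later step, it can help to take an explicit copy when filtering. The following is a minimal sketch on invented rows (not the scraped data), assuming the same column names:

```python
import pandas as pd

# Toy stand-in for the scraped table (values are illustrative only)
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# The boolean mask keeps rows with a real borough; .copy() makes the result
# an independent dataframe, so later column assignments do not act on a slice
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Assigning to a column of the copy leaves the original untouched
assigned["Neighborhood"] = assigned["Neighborhood"].str.upper()

print(assigned)
```

The original `raw` frame still holds all three rows, so the filter can be re-run or adjusted without scraping the page again.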


Next, we group rows that share a postal code and join their neighborhoods with a comma. Note that the result is only displayed here; it is not assigned back to `df`:

In [6]:
df.groupby(["Postal Code", "Borough"])["Neighborhood"].apply(', '.join).reset_index()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
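To illustrate what the grouping does when a postal code really is split across several rows, here is a small sketch on invented rows. The names `rows` and `grouped` are hypothetical, and `sort=False` is an optional choice that keeps the original row order:

```python
import pandas as pd

# Invented rows: postal code M5A appears twice, once per neighborhood
rows = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M1B"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Malvern"],
})

# Group by postal code and borough, then join the neighborhood names;
# the result must be assigned to a name to be kept
grouped = (
    rows.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
)

print(grouped)
```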


Finally, for every row whose neighborhood is "Not assigned", we set the neighborhood to the borough name:

In [7]:
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning on the filtered df
df = df.assign(Neighborhood=np.where(df["Neighborhood"] == "Not assigned", df["Borough"], df["Neighborhood"]))
df


Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
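As an illustration of the fallback logic, the sketch below applies `np.where` to two invented rows. `np.where(condition, a, b)` picks from `a` where the condition holds and from `b` elsewhere, element by element:

```python
import numpy as np
import pandas as pd

# Invented rows: the first one has no neighborhood assigned
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Use the borough name wherever the neighborhood is missing
toy["Neighborhood"] = np.where(
    toy["Neighborhood"] == "Not assigned",
    toy["Borough"],
    toy["Neighborhood"],
)

print(toy)
```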


#### Final dataframe shape

This is the shape of our final dataframe (rows, columns):

In [8]:
df.shape

(103, 3)
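Besides the shape, a few quick checks can confirm that the cleaning steps did what they were meant to. The helper below is a hypothetical sketch (`check_clean` is not part of the original notebook) and is run here on invented rows:

```python
import pandas as pd

def check_clean(frame):
    """Run simple sanity checks on a cleaned postal-code table."""
    return {
        "no_unassigned_borough": not (frame["Borough"] == "Not assigned").any(),
        "no_unassigned_neighborhood": not (frame["Neighborhood"] == "Not assigned").any(),
        "unique_postal_codes": bool(frame["Postal Code"].is_unique),
    }

# Invented rows, only to exercise the checks
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

report = check_clean(toy)
print(report)
```

Calling `check_clean(df)` on the real dataframe would work the same way, assuming the column names shown above.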

# Part 2

### The following cells are for part 2 of the Week 3 Assignment

Since previous attempts to call the geocoder were unsuccessful, let's load the CSV file containing the geospatial data instead.

In [9]:
df_latlng = pd.read_csv("./Geospatial_Coordinates.csv")

Let's check the column names

In [10]:
df_latlng.columns

Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')

In [11]:
df.columns

Index(['Postal Code', 'Borough', 'Neighborhood'], dtype='object')

Ok! Both dataframes share the `Postal Code` column, so we can merge on it easily:

In [12]:
df = df.merge(df_latlng, on='Postal Code')

And here's the result:

In [13]:
df

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
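A default merge silently drops postal codes that have no coordinates. As a hedged sketch, the options `how="left"`, `validate` and `indicator` of `DataFrame.merge` can make such gaps visible. The frames below are toy data, and the postal code `M9Z` is invented to simulate a missing entry:

```python
import pandas as pd

# Toy frames; "M9Z" is a made-up code with no coordinates
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every postal code, validate checks the key is unique on
# both sides, and indicator adds a "_merge" column describing each row's origin
merged = left.merge(coords, on="Postal Code", how="left",
                    validate="one_to_one", indicator=True)

# Postal codes that found no match in the coordinates table
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```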


# Part 3

### The following cells are for part 3 of the Week 3 Assignment

In [14]:
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium

In [15]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="coursera_capstone")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


First, let's check on the map where the neighborhoods are located:

In [16]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

The next step is to gather information about our neighborhoods. Let's call the Foursquare API, using a function that loops through all neighborhoods and fetches venue information for each of them.

In [17]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of shared notebooks)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [18]:
LIMIT = 100
radius = 500

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [19]:
df_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )
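The request URL in `getNearbyVenues` is assembled by string formatting. As an alternative sketch, the query string can be built from a dictionary with the standard library's `urlencode`, which takes care of escaping. The function name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used above:

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL for one coordinate pair (no request is sent)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",   # latitude,longitude as one parameter
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes special characters (the comma in "ll" becomes %2C)
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials, for illustration only
url = build_explore_url("ID", "SECRET", "20180605", 43.6534817, -79.3839347)
print(url)
```

Keeping the URL construction in its own function also makes it possible to inspect a request before spending API quota on it.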

In [20]:
# one hot encoding
df_venues_onehot = pd.get_dummies(df_venues[['Venue Category']], prefix="", prefix_sep="")
df_venues_onehot['Neighborhood'] = df_venues['Neighborhood'] 

df_venues_mean = df_venues_onehot.groupby('Neighborhood').mean().reset_index()
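To make the one-hot encoding and averaging step concrete, here is a minimal sketch on three invented venues. After grouping, each cell holds the share of a neighborhood's venues that fall into that category:

```python
import pandas as pd

# Invented venues: neighborhood "A" has two venues, "B" has one
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue Category": ["Cafe", "Park", "Cafe"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)

# insert() raises if a category column is itself called "Neighborhood",
# which surfaces a name collision instead of silently overwriting it
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the indicator columns = category frequency per neighborhood
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```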

Now, let's group them into 10 clusters. We drop the `Neighborhood` column before fitting because k-means needs numeric features, and the neighborhood name is only an identifier that shouldn't influence the similarity between neighborhoods.

In [21]:
# set number of clusters
kclusters = 10

df_clustering = df_venues_mean.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_clustering)

In [22]:
df_clustered = df_venues_mean.copy()
df_clustered.insert(0, 'Cluster Labels', kmeans.labels_)
df_clustered = df_clustered.join(df.set_index('Neighborhood'), on='Neighborhood')
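One thing to keep in mind with this join: if the same neighborhood name appears under several postal codes, each match produces its own row, so the joined frame can have more rows than the clustered one. A small sketch with illustrative values (the frames `profiles` and `codes` are hypothetical):

```python
import pandas as pd

# One venue profile (and cluster label) per neighborhood name
profiles = pd.DataFrame({
    "Neighborhood": ["Downsview", "Parkwoods"],
    "Cluster Labels": [0, 1],
})

# Illustrative postal-code table where one name occurs twice
codes = pd.DataFrame({
    "Postal Code": ["M3K", "M3L", "M3A"],
    "Neighborhood": ["Downsview", "Downsview", "Parkwoods"],
})

# Joining on a non-unique key repeats the left row once per match
joined = profiles.join(codes.set_index("Neighborhood"), on="Neighborhood")
print(len(profiles), len(joined))
```

For a map this is usually what we want, since every postal code gets a marker, but counts per cluster then refer to postal codes rather than to distinct neighborhood names.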

Finally, we create a map to show what those clusters look like on the real city :D

In [23]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one hex color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_clustered['Latitude'], df_clustered['Longitude'], df_clustered['Neighborhood'], df_clustered['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
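Since the color setup is needed every time the number of clusters changes, it could be wrapped in a small helper. This is only a sketch; `cluster_palette` is a hypothetical name, built on the same `cm.rainbow` and `colors.rgb2hex` calls used in the notebook:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

def cluster_palette(n_clusters):
    """Return one hex color string per cluster label 0..n_clusters-1."""
    # Sample the rainbow colormap at evenly spaced positions
    samples = cm.rainbow(np.linspace(0, 1, n_clusters))
    return [colors.rgb2hex(rgba) for rgba in samples]

palette = cluster_palette(5)
print(palette)
```

With such a helper, `palette[cluster]` gives the marker color for a given cluster label.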

In [24]:
df_clustered[["Cluster Labels", "Neighborhood"]].groupby('Cluster Labels').count()

Cluster Labels,Neighborhood
0,5
1,80
2,1
3,3
4,2
5,1
6,2
7,1
8,1
9,1


As we can see, we got one big cluster (80 neighborhoods) and several much smaller ones. Let's try a larger k to see if we can break up that big cluster.

In [25]:
# set number of clusters
kclusters = 15

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_clustering)

In [26]:
df_clustered = df_venues_mean.copy()
df_clustered.insert(0, 'Cluster Labels', kmeans.labels_)
df_clustered = df_clustered.join(df.set_index('Neighborhood'), on='Neighborhood')

Let's see what it looks like now:

In [27]:
df_clustered[["Cluster Labels", "Neighborhood"]].groupby('Cluster Labels').count()

Cluster Labels,Neighborhood
0,2
1,1
2,72
3,2
4,1
5,3
6,8
7,1
8,1
9,1


Now to the map again:

In [28]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one hex color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_clustered['Latitude'], df_clustered['Longitude'], df_clustered['Neighborhood'], df_clustered['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The big cluster is still there, so it seems these neighborhoods really do have similar venue profiles. Let's "zoom out" and use fewer clusters to simplify our visualization.

In [29]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_clustering)

In [30]:
df_clustered = df_venues_mean.copy()
df_clustered.insert(0, 'Cluster Labels', kmeans.labels_)
df_clustered = df_clustered.join(df.set_index('Neighborhood'), on='Neighborhood')

In [31]:
df_clustered[["Cluster Labels", "Neighborhood"]].groupby('Cluster Labels').count()

Cluster Labels,Neighborhood
0,8
1,82
2,1
3,5
4,1
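The values of k above were compared by looking at cluster sizes. As a complementary sketch (not part of the original analysis), the silhouette score from scikit-learn offers one numeric way to compare several values of k. The data below is synthetic: three well-separated blobs generated with a fixed seed, so it only illustrates the mechanics:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: 20 points around each of three distant centers
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Higher silhouette means tighter, better separated clusters
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real venue frequencies the scores are likely to be much closer together than on this toy data, so they are best read alongside the cluster sizes rather than as a replacement for them.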


In [32]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one hex color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_clustered['Latitude'], df_clustered['Longitude'], df_clustered['Neighborhood'], df_clustered['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

With fewer clusters, the same structure appears, and the different colors are much easier to tell apart. :D