# Applied Data Science Capstone Project

This notebook is used for the capstone project of the "Applied Data Science Capstone" course on
[Coursera](https://coursera.org).

In [1]:
import pandas as pd
import numpy as np

print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## Week 03 assignment: Segmenting and Clustering Neighborhoods in Toronto

### Preparing the dataframe

In this part of the notebook, we extract data from a Wikipedia page, wrangle and clean it up.

First, we download the HTML page and install the `lxml` package, which `pd.read_html` uses to parse HTML.

In [2]:
!pip install --user lxml
!wget https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M -O wikipage.html
import lxml
# You may need to restart the session for lxml to load correctly

--2019-10-29 15:20:52--  https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Resolving en.wikipedia.org (en.wikipedia.org)... 208.80.154.224, 2620:0:861:ed1a::1
Connecting to en.wikipedia.org (en.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 79018 (77K) [text/html]
Saving to: ‘wikipage.html’


2019-10-29 15:20:52 (1.04 MB/s) - ‘wikipage.html’ saved [79018/79018]



We'll use `pd.read_html` to extract all tables from the HTML page.

Then, we select the table we want by matching its column headings.

In [3]:
# Parse all tables in Wikipedia page
tables = pd.read_html('wikipage.html', header=0)
headings = ['Postcode','Borough','Neighbourhood']
for table in tables:
    # If all headings match, this is the wanted table.
    # Comparing plain lists also copes with tables that have a different number of columns.
    if list(table.columns) == headings:
        break
table.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
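
The heading check can also be wrapped in a small helper that returns `None` when nothing matches, so a changed page layout does not silently leave the wrong table selected. This is only a sketch: `find_table` is a hypothetical name, and the toy frames stand in for the list returned by `pd.read_html`.

```python
import pandas as pd

def find_table(tables, headings):
    """Return the first DataFrame whose column names equal `headings`, else None."""
    for candidate in tables:
        # Compare as plain lists so tables with other column counts are simply skipped
        if list(candidate.columns) == list(headings):
            return candidate
    return None

# Toy stand-ins for the tables parsed from the page
tables = [
    pd.DataFrame({'A': [1], 'B': [2]}),
    pd.DataFrame({'Postcode': ['M3A'], 'Borough': ['North York'], 'Neighbourhood': ['Parkwoods']}),
]
wanted = find_table(tables, ['Postcode', 'Borough', 'Neighbourhood'])
missing = find_table(tables, ['Postcode', 'City'])
```

Checking the result for `None` before continuing makes a missing table an explicit failure rather than a confusing error later on.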


Next, let's drop all rows whose 'Borough' entry says 'Not assigned'.

We assume that missing values are consistently represented by the string 'Not assigned'.

> Also note that the index is no longer contiguous after filtering (it now starts at 2).

In [4]:
table = table[table['Borough'] != 'Not assigned']
table.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


We need to group table entries by 'Postcode' (and 'Borough') and join the Neighbourhood names of each group with commas.

We also need to rebuild the index with `.reset_index()`.

In [5]:
table = pd.DataFrame(table.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join)).reset_index()
table.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
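
To make the group-and-join step concrete, here is a minimal sketch on three rows taken from the table above. It uses `as_index=False`, which is an alternative to calling `.reset_index()` afterwards; the names `toy` and `grouped` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# as_index=False keeps the group keys as ordinary columns
grouped = toy.groupby(['Postcode', 'Borough'], as_index=False).agg({'Neighbourhood': ','.join})
```

The two `M5A` rows collapse into one row whose Neighbourhood cell holds both names.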


At this stage, every row has a Borough. So, wherever `Neighbourhood` is 'Not assigned', we replace it with the row's `Borough` value.

In [6]:
# Vectorised replacement: covers every row (including row 0) and avoids chained assignment
mask = table['Neighbourhood'] == 'Not assigned'
table.loc[mask, 'Neighbourhood'] = table.loc[mask, 'Borough']
any(table['Neighbourhood'] == 'Not assigned')

False

The dataframe has the following dimensions:

In [7]:
table.shape

(103, 3)

### Getting the geospatial data for Toronto

We have to merge the latitude/longitude information with the table we already have.

There's a **CSV file** with the coordinates, so we'll use it instead of struggling with unreliable geocoding packages.

In [8]:
csv_url = 'https://cocl.us/Geospatial_data'
geo = pd.read_csv(csv_url)
# Fix column names so Postcode is the same in table and geo
geo.columns = ['Postcode','Latitude','Longitude']
geo.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# Merge the dataframes
df = pd.merge(table, geo, on='Postcode')
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
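
An inner merge silently drops postcodes that have no coordinates. A hedged sketch of how one could check for that with pandas' `how`, `indicator` and `validate` options follows; the `M9Z` row is made up purely to trigger the unmatched case.

```python
import pandas as pd

table = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up postcode
                      'Borough': ['Scarborough', 'Scarborough', 'Hypothetical']})
geo = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                    'Latitude': [43.806686, 43.784535],
                    'Longitude': [-79.194353, -79.160497]})

# how='left' keeps every row of `table`; indicator=True adds a `_merge` column;
# validate='one_to_one' raises if a postcode is duplicated on either side
checked = pd.merge(table, geo, on='Postcode', how='left',
                   indicator=True, validate='one_to_one')
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
```

An empty `unmatched` list means the plain inner merge loses no rows.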


### Analyzing the Toronto region

We need the Lat/Long coordinates of Toronto for map creation

In [10]:
import folium
!pip install --user geopy
from geopy.geocoders import Nominatim
address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
# Retry a bounded number of times; the geocoding service occasionally times out
location = None
for attempt in range(5):
    try:
        location = geolocator.geocode(address)
        break
    except Exception:
        print("Couldn't get data, retrying")
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

Couldn't get data, retrying
The geographical coordinates of Toronto are 43.653963, -79.387207.
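
The retry idea generalises to any flaky call. Below is a minimal sketch of a bounded retry helper with exponential backoff; `call_with_retries` and `flaky_geocode` are hypothetical names, and the flaky function only simulates a service that fails twice before answering.

```python
import time

def call_with_retries(func, max_attempts=5, base_delay=0.0):
    """Call `func` until it succeeds or `max_attempts` is reached."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as error:  # narrow this to the service's own exceptions in real use
            last_error = error
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError('all {} attempts failed'.format(max_attempts)) from last_error

# Simulated flaky service: fails twice, then returns a (lat, lng) pair
calls = {'count': 0}

def flaky_geocode():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated timeout')
    return (43.653963, -79.387207)

coords = call_with_retries(flaky_geocode)
```

Bounding the attempts means a persistent outage ends in a clear error instead of an endless loop.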


Let's get an overview of how the neighbourhoods in the Toronto boroughs are distributed.

In [11]:
# create map of Toronto using latitude and longitude values
tor_data = df[df['Borough'].str.contains('Toronto')].reset_index(drop=True)
tor_data.head()
map_tor = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(tor_data['Latitude'], tor_data['Longitude'],tor_data['Borough'], tor_data['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tor)  
    
map_tor

We'll use the Foursquare API to explore the venues around each neighbourhood.

In [12]:
CLIENT_ID = 'UPZZT55FFCTLPXAISLSNW02U2Q0W2XHTQLWGTQXBPR2VWH4S' # your Foursquare ID
CLIENT_SECRET = 'IFBCAIB22Q3K3GELGKJI1D4HISCIK5YOWZYQLLW1CIFDXXTN' # your Foursquare Secret
VERSION = '20191029'
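
Credentials are easier to rotate and to keep out of shared notebooks when they are read from the environment. The sketch below assumes two environment variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are our own choice and not something Foursquare prescribes.

```python
import os

def load_credentials(env=os.environ):
    """Read API credentials from a mapping of environment variables."""
    try:
        # Variable names are an assumption made for this sketch
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('missing environment variable: {}'.format(missing)) from None

# Demonstration with an explicit mapping; in a notebook, call load_credentials() with no arguments
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
client_id, client_secret = load_credentials(demo_env)
```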

We'll extend the locations dataframe by looking for nearby venues for each Neighbourhood

In [13]:
import requests
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=20):
    """ A function to return nearby venues Dataframe for provided locations"""
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
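
The function above indexes straight into the JSON response, so a venue without categories or a response without groups would raise an error. A defensive variant of the parsing step could look like the sketch below; `extract_venues` is a hypothetical helper, and the mock payload only mirrors the keys that `getNearbyVenues` reads, with made-up values.

```python
def extract_venues(payload, neighbourhood):
    """Flatten an explore-style payload into (neighbourhood, venue, lat, lng, category) tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use a placeholder instead of raising IndexError on an empty category list
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((neighbourhood, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Mock payload with made-up values
mock = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe', 'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Uncategorised Spot', 'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]}]}}
rows = extract_venues(mock, 'Berczy Park')
empty = extract_venues({'response': {}}, 'Berczy Park')
```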

In [14]:
tor_venues = getNearbyVenues(names=tor_data['Neighbourhood'],
                                   latitudes=tor_data['Latitude'],
                                   longitudes=tor_data['Longitude']
                            )

The Beaches
The Danforth West,Riverdale
The Beaches West,India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park,Summerhill East
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
Rosedale
Cabbagetown,St. James Town
Church and Wellesley
Harbourfront,Regent Park
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Roselawn
Forest Hill North,Forest Hill West
The Annex,North Midtown,Yorkville
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place,Underground city
Christie
Dovercourt Village,Dufferin
Little Portugal,Trinity
Brockton,Exhibition Place,Parkdale Village
High Park,The Junction South
Parkdale,Roncesvall

In [15]:
# Investigate how many venue categories we got!
print('There are {} categories.'.format(len(tor_venues['Venue Category'].unique())))

There are 163 categories.


Next, we'll convert the categorical venue data to binary columns using one-hot encoding.

In [16]:
# one hot encoding
tor_onehot = pd.get_dummies(tor_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
tor_onehot['Neighbourhood'] = tor_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [tor_onehot.columns[-1]] + list(tor_onehot.columns[:-1])
tor_onehot = tor_onehot[fixed_columns]

tor_onehot.head()

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Art Gallery,Arts & Crafts Store,...,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West,Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
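
To see what the encoding yields once it is averaged per neighbourhood, here is a small sketch on toy data (the venue counts are made up). The mean of a 0/1 column within a group is simply the share of that group's venues falling into the category.

```python
import pandas as pd

# Toy data: four venues in one neighbourhood, one venue in another
venues = pd.DataFrame({
    'Neighbourhood': ['The Beaches', 'The Beaches', 'The Beaches', 'The Beaches', 'Rosedale'],
    'Venue Category': ['Trail', 'Pub', 'Pub', 'Yoga Studio', 'Park'],
})

# dtype=int forces 0/1 columns regardless of the pandas version's default
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# Per-neighbourhood frequency of each category
freq = onehot.groupby('Neighbourhood').mean()
```

Here `freq.loc['The Beaches', 'Pub']` is 0.5, because two of the four toy venues are pubs.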


#### Let's sort the top 10 most common venues per neighbourhood by frequency

In [17]:
tor_grouped = tor_onehot.groupby('Neighbourhood').mean().reset_index()
num_top_venues = 10

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = tor_grouped['Neighbourhood']

for ind in np.arange(tor_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(tor_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Asian Restaurant,Speakeasy,Bar,Pizza Place,Plaza,Steakhouse,Hotel,Café,Food Court,Concert Hall
1,Berczy Park,Seafood Restaurant,Beer Bar,Farmers Market,Tea Room,Bakery,Liquor Store,Steakhouse,Breakfast Spot,Concert Hall,Fountain
2,"Brockton,Exhibition Place,Parkdale Village",Coffee Shop,Café,Breakfast Spot,Yoga Studio,Bakery,Gym,Furniture / Home Store,Italian Restaurant,Climbing Gym,Pet Store
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Yoga Studio,Auto Workshop,Comic Shop,Pizza Place,Restaurant,Burrito Place,Brewery,Skate Park,Smoke Shop
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Airport Service,Airport Lounge,Airport Terminal,Airport,Harbor / Marina,Coffee Shop,Sculpture Garden,Boat or Ferry,Bar,Boutique
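
The column labels rely on a three-item suffix list plus a fallback, which is enough for ten columns. A more general way to build ordinal labels might look like the sketch below; `ordinal` is a hypothetical helper, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # Numbers whose last two digits are 10..20 take 'th' (11th, 12th, 13th, 113th)
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighbourhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
```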


### Clustering the Toronto neighbourhoods

We'll group the neighbourhoods into 5 clusters with k-means and add the cluster labels to the final dataframe.

In [18]:
from sklearn.cluster import KMeans
kclusters = 5 # Number of clusters
# Drop the Neighbourhood column so only numeric features remain
tor_grouped_clustering = tor_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(tor_grouped_clustering)

# add clustering labels to dataframe
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

tor_merged = tor_data

# join the cluster labels and top venues onto tor_data, which holds latitude/longitude for each neighbourhood
tor_merged = tor_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

tor_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Trail,Health Food Store,Pub,Neighborhood,Yoga Studio,Cuban Restaurant,Eastern European Restaurant,Dog Run,Discount Store,Diner
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,1,Greek Restaurant,Ice Cream Shop,Italian Restaurant,Yoga Studio,Cosmetics Shop,Pizza Place,Pub,Restaurant,Dessert Shop,Bubble Tea Shop
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,1,Park,Coffee Shop,Burger Joint,Brewery,Burrito Place,Liquor Store,Italian Restaurant,Intersection,Pub,Fast Food Restaurant
3,M4M,East Toronto,Studio District,43.659526,-79.340923,4,Coffee Shop,Café,Bakery,Bookstore,Ice Cream Shop,Gay Bar,Middle Eastern Restaurant,Cheese Shop,Sandwich Place,Seafood Restaurant
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,1,Park,Gym / Fitness Center,Swim School,Bus Line,Deli / Bodega,Ethiopian Restaurant,Eastern European Restaurant,Dog Run,Discount Store,Diner
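
To make explicit what the clustering step does, here is a plain NumPy sketch of Lloyd's algorithm, the iteration behind k-means. It is not scikit-learn's implementation (which adds smarter initialisation and multiple restarts); `lloyd_kmeans` is a hypothetical name and the six 2-D points are made up.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Alternate between assigning points to centroids and recomputing the centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct rows of X
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centroid, shape (n_points, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centroids = lloyd_kmeans(X, k=2)
```

In the notebook, each row of `tor_grouped_clustering` plays the role of a point, with one coordinate per venue category.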


Finally, we create a map of these clusters, colouring each marker according to the cluster it
belongs to.

In [19]:
import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tor_merged['Latitude'], tor_merged['Longitude'], tor_merged['Neighbourhood'], tor_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
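
Any list of distinct hex colours, one per cluster label, works for the markers. As an alternative sketch that needs only the standard library (it is not the matplotlib-based scheme used above), evenly spaced hues can be generated with `colorsys`; `cluster_palette` is a hypothetical helper.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters evenly spaced hues as '#rrggbb' strings."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(5)
# Index directly by label: cluster 0 -> palette[0], ..., cluster 4 -> palette[4]
colour_for = {label: palette[label] for label in range(5)}
```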