# Week 3 Assignment


***

## Section 1: Scraping and cleaning the data.

In the first section, we will use the requests and BeautifulSoup libraries to scrape the content from the Wikipedia article.

First, the standard imports:

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

Next, the requests library is used to download the Wikipedia article, and the article is then parsed with the BeautifulSoup library.

In [3]:
r = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
r.raise_for_status()

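# Name the parser explicitly so the result does not depend on which parsers are installed.
wiki_soup = BeautifulSoup(r.content, "html.parser")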

Now, using the BeautifulSoup library, we locate the table that lists the postal codes and the boroughs/neighborhoods they correspond to.
Note that `find("tbody")` returns only the first match, so we are assuming that this table is the first table present in the HTML document.

In [4]:
table = wiki_soup.find("tbody")
rowList = []
columns = ["Postal Code", "Borough", "Neighborhood"]

# We want to skip the first row of the table, as it only contains the column headers.
for row in table.find_all("tr")[1:]:
    rowList.append(dict(zip(columns, row.stripped_strings)))
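
One detail of the row parsing above is worth spelling out: `zip` stops at the shorter of its inputs. The following is a minimal sketch using made-up lists of cell strings in place of `row.stripped_strings` (the sample rows and the `row_to_record` helper are hypothetical, not part of the scraped data):

```python
columns = ["Postal Code", "Borough", "Neighborhood"]

# Hypothetical cell strings, standing in for row.stripped_strings.
complete_row = ["M3A", "North York", "Parkwoods"]
split_row = ["M5A", "Downtown Toronto", "Regent Park", ",", "Harbourfront"]
short_row = ["M1A", "Not assigned"]

def row_to_record(strings):
    """Pair column names with cell strings; zip stops at the shorter input."""
    return dict(zip(columns, strings))

print(row_to_record(complete_row))
print(row_to_record(split_row))   # strings after the third one are dropped
print(row_to_record(short_row))   # the "Neighborhood" key is missing entirely
```

If a cell were ever split into several strings (for example, by links inside it), only the first string would be kept, so it is worth spot-checking the resulting dataframe against the article.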

The table is then converted to a pandas dataframe.

In [5]:
df = pd.DataFrame(rowList, columns=columns)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Now, we need to filter out the rows of the table in which the Borough value is "Not assigned".

In [6]:
boroughFilter = df['Borough'] != "Not assigned"
df = df[boroughFilter].reset_index(drop=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


We also need to check whether any rows of the dataframe have an assigned Borough but a "Not assigned" Neighborhood. For any such rows, we set the neighborhood to be the same as the borough.

In [7]:
neighborhoodFilter = df['Neighborhood'] != "Not assigned"
if neighborhoodFilter.all():
    print("There are NO \"Not assigned\" neighborhoods in the data!")
else:
    # The filter marks the assigned rows, so count its complement.
    print("There are {} \"Not assigned\" neighborhoods in the data".format((~neighborhoodFilter).sum()))
    # Use the borough name as the neighborhood for those rows.
    df.loc[~neighborhoodFilter, 'Neighborhood'] = df.loc[~neighborhoodFilter, 'Borough']


There are NO "Not assigned" neighborhoods in the data!
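
Since the scraped data contains no such rows, the replacement never actually runs here. As an illustration only, this is a minimal sketch of how it could look on a small, made-up table (the `sample` dataframe and its values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the scraped table, with one unassigned neighborhood.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Rows that have a borough but no neighborhood.
unassigned = sample["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only.
sample.loc[unassigned, "Neighborhood"] = sample.loc[unassigned, "Borough"]
print(sample)
```

Using `.loc` with a boolean mask changes only the matching rows and leaves the others untouched.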


Finally, let's check the shape of our resulting data set:

In [8]:
print("The data set has", df.shape[0], "rows and", df.shape[1], "columns.")

The data set has 103 rows and 3 columns.


## Section 2: Geocoding the neighborhoods.

Now, we need to attach the latitude and longitude of each postal code to the dataframe containing the borough and neighborhood names.  
For this, we will use the geocoder Python library.

In [9]:
!pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [10]:
import geocoder

I tried to use the geocoder package to get the latitude and longitude of each postal code. However, the code hangs and never returns any data: as long as the lookup returns no coordinates, the `while` loop keeps retrying forever.

In [11]:
#coords = {"Latitude":[], "Longitude":[]}
#for postalCode in df['Postal Code']:
#    lat_lng = None
#    #print("Postal code:", postalCode)
#    while(lat_lng is None):
#        geo = geocoder.google('{}, Toronto, Ontario'.format(postalCode))
#        lat_lng = geo.latlng
#    coords["Latitude"].append(lat_lng[0])
#    coords["Longitude"].append(lat_lng[1])
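
One way to keep a loop like this from hanging is to cap the number of attempts. The following is only a sketch of that idea: `lookup_with_retries` and the two stand-in lookup functions are hypothetical helpers, and no real geocoding service is called.

```python
import time

def lookup_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        # Wait briefly before trying again.
        time.sleep(delay_seconds)
    return None

# Hypothetical stand-in for a geocoding call that never succeeds.
def always_fails(query):
    return None

# Hypothetical stand-in that succeeds with fixed coordinates.
def always_succeeds(query):
    return [43.753259, -79.329656]

print(lookup_with_retries(always_fails, "M3A, Toronto, Ontario"))
print(lookup_with_retries(always_succeeds, "M3A, Toronto, Ontario"))
```

With a bounded loop, a failing lookup yields `None` after a few attempts, and the caller can decide how to handle the missing coordinates.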

Instead, I used the provided CSV file containing a list of each postal code and its geospatial coordinates.

In [12]:
lat_long = pd.read_csv("https://cocl.us/Geospatial_data")
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Then, we can merge the dataframe containing the borough and neighborhood names with the one containing the latitude and longitude.  
We need to make sure that we are matching up the postal codes between the two different dataframes, so we join the two dataframes on the Postal Code column.

In [13]:
df = df.join(lat_long.set_index("Postal Code"), on="Postal Code")
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
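
A join like this silently leaves `NaN` coordinates for any postal code that is missing from the second table. As a hedged sketch, one way to check for that is a left merge with an indicator column; the two small dataframes below are made up for illustration (the postal code `M9Z` and "Example Borough" are hypothetical):

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
places = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Example Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left merge keeps every row of `places`; the indicator column records
# whether a matching postal code was found in `coords`.
merged = places.merge(coords, on="Postal Code", how="left",
                      validate="one_to_one", indicator=True)

missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print("Postal codes without coordinates:", missing)
```

The `validate="one_to_one"` argument additionally raises an error if a postal code appears more than once in either table.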


## Section 3: Exploring the neighborhoods and prepping for clustering.

Now, let's explore the neighborhoods in Toronto.

First, we need a few more imports:

In [64]:
import matplotlib.cm as cm
import matplotlib.colors as colors
#!conda install -c conda-forge folium=0.5.0 --yes
!pip install folium
import folium



Now, let's set up our Foursquare credentials.

In [20]:
import os

# Assumed environment variable names; this avoids hardcoding secrets in the notebook.
CLIENT_ID = os.environ["FOURSQUARE_CLIENT_ID"]
CLIENT_SECRET = os.environ["FOURSQUARE_CLIENT_SECRET"]
VERSION = '20180605'
LIMIT = 100

#### Let's take a look at a map of all of the different neighborhoods in our data set:

In [21]:
g = geocoder.osm("Toronto, Ontario")
toronto = g.latlng

In [22]:
map_toronto = folium.Map(location=toronto, zoom_start=11)

for lat, long, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

map_toronto

#### We will be using the `getNearbyVenues` function that was defined in a prior lab to get up to 100 venues within a 500 meter radius of each neighborhood.

In [23]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
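
The request URL in `getNearbyVenues` is assembled with string formatting. As an alternative sketch, the same query string can be built with `urllib.parse.urlencode`, which takes care of escaping. The `build_explore_url` helper and the placeholder credentials below are assumptions for illustration, and no request is sent:

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, not real ones.
url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 43.753259, -79.329656, 500, 100)
print(url)
```

Keeping the parameters in a dictionary also makes it easy to see which values (radius, limit) control how many venues come back.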

#### Now, we will use the Foursquare API to gather up to 100 venues within 500 meters of each neighborhood in the data set.

In [24]:
toronto_venues = getNearbyVenues(names=df['Neighborhood'], latitudes=df['Latitude'], longitudes=df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [226]:
print(toronto_venues.shape)
toronto_venues.head()

(2121, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Brookbanks Pool,43.751389,-79.332184,Pool
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


#### Let's take a look at how many venues each neighborhood has:

In [283]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,3,3,3,3,3,3
"Alderwood, Long Branch",7,7,7,7,7,7
"Bathurst Manor, Wilson Heights, Downsview North",23,23,23,23,23,23
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
...,...,...,...,...,...,...
"Willowdale, Willowdale East",33,33,33,33,33,33
"Willowdale, Willowdale West",4,4,4,4,4,4
Woburn,4,4,4,4,4,4
Woodbine Heights,7,7,7,7,7,7


#### It looks like some neighborhoods have only a few venues close to them. We will only consider neighborhoods with a significant number of nearby venues.  
We will define significant as having at least 10 nearby venues.

In [284]:
filtered_toronto_venues = toronto_venues.groupby('Neighborhood').filter(lambda x: x['Venue'].count() >= 10)
filtered_toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Bathurst Manor, Wilson Heights, Downsview North",23,23,23,23,23,23
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
Berczy Park,58,58,58,58,58,58
"Brockton, Parkdale Village, Exhibition Place",24,24,24,24,24,24
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",18,18,18,18,18,18
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Canada Post Gateway Processing Centre,14,14,14,14,14,14
Central Bay Street,65,65,65,65,65,65
Christie,15,15,15,15,15,15
Church and Wellesley,78,78,78,78,78,78


In [285]:
print("There are {} unique categories.".format(len(filtered_toronto_venues['Venue Category'].unique())))

There are 250 unique categories.


#### We want to cluster these neighborhoods based on the venue categories. Given that venue category is a categorical variable, in order to cluster based on it we need to convert it to a one-hot encoding.

In [287]:
toronto_onehot = pd.get_dummies(filtered_toronto_venues[["Venue Category"]])

toronto_onehot['Neighborhood'] = filtered_toronto_venues['Neighborhood']
toronto_onehot = toronto_onehot[["Neighborhood"] + list(toronto_onehot.columns)[:-1]]
toronto_onehot.head()

Unnamed: 0,Neighborhood,Venue Category_Accessories Store,Venue Category_Adult Boutique,Venue Category_Airport,Venue Category_Airport Food Court,Venue Category_Airport Gate,Venue Category_Airport Lounge,Venue Category_Airport Service,Venue Category_Airport Terminal,Venue Category_American Restaurant,...,Venue Category_Train Station,Venue Category_Vegetarian / Vegan Restaurant,Venue Category_Video Game Store,Venue Category_Video Store,Venue Category_Vietnamese Restaurant,Venue Category_Warehouse Store,Venue Category_Wine Bar,Venue Category_Wine Shop,Venue Category_Wings Joint,Venue Category_Yoga Studio
9,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
13,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Now we will group the rows by neighborhood and calculate the frequency of occurrence of each venue category for that neighborhood.

In [289]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Venue Category_Accessories Store,Venue Category_Adult Boutique,Venue Category_Airport,Venue Category_Airport Food Court,Venue Category_Airport Gate,Venue Category_Airport Lounge,Venue Category_Airport Service,Venue Category_Airport Terminal,Venue Category_American Restaurant,...,Venue Category_Train Station,Venue Category_Vegetarian / Vegan Restaurant,Venue Category_Video Game Store,Venue Category_Video Store,Venue Category_Vietnamese Restaurant,Venue Category_Warehouse Store,Venue Category_Wine Bar,Venue Category_Wine Shop,Venue Category_Wings Joint,Venue Category_Yoga Studio
0,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556
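
To make the frequency calculation concrete, here is a minimal sketch on a tiny, made-up venue list (the neighborhoods "A" and "B" and their venues are hypothetical). Because each one-hot column holds only 0s and 1s, its mean within a neighborhood is the share of that neighborhood's venues in that category:

```python
import pandas as pd

# Hypothetical venue list: two neighborhoods, three categories.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Pool", "Cafe", "Cafe", "Cafe"],
})

# One indicator column per category; dtype=float keeps the columns numeric.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column within a group is that category's share of the group's venues.
frequencies = onehot.groupby("Neighborhood").mean()
print(frequencies)
```

Each row of the result sums to 1, which is why neighborhoods with very different venue counts can still be compared.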


#### Now, we want to determine the top 10 most common types of venue for each neighborhood.

In [290]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [291]:
from itertools import repeat, chain

num_top_venues = 10
indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for indicator, value in zip(chain(indicators, repeat("th")), range(1, num_top_venues+1)):
    columns.append('{}{} Most Common Venue'.format(value, indicator))

toronto_venues_sorted = pd.DataFrame(columns=columns)
toronto_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in range(toronto_grouped.shape[0]):
    toronto_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
    
toronto_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Venue Category_Coffee Shop,Venue Category_Bank,Venue Category_Park,Venue Category_Fried Chicken Joint,Venue Category_Sandwich Place,Venue Category_Bridal Shop,Venue Category_Diner,Venue Category_Restaurant,Venue Category_Deli / Bodega,Venue Category_Middle Eastern Restaurant
1,"Bedford Park, Lawrence Manor East",Venue Category_Italian Restaurant,Venue Category_Thai Restaurant,Venue Category_Coffee Shop,Venue Category_Sandwich Place,Venue Category_Indian Restaurant,Venue Category_Pub,Venue Category_Butcher,Venue Category_Sushi Restaurant,Venue Category_Liquor Store,Venue Category_Fast Food Restaurant
2,Berczy Park,Venue Category_Coffee Shop,Venue Category_Cocktail Bar,Venue Category_Bakery,Venue Category_Seafood Restaurant,Venue Category_Restaurant,Venue Category_Beer Bar,Venue Category_Farmers Market,Venue Category_Cheese Shop,Venue Category_Pharmacy,Venue Category_Beach
3,"Brockton, Parkdale Village, Exhibition Place",Venue Category_Café,Venue Category_Performing Arts Venue,Venue Category_Nightclub,Venue Category_Coffee Shop,Venue Category_Breakfast Spot,Venue Category_Bakery,Venue Category_Burrito Place,Venue Category_Stadium,Venue Category_Bar,Venue Category_Intersection
4,"Business reply mail Processing Centre, South C...",Venue Category_Yoga Studio,Venue Category_Garden,Venue Category_Restaurant,Venue Category_Recording Studio,Venue Category_Pizza Place,Venue Category_Park,Venue Category_Light Rail Station,Venue Category_Garden Center,Venue Category_Fast Food Restaurant,Venue Category_Spa
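
The column labels above are built by pairing the suffixes 'st', 'nd', 'rd' and then 'th' with the numbers 1 through 10. That works for this range, but the same pairing would label 21 as "21th". As a sketch of a more general approach, a small helper (the name `ordinal` is our own, not from a library) can pick the suffix from the number itself:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    # Numbers ending in 11, 12, or 13 always take "th".
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

num_top_venues = 10
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```

For `num_top_venues = 10` this produces the same column names as the cell above.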


## Section 4: Clustering the neighborhoods.

Now, using the KMeans class from scikit-learn, we will cluster the neighborhoods based on the frequency of each venue category (the `toronto_grouped` table). We will group the neighborhoods into 5 distinct clusters.

In [292]:
from sklearn.cluster import KMeans

In [293]:
k = 5

toronto_grouped_clustering = toronto_grouped.drop(columns="Neighborhood")

kmeans = KMeans(n_clusters=k, random_state=42).fit(toronto_grouped_clustering)

kmeans.labels_

array([0, 2, 2, 1, 1, 1, 2, 2, 1, 2, 0, 2, 2, 0, 2, 0, 1, 3, 2, 3, 2, 1,
       0, 1, 3, 2, 1, 4, 1, 3, 2, 0, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2,
       2, 0, 2, 1, 2], dtype=int32)
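
Before mapping the clusters, it helps to see how many neighborhoods landed in each one. This small sketch simply tallies the labels printed above with `collections.Counter`; the list is copied by hand from that output, so it reflects this particular run only:

```python
from collections import Counter

# Cluster labels copied from the kmeans.labels_ output above.
labels = [0, 2, 2, 1, 1, 1, 2, 2, 1, 2, 0, 2, 2, 0, 2, 0, 1, 3, 2, 3, 2, 1,
          0, 1, 3, 2, 1, 4, 1, 3, 2, 0, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2,
          2, 0, 2, 1, 2]

cluster_sizes = Counter(labels)
for cluster in sorted(cluster_sizes):
    print("Cluster {}: {} neighborhoods".format(cluster, cluster_sizes[cluster]))
```

In this run the clusters are quite uneven: cluster 2 holds about half of the 49 neighborhoods, while cluster 4 contains a single one.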

#### Now that we've determined the cluster label for each neighborhood, we need to combine the labels back into the original dataframe.

In [294]:
toronto_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

toronto_merged = toronto_merged.join(toronto_venues_sorted.set_index('Neighborhood'), on='Neighborhood', how="inner")


#### Finally, we can visualize the resulting cluster analysis on a map of Toronto.

In [298]:
clusterMap = folium.Map(location=toronto, zoom_start=11)

colormap = [cm.tab20(i) for i in range(k)]
colormap = list(map(colors.rgb2hex, colormap))

for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=colormap[cluster],
        fill=True,
        fill_color=colormap[cluster],
        fill_opacity=1.0).add_to(clusterMap)
    
clusterMap