# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, we scrape postal code data for the city of Toronto from Wikipedia. In part 1, we wrangle and clean the data so that we can retrieve information about venues in the city's neighborhoods using the __Foursquare API__. In part 2, we use a _k_-means algorithm to cluster the neighborhoods by their most popular venues. Finally, in part 3, we visualize the clusters on a map of Toronto.

## Part 0 - Load Necessary Libraries

In [2]:
import pandas as pd
import numpy as np

from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values

import requests  # handle requests
from bs4 import BeautifulSoup  # web scraper

from pandas import json_normalize  # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium  # map rendering library

## Part 1 - Web Scraping

Scrape the Wikipedia page using an HTML parser.

In [112]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

req = requests.get(url)
req.raise_for_status()  # stop early if the page could not be fetched

soup = BeautifulSoup(req.content, 'html.parser')

Find the table in the HTML, and extract the data in each column. Then create a DataFrame to store all the data.

In [113]:
soup_table = soup.find('table')

# function to get the stripped text from a list of HTML tags
def get_text_map(lst):
    return [tag.get_text(strip=True) for tag in lst]

# get all the column data
data = soup_table.find_all('td')
data = get_text_map(data)

# get the data for each column
p_codes = data[::3]
bghs = data[1::3]
nbhs = data[2::3]

# replace slashes with commas for each neighborhood
nbhs = list(map(lambda s: s.replace(" / ", ", "), nbhs))

toronto_df = pd.DataFrame(zip(p_codes, bghs, nbhs), columns=['PostalCode', 'Borough', 'Neighborhood'])
toronto_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
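
The stride slicing used above (`data[::3]`, `data[1::3]`, `data[2::3]`) only works if every table row contributes exactly three cells. The sketch below illustrates the idea on a small hand-written list in plain Python. The cell values are taken from the output above, the raw `" / "` separator is inferred from the `replace` call, and the names `cells` and `n_cols` and the length check are assumptions added for illustration, not part of the original notebook.

```python
# Flat list of cell texts, in the order find_all('td') would yield them
# for a three-column table read row by row (illustrative sample only).
cells = [
    "M1A", "Not assigned", "",
    "M3A", "North York", "Parkwoods",
    "M5A", "Downtown Toronto", "Regent Park / Harbourfront",
]

n_cols = 3  # assumption: every row has exactly three cells

# Guard against a ragged table before slicing by stride.
if len(cells) % n_cols != 0:
    raise ValueError("cell count is not a multiple of the column count")

# Every n_cols-th element, starting at offset 0, 1 and 2, gives one column.
p_codes = cells[0::n_cols]
boroughs = cells[1::n_cols]
neighborhoods = [s.replace(" / ", ", ") for s in cells[2::n_cols]]

# Re-assemble the columns into rows, as zip() does in the notebook.
rows = list(zip(p_codes, boroughs, neighborhoods))
print(rows)
```

If the page layout ever changes so that rows have a different number of cells, the length check fails loudly instead of silently misaligning the columns.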


Filter out all rows whose __Borough__ value is 'Not assigned'. For any remaining row with no assigned neighborhood, use the borough as the neighborhood.

In [115]:
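# drop rows with no boroughs; copy the result so that later column
# assignments do not trigger pandas' SettingWithCopyWarning
toronto_df = toronto_df[toronto_df.Borough != 'Not assigned'].copy()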

# confirm that there are no boroughs with the value 'Not assigned'
print(toronto_df['Borough'].value_counts())

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
York                 5
East Toronto         5
Mississauga          1
Name: Borough, dtype: int64


In [103]:
# replace empty (unassigned) neighborhoods with their boroughs
num_na_before = toronto_df[toronto_df['Neighborhood'] == ''].shape[0]

toronto_df['Neighborhood'] = np.where(toronto_df['Neighborhood'] == '', 
                                      toronto_df['Borough'], 
                                      toronto_df['Neighborhood'])

num_na_after = toronto_df[toronto_df['Neighborhood'] == toronto_df['Borough']].shape[0]

# confirm that the number of neighborhoods now equal to their borough
# matches the number of empty neighborhoods counted before the replacement
print("Number of not assigned neighborhoods before:", num_na_before)
print("Number of not assigned neighborhoods after:", num_na_after)
print('\n')
print(toronto_df['Neighborhood'].value_counts())

Number of not assigned neighborhoods before: 0
Number of not assigned neighborhoods after: 0


Downsview                                           4
Don Mills                                           2
Willowdale                                          2
Dufferin, Dovercourt Village                        1
Garden District, Ryerson                            1
                                                   ..
Roselawn                                            1
Del Ray, Mount Dennis, Keelsdale and Silverthorn    1
Runnymede, The Junction North                       1
St. James Town, Cabbagetown                         1
Cedarbrae                                           1
Name: Neighborhood, Length: 98, dtype: int64
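
Because both counts above are 0, the fallback rule is never exercised on the scraped data. As a minimal sketch, the two cleaning rules can be checked on a small hand-made frame instead. The rows below are hypothetical (in particular the third row, which has a borough but an empty neighborhood), and the names `sample` and `cleaned` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one unassigned borough, one complete row,
# and one row with a borough but an empty neighborhood.
sample = pd.DataFrame(
    [
        ("M1A", "Not assigned", ""),
        ("M3A", "North York", "Parkwoods"),
        ("M7A", "Queen's Park", ""),
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule 1: drop rows without a borough (copy to avoid chained-assignment warnings).
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighborhood is empty, fall back to the borough.
cleaned["Neighborhood"] = np.where(
    cleaned["Neighborhood"] == "",
    cleaned["Borough"],
    cleaned["Neighborhood"],
)

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Here the first row is dropped and the third row's neighborhood takes its borough's name, which is the behavior the notebook's `np.where` call is meant to produce.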
