# Coursera Capstone Week 3: Toronto Neighborhoods Project

This notebook has three parts. In Part 1, I do my first web scrape using BeautifulSoup. In Part 2, I clean the data obtained from the scrape. In Part 3, I cluster the neighborhoods of Toronto, Canada.

##### Notes:
In Part 1, some printouts are lengthy when viewed on GitHub. You may want to use the side scroller to move past each one after a quick look.

In Part 3, folium maps do not render on GitHub. To see the full output without downloading this notebook, visit https://nbviewer.jupyter.org/ and paste this notebook's GitHub URL into the prompt.

## Part 1: First Data Scrape of Wikipedia

In Part 1, I am learning how to scrape data from a website (hopefully, anyway).

Website to be scraped: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

This page lists neighborhoods by postal code within the city of Toronto, Canada. The aim is to scrape the Wikipedia page, write the table to a .csv file, and load that file into a dataframe for further analysis.

Ultimately, these neighborhoods will be matched to GPS coordinates, which are then used to pull data on various venues through the Foursquare API and form clusters of those venues.

In [1]:
# import necessary libraries

import numpy as np
import pandas as pd

from urllib.request import urlopen
from bs4 import BeautifulSoup

In [2]:
# define the function that takes the desired url 
# and read/store its contents

def grab_html_contents(url):
    html = urlopen(url)
    html_page = html.read()
    html.close()
    soup = BeautifulSoup(html_page, 'html.parser')
    return soup
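
Before pointing the function at the live page, it can help to see what BeautifulSoup does with a table. The sketch below is only an illustration: `sample_html` is a made-up snippet shaped like the Wikipedia table, not the real page, and `records` is a hypothetical name. It selects a table by its class string and collects the text of each data row.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded page: two tables, only one of
# which carries the class string we want.
sample_html = """
<table class="navbox"><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td>Parkwoods
</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first table whose class attribute is exactly this string
table = soup.find('table', {'class': 'wikitable sortable'})

records = []
for row in table.find_all('tr'):
    cells = row.find_all('td')
    # the header row uses <th>, so it yields no <td> cells and is skipped
    if len(cells) == 3:
        # get_text(strip=True) drops surrounding whitespace, including newlines
        records.append(tuple(cell.get_text(strip=True) for cell in cells))

print(records)
```

Using `get_text(strip=True)` removes trailing newlines as well as spaces, which matters because the last cell of a row often ends with a line break.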

In [3]:
# Look at html page contents of desired page
# and parse through html tables to find the table desired

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

html_content = grab_html_contents(url)

tables = html_content.find_all('table')
for table in tables:
    print(table.prettify())

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postcode
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Parkwoods" title="Parkwoods">
     Parkwoods
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Victoria_Village" title="Victoria Village">
     Victoria Village
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
     Downtown Toronto
    </a>
   </td>
   <td>
    <a href="

In [4]:
# After inspecting the html, we see that 'wikitable sortable' 
# is the table we need so now we'll loop over the table
# data and perform the scrape

table = html_content.find('table', 
                     {'class': 'wikitable sortable'})

rows = table.find_all('tr')

# create .csv file for data to be saved to

file_name = 'toronto_postal_data.csv'
f = open(file_name,'w')

headers = 'PostalCode,Borough,Neighborhood\n'

f.write(headers)

# postal_data = []
# borough_data = []
# neighborhood_data = []

for index, row in enumerate(rows):
    cells = row.find_all('td')
    if len(cells) > 1:
        postal_data = cells[0].text.strip(' ')
        borough_data = cells[1].text.strip(' ')
        neighborhood_data = cells[2].text.strip()
        print('Observation {}'.format(index))
        print('Postal Code: ' + postal_data)
        print('Borough: ' + borough_data)
        print('Neighborhood: ' + neighborhood_data)
        f.write(postal_data + ',' + borough_data + ',' + neighborhood_data + '\n')

f.close()

Observation 1
Postal Code: M1A
Borough: Not assigned
Neighborhood: Not assigned
Observation 2
Postal Code: M2A
Borough: Not assigned
Neighborhood: Not assigned
Observation 3
Postal Code: M3A
Borough: North York
Neighborhood: Parkwoods
Observation 4
Postal Code: M4A
Borough: North York
Neighborhood: Victoria Village
Observation 5
Postal Code: M5A
Borough: Downtown Toronto
Neighborhood: Harbourfront
Observation 6
Postal Code: M6A
Borough: North York
Neighborhood: Lawrence Heights
Observation 7
Postal Code: M6A
Borough: North York
Neighborhood: Lawrence Manor
Observation 8
Postal Code: M7A
Borough: Downtown Toronto
Neighborhood: Queen's Park
Observation 9
Postal Code: M8A
Borough: Not assigned
Neighborhood: Not assigned
Observation 10
Postal Code: M9A
Borough: Etobicoke
Neighborhood: Islington Avenue
Observation 11
Postal Code: M1B
Borough: Scarborough
Neighborhood: Rouge
Observation 12
Postal Code: M1B
Borough: Scarborough
Neighborhood: Malvern
Observation 13
Postal Code: M2B
Borough: No

In [5]:
toronto_df = pd.read_csv('toronto_postal_data.csv')
print('Number of observations: {} \n'.format(toronto_df.shape[0]))
print('Number of features: {} \n'.format(toronto_df.shape[1]))
toronto_df.head()

Number of observations: 287 

Number of features: 3 



Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
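
The scrape in Part 1 builds each .csv line by joining the three fields with commas. That works for this table, but it would misalign the columns if a field ever contained a comma itself. As a hedged alternative, the standard library's `csv` module quotes such fields automatically. The rows below are made up for illustration (the second one contains a comma on purpose), and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Hypothetical rows; the second neighborhood contains a comma on purpose
rows = [
    ('M3A', 'North York', 'Parkwoods'),
    ('M9Z', 'Example Borough', 'First Place, Second Place'),
]

buffer = io.StringIO()  # stands in for open(file_name, 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['PostalCode', 'Borough', 'Neighborhood'])
writer.writerows(rows)

# Reading the text back recovers exactly three fields per row
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[2])
```

With a real file, the same `writer` calls would sit inside a `with open(...)` block so the file is closed even if the loop raises an error.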


## Part 2: Cleaning Data Scraped From Web

In this part, I clean the data scraped from Wikipedia: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

First, import some necessary Python libraries.

In [6]:
# import necessary libraries
import numpy as np
import pandas as pd
import geocoder

Now read in the .csv file as a pandas dataframe and take a peek at the initial dataframe (raw scrape).

In [7]:
# create dataframe from .csv file that was scraped from wikipedia
toronto_df = pd.read_csv('toronto_postal_data.csv')

toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


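Next, I remove the rows whose borough is 'Not assigned'. In this scrape, that eliminates all 'Not assigned' neighborhoods as well. The info() method shows that 210 rows remain; because 'Not assigned' is an ordinary string rather than a missing value, the non-null counts alone do not confirm that the neighborhood column is clean.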

In [8]:
# filter out 'Not assigned' values
toronto_df = toronto_df[toronto_df['Borough'] != 'Not assigned']

toronto_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 2 to 285
Data columns (total 3 columns):
PostalCode      210 non-null object
Borough         210 non-null object
Neighborhood    210 non-null object
dtypes: object(3)
memory usage: 6.6+ KB
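
Since info() only counts non-null entries, a direct comparison is one way to confirm that no 'Not assigned' neighborhoods survive the borough filter. The sketch below uses a small made-up dataframe (`sample_df` and `remaining` are hypothetical names) rather than the real scrape:

```python
import pandas as pd

# Made-up miniature of the scraped data
sample_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Not assigned', 'Parkwoods', "Queen's Park"],
})

# Same filter as in the cell above
filtered = sample_df[sample_df['Borough'] != 'Not assigned']

# Count placeholder strings left in the Neighborhood column
remaining = (filtered['Neighborhood'] == 'Not assigned').sum()
print(remaining)
```

A count of zero on the real dataframe would back up the claim; a nonzero count would mean some assigned boroughs still carry a placeholder neighborhood.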


Just taking a peek at the nice, neat structure of the dataframe so far.

In [9]:
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


The desired dataframe should have one row per postal code, paired with its borough, with all neighborhoods for that postal code/borough pair merged into a single comma-separated string in the neighborhood column. To start this merge, I build a list of unique (postal code, borough) tuples. Converting the zip object to a set removes all duplicates, and converting the set to a list lets me iterate over it in the next step. Because a set is unordered, the rows of the resulting dataframe come out in arbitrary order until they are sorted later.

In [10]:
# create list of tuples with unique postal code and corresponding unique borough
uni_list = zip(list(toronto_df['PostalCode']),list(toronto_df['Borough']))
uni_list = set(uni_list)
uni_list = list(uni_list)

To finish the merge, I create a new, empty dataframe with the proper column names. Then I loop over the list of tuples defined above and join the neighborhoods of each postal code into one row. Finally, I take a peek at the resulting dataframe.

In [11]:
# create new, filtered dataframe and take a (big) peek
toronto_cleaned_df = pd.DataFrame(columns = ['PostalCode', 'Borough', 'Neighborhood'])

for i, tuple_ in enumerate(uni_list):
    postal_list = list(toronto_df[toronto_df['PostalCode'] == tuple_[0]]['Neighborhood'])
    toronto_cleaned_df.loc[i] = [tuple_[0], tuple_[1],
                                 ', '.join(postal_list)]


toronto_cleaned_df.head(50)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3L,North York,Downsview West
1,M4A,North York,Victoria Village
2,M4C,East York,Woodbine Heights
3,M3J,North York,"Northwood Park, York University"
4,M3H,North York,"Bathurst Manor, Downsview North, Wilson Heights"
5,M6P,West Toronto,"High Park, The Junction South"
6,M4W,Downtown Toronto,Rosedale
7,M9L,North York,Humber Summit
8,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
9,M1S,Scarborough,Agincourt
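
The set-and-loop merge above can also be expressed with a pandas `groupby`. This is a sketch of an alternative, not the approach used in this notebook; `sample_df` and `merged` are hypothetical names, and the rows are a small subset of values seen in the scrape.

```python
import pandas as pd

# A few rows in the same shape as the filtered scrape
sample_df = pd.DataFrame({
    'PostalCode': ['M6A', 'M6A', 'M3A', 'M1B', 'M1B'],
    'Borough': ['North York', 'North York', 'North York',
                'Scarborough', 'Scarborough'],
    'Neighborhood': ['Lawrence Heights', 'Lawrence Manor', 'Parkwoods',
                     'Rouge', 'Malvern'],
})

# Group by the (postal code, borough) pair and join each group's
# neighborhoods into one comma-separated string
merged = (
    sample_df
    .groupby(['PostalCode', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(merged)
```

By default `groupby` sorts the group keys, so the result comes back ordered by postal code, while the neighborhoods within each group keep their original order.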


Checking the number of observations and features in the new, cleaned dataframe.

In [12]:
# check number of observations and features
toronto_cleaned_df.shape

(103, 3)

We see that there are 103 observations (103 unique postal codes).

### Assigning Lat./Long. Coords. to Postal Codes

Geocoder has not been working for me and, as stated in the assignment description, it is not always reliable. I have therefore decided to use the provided .csv file (https://cocl.us/Geospatial_data), which lists each postal code with its latitude and longitude. I create a dataframe from the .csv, sort both dataframes by postal code (ascending), drop the postal code column from the lat/long dataframe, and then concatenate the lat/long columns onto the cleaned dataframe. Because `pd.concat` with `axis=1` aligns rows on the index, the coordinates match their postal codes only when both dataframes hold the same postal codes in the same order with matching indexes.

In [13]:
# read dataframe from geospatial .csv file (provided in assignment description),
# sort rows by postal code, and reset the index so it follows the sorted order
lat_long_df = pd.read_csv('Geospatial_Coordinates.csv')
lat_long_df.sort_values(by=['Postal Code'], inplace=True)
lat_long_df.reset_index(drop=True, inplace=True)

lat_long_df.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [14]:
# drop 'Postal Code' column from lat_long_df, sort the cleaned dataframe, and concatenate
lat_long_df.drop(['Postal Code'], axis=1, inplace=True)
toronto_cleaned_df.sort_values(by=['PostalCode'], inplace=True) # sorts cleaned dataframe (index is out of order)
toronto_cleaned_df.reset_index(drop=True,inplace=True) # resets the index and drops the old index
toronto_geo_df = pd.concat([toronto_cleaned_df, lat_long_df], axis=1, sort=False) # concatenate dataframes' columns

toronto_geo_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
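
The sort-then-concatenate step depends on both dataframes lining up row for row. A merge keyed on the postal code does not need that assumption. The sketch below is one possible alternative, not the method used in this notebook; `cleaned`, `coords`, and `geo` are hypothetical names, with a couple of values copied from the outputs above.

```python
import pandas as pd

# Deliberately out of order to show that row order does not matter
cleaned = pd.DataFrame({
    'PostalCode': ['M1C', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Highland Creek, Rouge Hill, Port Union',
                     'Rouge, Malvern'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Rename the key so both frames share a column name, then join on it.
# validate='one_to_one' raises if a postal code is duplicated on either side.
geo = cleaned.merge(
    coords.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left',
    validate='one_to_one',
)
print(geo)
```

With `how='left'`, a postal code missing from the coordinates file would show up as NaN in the latitude and longitude columns instead of silently shifting every row below it.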


## Part 3: Clustering Neighborhoods in Toronto, ON, Canada

I will cluster neighborhoods based on postal codes. In the rendered map, each marker's popup should display the postal code followed by the borough that contains it.

In [15]:
# importing necessary libraries for clustering and visuals

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# import folium for map rendering
import folium

In [16]:
# grabbing the lat/long coords for Toronto, Ontario, Canada

address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


Please note that GitHub does not display folium maps. If needed, visit https://nbviewer.jupyter.org/ and paste my GitHub link into the URL prompt to see the fully rendered maps without downloading and running the notebook.

In [17]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=9)

# add markers to map
for lat, lng, borough, postal in zip(toronto_geo_df['Latitude'], toronto_geo_df['Longitude'], toronto_geo_df['Borough'], toronto_geo_df['PostalCode']):
    label = '{}, {}'.format(postal, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Clustering Within Scarborough

I'm a fan of Simon & Garfunkel's "Scarborough Fair." The song is not about this Scarborough in Canada, but the name reminds me of it, so I'll investigate clusters in the Scarborough borough.

In [18]:
# restrict the toronto dataset to a subset with Scarborough as the only borough
toronto_sub_df = toronto_geo_df[toronto_geo_df['Borough'].str.contains('Scarborough')].reset_index(drop=True)
toronto_sub_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [19]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=9)

# add markers to map
for lat, lng, label in zip(toronto_sub_df['Latitude'], toronto_sub_df['Longitude'], toronto_sub_df['PostalCode']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto