<h1>Applied Data Science Capstone Project</h1>

<h2>Assignment #2: Segmenting and Clustering Neighborhoods in Toronto</h2>

<h2>Part 02. Working with geodata</h2>

<h3>1. Import the required libraries</h3>

In [2]:
# import required libraries
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

<h3>2. Working with the table 'List of postal codes of Canada: M' from Wikipedia</h3>

<h4>1. Retrieve the data and create a dataframe</h4>

In [3]:
# retrieve the data from Wikipedia
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(source.content, 'lxml')

# There are several tables on the page, so we need to find the exact one:
table = soup.find('table', class_='wikitable sortable')
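
<p>Because the page holds several tables, it can help to confirm that the selected one is really the postal code table. The following is a minimal sketch, assuming the target table is the one whose header row mentions a postal code. The helper name <code>find_table_by_header</code> and the inline sample HTML are hypothetical, and the built-in <code>html.parser</code> is used instead of <code>lxml</code> only to keep the example light on dependencies.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the Wikipedia page
SAMPLE_HTML = """
<table class="wikitable"><tr><th>Legend</th></tr><tr><td>n/a</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1B</td><td>Scarborough</td><td>Malvern</td></tr>
</table>
"""

def find_table_by_header(html, keyword):
    """Return the first <table> whose <th> cells mention `keyword`, else None."""
    soup = BeautifulSoup(html, 'html.parser')
    for candidate in soup.find_all('table'):
        headers = [th.get_text(strip=True).lower() for th in candidate.find_all('th')]
        if any(keyword.lower() in header for header in headers):
            return candidate
    return None

sample_table = find_table_by_header(SAMPLE_HTML, 'postal')
header_cells = [th.get_text(strip=True) for th in sample_table.find_all('th')]
print(header_cells)
```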

In [4]:
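# Prepare the csv file (newline='' is what the csv module expects, so no blank rows appear on Windows):
csv_file = open('neighborhoods.csv', 'w', newline='', encoding='utf-8')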
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['PostalCode', 'Borough', 'Neighborhood'])

33

In [5]:
# Find all rows of the table and write their cells into the csv file.
# Header rows use <th> instead of <td>, so rows with fewer than three <td> cells are skipped
# (otherwise the previous row's values would be written again).
for row in table.find_all('tr')[1:]:
    entries = row.find_all('td')
    if len(entries) < 3:
        continue
    postal_code = entries[0].get_text(strip=True)
    borough = entries[1].get_text(strip=True)
    neighborhood = entries[2].get_text(strip=True)
    csv_writer.writerow([postal_code, borough, neighborhood])

csv_file.close()
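
<p>The same write step could also be organised around a <code>with</code> block, so the file is closed even if parsing fails halfway. This is only a sketch under stated assumptions: the sample table and the helper <code>table_rows</code> are hypothetical, and <code>io.StringIO</code> stands in for the real <code>neighborhoods.csv</code> file so the example runs without touching the disk.</p>

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical sample table, including a header row and an incomplete row
SAMPLE_TABLE = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>broken row</td></tr>
</table>
"""

def table_rows(table):
    """Yield [postal_code, borough, neighborhood] for each complete data row."""
    for row in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if len(cells) >= 3:  # header rows contain only <th> cells, so they are skipped
            yield cells[:3]

sample_table = BeautifulSoup(SAMPLE_TABLE, 'html.parser').find('table')

# io.StringIO stands in for open('neighborhoods.csv', 'w', newline='')
with io.StringIO(newline='') as buffer:
    writer = csv.writer(buffer)
    writer.writerow(['PostalCode', 'Borough', 'Neighborhood'])
    writer.writerows(table_rows(sample_table))
    csv_text = buffer.getvalue()

print(csv_text)
```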

In [6]:
# Create the dataframe:
neighborhoods = pd.read_csv('neighborhoods.csv')

<h4>2. Process the data</h4>

In [7]:
# Exclude rows that don't have an assigned borough
neighborhoods = neighborhoods[neighborhoods.Borough != 'Not assigned']

# If a postal code covers several neighborhoods, combine them into one comma-separated row
neighborhoods = neighborhoods.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()

# Change unassigned neighborhood to the same value as borough
neighborhoods.loc[neighborhoods['Neighborhood'] == 'Not assigned', 'Neighborhood'] = neighborhoods['Borough']
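
<p>To see what the three processing steps do, here is a small sketch on made-up rows. It assumes the source table lists one neighborhood per row, which is the layout the <code>groupby</code> step is meant for. The dataframe <code>sample</code> and its contents are hypothetical.</p>

```python
import pandas as pd

# Hypothetical rows in a one-row-per-neighborhood layout
sample = pd.DataFrame({
    'PostalCode':   ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows without an assigned borough
sample = sample[sample['Borough'] != 'Not assigned']

# 2. One row per postal code, neighborhoods joined with commas
sample = (sample.groupby(['PostalCode', 'Borough'])['Neighborhood']
                .apply(', '.join)
                .reset_index())

# 3. Fall back to the borough name where no neighborhood is assigned
unassigned = sample['Neighborhood'] == 'Not assigned'
sample.loc[unassigned, 'Neighborhood'] = sample.loc[unassigned, 'Borough']

print(sample)
```

<p>In this sample, M1A is dropped, the two M5A rows collapse into one, and M7A takes its borough name as the neighborhood.</p>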

<h4>3. See the result</h4>

In [8]:
# see the dataframe
neighborhoods.head(10)

,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [9]:
# Print the shape (number of rows, number of columns) of the dataframe
neighborhoods.shape

(103, 3)
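
<p>Beyond checking the shape, a few sanity checks can catch leftovers from the cleaning step. The function below is a sketch, not part of the assignment: <code>check_neighborhoods</code> and both sample dataframes are hypothetical, and the checks assume each postal code should appear exactly once.</p>

```python
import pandas as pd

def check_neighborhoods(df):
    """Return a list of human-readable problems found in the dataframe."""
    problems = []
    if df['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    if (df[['Borough', 'Neighborhood']] == 'Not assigned').any().any():
        problems.append("'Not assigned' values remain")
    if df.isna().any().any():
        problems.append('missing values')
    return problems

# Hypothetical clean and dirty samples
good = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                     'Borough': ['Scarborough', 'Scarborough'],
                     'Neighborhood': ['Rouge, Malvern', 'Highland Creek']})
bad = pd.DataFrame({'PostalCode': ['M1B', 'M1B'],
                    'Borough': ['Scarborough', 'Not assigned'],
                    'Neighborhood': ['Rouge', None]})

print(check_neighborhoods(good))
print(check_neighborhoods(bad))
```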

<h3>3. Adding geocodes</h3>

<h4>1. Acquire the data</h4>

<p><em>Disclaimer:</em> because we were warned that the Geocoder package might be buggy, I did not risk it and instead loaded the coordinates from the CSV file kindly provided by the instructor.</p>

In [10]:
# load the coordinates and rename the columns so that 'PostalCode' matches the neighborhoods dataframe
geocodes = pd.read_csv('https://cocl.us/Geospatial_data', header=0, names=['PostalCode', 'Latitude', 'Longitude'])

<h4>2. Add location data to the dataframe</h4>

In [11]:
# merge the dataframes on the postal code (inner join by default)
neighborhoods_geo = pd.merge(neighborhoods, geocodes, on='PostalCode')
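
<p>An inner join silently drops postal codes that have no coordinates. If that matters, one option is a left join with pandas' <code>indicator</code> and <code>validate</code> parameters, sketched below on made-up data. The dataframes <code>left</code> and <code>coords</code> and the postal code <code>M9Z</code> are hypothetical.</p>

```python
import pandas as pd

# Hypothetical inputs: 'M9Z' has no matching coordinates
left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Hypothetical']})
coords = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# validate raises MergeError if a postal code repeats on either side;
# indicator adds a '_merge' column showing where each row came from
merged = pd.merge(left, coords, on='PostalCode', how='left',
                  validate='one_to_one', indicator=True)

# Postal codes present in the left frame only, i.e. without coordinates
missing = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```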

<h4>3. See the result</h4>

In [12]:
neighborhoods_geo.head(10)

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
