# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [5]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (public location since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


## 1. Download and Explore Dataset

We'll need some more libraries.

In [6]:
# import the library we use to open URLs
import urllib.request

This is the URL of the page whose table we will scrape into a DataFrame.

In [7]:
# specify which URL/web page we are going to be scraping
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [8]:
# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)

In [9]:
# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup

In [11]:
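# parse the HTML from our URL into the BeautifulSoup parse tree format
# naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(page, "html.parser")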

Uncomment the following cell only if you want to see all of the scraped HTML.

In [13]:
# print(soup.prettify())

In [19]:
# use the 'find_all' function to bring back all instances of the 'table' tag in the HTML and store in 'all_tables' variable
all_tables=soup.find_all("table")
# all_tables

In [21]:
right_table=soup.find('table', class_='wikitable sortable')
# right_table

The next cell collects the three columns of each table row (postal code, borough, neighborhood) into separate lists.

In [34]:
A=[]
B=[]
C=[]

# find_all is the current name of the older findAll alias
for row in right_table.find_all('tr'):
    cells=row.find_all('td')
    if len(cells)==3:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
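
The row-by-row extraction above can be tried without any network access. The following is a minimal sketch, assuming a tiny hand-written HTML table that imitates the layout of the Wikipedia page; the `html` string and the `rows` list are hypothetical names used only for illustration, and `get_text(strip=True)` is swapped in for `find(text=True)` so that the trailing newlines are removed during extraction.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only
html = (
    '<table class="wikitable sortable">'
    "<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>"
    "<tr><td>M1A\n</td><td>Not assigned\n</td><td>\n</td></tr>"
    "<tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    # The header row holds <th> cells, so it has no <td> cells and is skipped
    if len(cells) == 3:
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Collecting whole rows in one list, rather than three parallel lists, keeps each record together and lets the result be passed directly to `pd.DataFrame(rows, columns=[...])`.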

Here is a first look at the resulting DataFrame.

In [72]:
df=pd.DataFrame(A,columns=['Postal Code'])
df['Borough']=B
df['Neighborhood']=C
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A\n,Not assigned\n,\n
1,M2A\n,Not assigned\n,\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,Regent Park / Harbourfront\n


Still pretty messy. Every value ends with a newline character (\n), so we need to remove those.

In [73]:
df["Postal Code"] = df["Postal Code"].str.replace("\n", "")
df["Borough"] = df["Borough"].str.replace("\n", "")
df["Neighborhood"] = df["Neighborhood"].str.replace("\n", "")
df.shape

(180, 3)
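
As an alternative to three separate `str.replace` calls, the cleanup could be applied to every column at once. This is a sketch on a hypothetical two-row sample (`raw` and `clean` are illustrative names), assuming every column holds strings; note that `str.strip()` removes all leading and trailing whitespace, not only `\n`.

```python
import pandas as pd

# Hypothetical sample that mimics the scraped values, newline included
raw = pd.DataFrame({
    "Postal Code": ["M1A\n", "M3A\n"],
    "Borough": ["Not assigned\n", "North York\n"],
    "Neighborhood": ["\n", "Parkwoods\n"],
})

# Strip surrounding whitespace from every (string) column
clean = raw.apply(lambda col: col.str.strip())

print(clean)
```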

Next, drop the rows whose Borough is "Not assigned".

In [75]:
df.drop(df.loc[df['Borough']=='Not assigned'].index, inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


In [76]:
# renumber the rows from 0 and discard the old index
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
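
The drop and the index reset can also be written as one expression with a boolean mask. The sketch below uses a hypothetical four-row sample (`sample` and `assigned` are illustrative names) and is meant to show the pattern, not to replace the cells above.

```python
import pandas as pd

# Hypothetical sample with two unassigned postal codes
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["", "", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the rows from 0
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

print(assigned)
```

Because this builds a new DataFrame instead of modifying `sample` in place, the original data stays available for comparison.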


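Now we only need to list the neighborhoods that share a postal code on one line, separated by commas. On this page they are already on the same row, separated by " / ", so replacing the separator is enough.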

In [90]:
df["Neighborhood"] = df["Neighborhood"].str.replace(" / ", ", ")
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
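
If the source table instead listed one neighborhood per row, with the postal code repeated (an assumption; that is not the case in the output above), replacing a separator would not be enough. One possible approach is to group the rows and join the names, sketched here on a hypothetical sample (`sample` and `merged` are illustrative names).

```python
import pandas as pd

# Hypothetical layout: the same postal code appears on several rows
sample = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Join the neighborhoods of each (postal code, borough) pair with commas;
# sort=False keeps the groups in order of first appearance
merged = (
    sample.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(merged)
```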


In [121]:
shape = df.shape[0]
print('The number of rows of this DataFrame is {}.'.format(shape))

The number of rows of this DataFrame is 103.


Now on to the coordinates. Import the .csv file and read it as a DataFrame with <code>pd.read_csv</code>.

In [124]:
dfc = pd.read_csv('https://cocl.us/Geospatial_data')
dfc.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


To check that the first DataFrame was populated correctly, confirm that the coordinates DataFrame has the same number of rows.

In [126]:
shape2 = dfc.shape[0]
print('The number of rows of this second DataFrame of coordinates is still {}.'.format(shape2))

The number of rows of this second DataFrame of coordinates is still 103.
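
Equal row counts are a useful first check, but they do not guarantee that both DataFrames contain the same postal codes. A stricter check compares the two sets of codes, as in the sketch below. The small `neighborhoods` and `coordinates` frames are hypothetical stand-ins for `df` and `dfc`, with the coordinate values copied from the output above.

```python
import pandas as pd

# Hypothetical stand-ins for df and dfc
neighborhoods = pd.DataFrame({"Postal Code": ["M1E", "M1B", "M1C"]})
coordinates = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

codes_a = set(neighborhoods["Postal Code"])
codes_b = set(coordinates["Postal Code"])

# Codes present in one DataFrame but not in the other
missing_coordinates = codes_a - codes_b
missing_neighborhoods = codes_b - codes_a

same_codes = not missing_coordinates and not missing_neighborhoods
print("Postal codes match:", same_codes)
```

Row order does not matter for this comparison, since sets ignore ordering.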
