# IBM Applied Data Science Capstone

This notebook will be used as the Week 3 Peer Assignment deliverable. 

### Importing Libraries

First, let's install the libraries we need.

In [3]:
!pip install geopy
print('Geopy installed!')

Collecting geopy
  Downloading geopy-2.0.0-py3-none-any.whl (111 kB)
Collecting geographiclib<2,>=1.49
  Downloading geographiclib-1.50-py3-none-any.whl (38 kB)
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-2.0.0
Geopy installed!


In [None]:
!pip install pgeocode
print('PGeocode installed!')

In [4]:
!pip install folium
print('Folium installed!')

Collecting folium
  Downloading folium-0.11.0-py2.py3-none-any.whl (93 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.1-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0
Folium installed!


In [5]:
!pip install beautifulsoup4
print('BeautifulSoup installed!')

BeautifulSoup installed!


In [6]:
!pip install lxml
print('lxml installed!')



Now, it's time to import the libraries

In [7]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import pgeocode # convert a postal code into latitude and longitude values

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library
from bs4 import BeautifulSoup # web scraping library

print('Libraries imported.')

Libraries imported.


## 1. Web Scraping

We need a list of Toronto's neighborhoods. Since no clean dataset is readily available, we have to collect the data ourselves.

Wikipedia has such a list, so we'll scrape it using the BeautifulSoup library.

In [9]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url)

Create a BeautifulSoup object with the webpage HTML

In [16]:
bs = BeautifulSoup(source.content, 'lxml')

Find the data we need

In [22]:
table = bs.find('table')
table_rows = table.find_all('tr')

Transform that data into a more pandas-friendly format

In [23]:
l = []
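for tr in table_rows:
    td = tr.find_all('td')  # all data cells in this table row
    # use a separate name for the cell so the loop variable `tr` is not shadowed
    row = [cell.text for cell in td]
    l.append(row)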
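
One detail worth knowing about this loop: a header row built from `<th>` cells contains no `<td>` cells, so its `row` list comes out empty. The following is a minimal sketch on a hand-written HTML snippet, not the real Wikipedia page; the names `html`, `soup`, `rows` and `data_rows` are illustrative assumptions, and the built-in `'html.parser'` backend is used here only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (illustrative only).
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
for tr in soup.find('table').find_all('tr'):
    cells = tr.find_all('td')  # header rows use <th>, so this list is empty for them
    rows.append([cell.text for cell in cells])

print(rows[0])  # the header row yields an empty list
print(rows[1])  # data cells keep their trailing newline

# Optionally drop empty rows at scrape time instead of cleaning them up later.
data_rows = [row for row in rows if row]
```

Filtering out empty rows while scraping is one option; removing them afterwards in pandas works just as well.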

Create a DataFrame using the list we just created

In [120]:
df = pd.DataFrame(l, columns=['Postal Code','Borough','Neighborhood'])
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,,,
1,M1A\n,Not assigned\n,Not assigned\n
2,M2A\n,Not assigned\n,Not assigned\n
3,M3A\n,North York\n,Parkwoods\n
4,M4A\n,North York\n,Victoria Village\n


It seems the data is all there, but every value also picked up an unwanted trailing '\n'.

This means we need to clean our data. The fastest way to do that is to use a regex.

In [121]:
df = df.replace(r'\n','', regex=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


We've successfully removed the '\n'. Nice!
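
The regex removes every newline, wherever it occurs in a cell. If only leading and trailing whitespace should go, a per-column `str.strip()` is a possible alternative. The sketch below compares the two on a small made-up frame; `sample`, `by_regex` and `by_strip` are hypothetical names, and the stray leading space in the second row is added purely for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'Postal Code': ['M3A\n', 'M4A\n'],
    'Borough': ['North York\n', ' North York\n'],
})

# Approach used above: delete every newline character in every cell.
by_regex = sample.replace(r'\n', '', regex=True)

# Alternative: strip whitespace (spaces, tabs, newlines) from both ends of each string.
by_strip = sample.apply(lambda col: col.str.strip())

print(by_regex['Borough'].tolist())  # the leading space in the second row survives
print(by_strip['Borough'].tolist())  # both rows are fully trimmed
```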

Now, it's time to remove that first row. It comes from the table's header, which has no `<td>` cells, so all of its columns are filled with 'None'.

In [122]:
df = df.iloc[1:]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Nice work! Now, let's look at the values in our data.

'Not assigned' doesn't mean much to us, does it? Let's remove the rows where the borough is not assigned.

In [123]:
df = df[df['Borough'] != 'Not assigned']
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
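
The filter above is a boolean mask: the comparison produces one True/False value per row, and indexing with it keeps only the True rows. A minimal sketch on made-up data follows; `sample`, `mask` and `assigned` are hypothetical names, and the explicit `.copy()` is an optional precaution (an assumption about how the result will be used later), not something the notebook does.

```python
import pandas as pd

sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
})

mask = sample['Borough'] != 'Not assigned'  # boolean Series, one entry per row
assigned = sample[mask].copy()              # explicit copy, independent of `sample`

print(mask.tolist())
print(assigned.index.tolist())  # surviving rows keep their original index labels
```

Note that the filtered rows keep their original index labels, which is why the index no longer starts at 0 after filtering.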


Our dataset is looking good. One thing we should look out for is duplicate postal codes.

To verify that, let's compare the length of the column with the number of distinct Postal Codes.

In [124]:
print('Column length is {}'.format(len(df['Postal Code'])))
print('Number of distinct values is {}'.format(df['Postal Code'].nunique()))

Column length is 103
Number of distinct values is 103


It's a match! There's no problem after all.
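
Comparing the two counts tells us whether duplicates exist, but not which values repeat. As a small sketch on made-up codes (the series `codes` is hypothetical), pandas can answer both questions directly with `is_unique` and `duplicated`:

```python
import pandas as pd

codes = pd.Series(['M3A', 'M4A', 'M5A', 'M4A'], name='Postal Code')

# True only when no value appears more than once.
print(codes.is_unique)

# keep=False marks every occurrence of a repeated value, not just the later ones.
repeated = codes[codes.duplicated(keep=False)]
print(repeated.tolist())
```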

Finally, let's reset the index and check the size of our data.

In [125]:
df.reset_index(inplace=True,drop=True)
df.shape

(103, 3)

## 2. Get geospatial coordinates

In order to use the Foursquare API, we will need the latitude and longitude of each of our neighborhoods.

For that, we'll be using the pgeocode Python package.

In [88]:
import pgeocode

Set the country to Canada

In [126]:
nomi = pgeocode.Nominatim('ca')

Let's take a sample to analyse the data structure

In [152]:
location = nomi.query_postal_code('M5G')
location

postal_code                                         M5G
country code                                         CA
place_name        Downtown Toronto (Central Bay Street)
state_name                                      Ontario
state_code                                           ON
county_name                                     Toronto
county_code                                 8.13339e+06
community_name                                      NaN
community_code                                      NaN
latitude                                        43.6564
longitude                                       -79.386
accuracy                                              6
Name: 0, dtype: object

We are only interested in the latitude and longitude for now.

To avoid repeating the lookups, let's store the results in a separate dataframe.

In [127]:
df_ll = df.apply(lambda row: nomi.query_postal_code(row['Postal Code']), axis=1)
df_ll.head()

Unnamed: 0,postal_code,country code,place_name,state_name,state_code,county_name,county_code,community_name,community_code,latitude,longitude,accuracy
0,M3A,CA,North York (York Heights / Victoria Village / ...,Ontario,ON,North York,,,,43.7545,-79.33,1.0
1,M4A,CA,North York (Sweeney Park / Wigmore Park),Ontario,ON,,,,,43.7276,-79.3148,6.0
2,M5A,CA,Downtown Toronto (Regent Park / Port of Toronto),Ontario,ON,Toronto,8133394.0,,,43.6555,-79.3626,6.0
3,M6A,CA,North York (Lawrence Manor / Lawrence Heights),Ontario,ON,North York,,,,43.7223,-79.4504,6.0
4,M7A,CA,Queen's Park Ontario Provincial Government,Ontario,ON,,,,,43.6641,-79.3889,


Now, we should add those fields to our dataframe

In [128]:
df[['Latitude', 'Longitude']] = df_ll[['latitude','longitude']]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
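
Column assignment in pandas matches rows by index label rather than by position. Here it works because `df_ll` was produced from `df` with `apply`, so both share the same index. A small sketch with made-up frames (`coords`, `aligned` and `shifted` are hypothetical names) shows what happens when the labels do and do not line up:

```python
import pandas as pd

coords = pd.DataFrame({'latitude': [43.7545, 43.7276]})  # index labels 0, 1

# Same index labels: values land on the matching rows.
aligned = pd.DataFrame({'Postal Code': ['M3A', 'M4A']})  # index labels 0, 1
aligned['Latitude'] = coords['latitude']

# Different index labels: nothing matches, so the new column is all NaN.
shifted = pd.DataFrame({'Postal Code': ['M3A', 'M4A']}, index=[3, 4])
shifted['Latitude'] = coords['latitude']

print(aligned['Latitude'].tolist())
print(shifted['Latitude'].isna().all())
```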


Second task complete! Move on to visualization.

## 3. Exploring Neighborhoods in Toronto

For the final part, let's create a map clustering the neighborhoods of Toronto.

First, we should select only the boroughs that relate to Toronto. Postal codes beginning with 'M' are the ones located in Toronto.

Also, remove any row with value NaN in Latitude or Longitude.

In [156]:
df_to = df[df['Postal Code'].str.startswith('M')]
# drop rows where either coordinate is missing
df_to = df_to.dropna(subset=['Latitude', 'Longitude'])
print('New dataframe size is: {}'.format(df_to.shape))
df_to.head()

New dataframe size is: (102, 5)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889


Next, get Toronto's location on the map using geopy

In [145]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent = 'ca_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Now, it's time to visualize Toronto and its neighborhoods

In [161]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, label in zip(df_to['Latitude'], df_to['Longitude'], df_to['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

In [162]:
map_toronto