# Applied Data Science Location Data Project
# Segmenting and Clustering Neighborhoods in Toronto
 
## This workbook builds a model for the location data project, part of the Applied Data Science Capstone course.

First import the libraries and dependencies needed for this project.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (on pandas >= 1.0, use `from pandas import json_normalize`)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium
import folium # map rendering library

!pip install lxml html5lib beautifulsoup4

print('Libraries imported.')

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/64/28/0b761b64ecbd63d272ed0e7a6ae6e4402fc37886b59181bfdf274424d693/lxml-4.6.1-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/d1/41/e6495bd7d3781cee623ce23ea6ac73282a373088fcd0ddc809a047b18eae/beautifulsoup4-4.9.3-py3-none-any.whl (115kB)
Collecting soupsieve>1.2; python_version >= "3.0" (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/6f/8f/457f4a5390eeae1cc3aeab89deb7724c965be841ffca6cfca9197482e470/soupsieve-2.0.1-py3-none-any.whl
Installing collected packages: lxml, soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.3 lxml-4.6.1 soupsieve-2.0.1
Libraries imported.


Next, we need the postal codes for Toronto, which are the codes that start with the letter M. These are listed on a Wikipedia page and will help sort neighborhoods by borough.
Pandas' read_html function scans the webpage for tabular data and converts each table it finds into a dataframe, returning them as a list. The postal code table is the first element of that list.
(Code thanks to Krishnakanth Allika.)

In [2]:
df_postal=pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
df_postal.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
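
The positional `[0]` lookup above depends on the page layout. As a minimal sketch, assuming a page may contain more than one table, the `match` argument of `read_html` can select a table by the text it contains. The HTML string below is made up for illustration and stands in for the Wikipedia page; `SAMPLE_HTML`, `all_tables`, `matching`, and `sample_postal` are hypothetical names.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page standing in for the Wikipedia article (made-up markup).
SAMPLE_HTML = """
<table>
  <tr><th>Province</th><th>Prefix</th></tr>
  <tr><td>Ontario (Toronto)</td><td>M</td></tr>
</table>
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# Without `match`, every table on the page is returned, in page order.
all_tables = pd.read_html(StringIO(SAMPLE_HTML))

# With `match`, only tables containing the given text are returned.
matching = pd.read_html(StringIO(SAMPLE_HTML), match="Postal Code")
sample_postal = matching[0]

print(len(all_tables), len(matching))
print(sample_postal)
```

Selecting by content rather than position keeps the lookup stable if other tables are added to the page ahead of the one we want.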


Cleaning and reformatting the postal code dataframe

In [7]:
# Remove codes where the Borough is "Not assigned".
# .copy() makes an independent dataframe, which avoids a SettingWithCopyWarning on the assignment below.
df_postal = df_postal[df_postal.Borough != "Not assigned"].copy()

# No need to combine neighbourhoods. The Wiki table has already made the necessary adjustments.

# Where the Neighbourhood is "Not assigned", use the Borough name for the neighbourhood.
df_postal['Neighbourhood'] = np.where(df_postal['Neighbourhood'] == 'Not assigned',
                                      df_postal['Borough'], df_postal['Neighbourhood'])

# Reset the index. The old row labels are kept in a new "index" column.
df_postal = df_postal.reset_index()

# Check the format
df_postal.head()

|   | index | Postal Code | Borough | Neighbourhood |
|---|---|---|---|---|
| 0 | 2 | M3A | North York | Parkwoods |
| 1 | 3 | M4A | North York | Victoria Village |
| 2 | 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
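
The cleaning steps above can be tried on a small hand-made dataframe. This is a sketch with made-up rows, not the Wikipedia data; `sample` and `cleaned` are hypothetical names. It makes two choices worth noting: it calls `.copy()` after filtering, so the later column assignment works on an independent dataframe, and it passes `drop=True` to `reset_index`, which discards the old row labels instead of keeping them in an extra `index` column.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the layout of the postal code table.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M7A", "M9A"],
    "Borough": ["Not assigned", "North York", "Queen's Park", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned", "Islington Avenue"],
})

# Step 1: drop rows without a borough; .copy() yields an independent dataframe.
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# Step 2: fall back to the borough name where the neighbourhood is missing.
cleaned["Neighbourhood"] = np.where(
    cleaned["Neighbourhood"] == "Not assigned",
    cleaned["Borough"],
    cleaned["Neighbourhood"],
)

# Step 3: renumber rows from 0 without keeping the old labels as a column.
cleaned = cleaned.reset_index(drop=True)

print(cleaned)
```

Whether to keep the old labels is a judgment call: they are harmless, but an extra `index` column has to be ignored or dropped in later steps.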


In [8]:
print('The dataframe has {} rows'.format(df_postal.shape[0]))

The dataframe has 103 rows
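
The row count is one sanity check. A few more can be sketched, assuming the column names shown in the table above: every code should start with M, codes should not repeat, and no "Not assigned" borough should remain. `check_postal_table` is a hypothetical helper, and the `toy` rows are made up for illustration.

```python
import pandas as pd


def check_postal_table(df: pd.DataFrame) -> dict:
    """Summarise basic sanity checks for a cleaned postal code table."""
    codes = df["Postal Code"]
    return {
        "rows": len(df),
        "all_start_with_M": bool(codes.str.startswith("M").all()),
        "unique_codes": bool(codes.is_unique),
        "unassigned_boroughs": int((df["Borough"] == "Not assigned").sum()),
    }


# Made-up rows in the same layout as the cleaned dataframe.
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

report = check_postal_table(toy)
print(report)
```

Calling `check_postal_table(df_postal)` on the notebook's dataframe would run the same checks on the real data.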
