<H1> Segmenting and Clustering Neighborhoods in Toronto </H1>

<H2> #1: Scrape Wikipedia Page </H2> <br>
Use a notebook to write code that scrapes the table from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.

In [1]:
import numpy as np 
import pandas as pd 
import requests
import matplotlib as mpl
from sklearn.cluster import KMeans
from bs4 import BeautifulSoup

In [2]:
from urllib.request import urlopen

In [3]:
pc_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
result = requests.get(pc_url).text

In [4]:
soup = BeautifulSoup(result, 'html.parser')  # the page is HTML, so use an HTML parser rather than 'xml'

Prettify the soup to see which part to extract:
# print(soup.prettify())

In [5]:
table=soup.find('table')

In [6]:
column_names = ['Postalcode','Borough','Neighborhood']
df_raw = pd.DataFrame(columns = column_names)

In [7]:
for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==3:
        df_raw.loc[len(df_raw)] = row_data
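
The row-by-row loop above can also be written so that the rows are collected in a plain list and the dataframe is built once at the end, which avoids growing it with `.loc` on every iteration. A minimal sketch, assuming a small inline HTML snippet (`sample_html`, a hypothetical stand-in) in place of the live Wikipedia page:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real table is much longer.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table')

rows = []
for tr in sample_table.find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    # The header row contains only <th> cells, so it yields no <td> and is skipped.
    if len(cells) == 3:
        rows.append(cells)

# Build the dataframe in one step from the collected rows.
df_sample = pd.DataFrame(rows, columns=['Postalcode', 'Borough', 'Neighborhood'])
print(df_sample)
```

Whether this matters in practice depends on table size; for a table of a few hundred rows either approach should be fine.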

<b> Reviewing Table </b>

In [8]:
df_raw.head(10)

,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Queen's Park,Not assigned


<br><br><b> Data Cleaning </b>

1. Remove rows whose Borough is not assigned.
2. If a Neighborhood is not assigned, set it to the same value as the Borough.

<br>

In [9]:
# Copy the filtered rows so the assignment below modifies an independent dataframe
df_clean1 = df_raw[df_raw['Borough'] != 'Not assigned'].copy()
# Where the Neighborhood is not assigned, fall back to the Borough name
df_clean1.loc[df_clean1['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df_clean1['Borough']
df_clean1.head(10)


,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


<br><br> Where several rows share the same Postalcode, combine their Neighborhoods into a single comma-separated row.

In [10]:
df_clean2 = df_clean1.groupby(['Postalcode', 'Borough'], sort=False).agg(', '.join)
df_clean2.head()

Postalcode,Borough,Neighborhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M6A,North York,"Lawrence Heights, Lawrence Manor"
M7A,Downtown Toronto,Queen's Park


In [11]:
df_result=df_clean2.reset_index()
df_result.head(10)

,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Queen's Park,Queen's Park
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
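
The grouping and the index reset can likely be combined into one call. A minimal sketch on a toy frame (`toy` and `toy_grouped` are hypothetical names), assuming `as_index=False` is acceptable in place of a separate `reset_index()`:

```python
import pandas as pd

# Three rows modelled on the cleaned table: M6A appears twice.
toy = pd.DataFrame({
    'Postalcode': ['M6A', 'M6A', 'M7A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Lawrence Heights', 'Lawrence Manor', "Queen's Park"],
})

# as_index=False keeps the grouping keys as ordinary columns,
# and sort=False preserves the order in which postal codes first appear.
toy_grouped = (
    toy.groupby(['Postalcode', 'Borough'], sort=False, as_index=False)
       .agg({'Neighborhood': ', '.join})
)
print(toy_grouped)
```

Naming the aggregated column explicitly also guards against accidentally joining any other string columns that might be added later.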


In [12]:
df_result.shape

(103, 3)

<br><br><br>
<h2> #2: Find the latitude and the longitude coordinates of each neighborhood. </h2>

Now that we have built a dataframe with the postal code, borough name, and neighborhood name of each neighborhood, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data.

We will use the Geocoder Python package to find the coordinates: https://geocoder.readthedocs.io/index.html.

This package can be very unreliable. If you are not able to get the geographical coordinates of the neighborhoods with Geocoder, here is a link to a CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

Use the Geocoder package or the csv file to create a dataframe.
<br><br>
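
If the CSV route is taken, the coordinates can be attached to the neighborhood dataframe with a merge on the postal code. A minimal sketch, assuming the file has columns named `Postal Code`, `Latitude`, and `Longitude` (an assumption to check against the actual file) and using an inline stand-in with illustrative values rather than the real download:

```python
import io
import pandas as pd

# Illustrative stand-in for the CSV; the column names and coordinate values
# are assumptions, not taken from the actual file.
csv_text = """Postal Code,Latitude,Longitude
M3A,43.75,-79.33
M4A,43.73,-79.32
"""
coords = pd.read_csv(io.StringIO(csv_text))

# Two rows modelled on the cleaned neighborhood table.
neighborhoods = pd.DataFrame({
    'Postalcode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})

# Align the key column name, then attach coordinates by postal code.
coords = coords.rename(columns={'Postal Code': 'Postalcode'})
merged = neighborhoods.merge(coords, on='Postalcode', how='left')
print(merged)
```

A left merge keeps every neighborhood row even when a postal code has no match, so missing coordinates show up as `NaN` instead of silently dropping rows.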

In [13]:
! pip install --upgrade geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [None]:
import geocoder

In [None]:
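# Postal code to look up; the first row of df_result is used as an example
postal_code = df_result.loc[0, 'Postalcode']
lat_lng_coords = None

# Retry until the geocoder returns coordinates.
# Note: this loops forever if the provider never answers (for example, without a valid API key).
while lat_lng_coords is None:
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]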
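
Because the geocoding service may never return a result, an unbounded `while` loop can hang the notebook. A minimal sketch of a bounded retry, assuming a hypothetical helper `geocode_with_retry` that takes any lookup callable; the `fake_lookup` stand-in and its coordinates are for demonstration only and do not call the Geocoder package:

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is any callable returning [lat, lng] or None. With the Geocoder
    package it could wrap a provider call and return its .latlng value.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay)  # pause before trying again
    return None  # give up after max_attempts failures

# Stand-in lookup that fails twice and then succeeds (illustrative values).
attempts = {'count': 0}

def fake_lookup(query):
    attempts['count'] += 1
    return None if attempts['count'] < 3 else [43.75, -79.33]

coords_found = geocode_with_retry(fake_lookup, 'M3A, Toronto, Ontario')
print(coords_found, attempts['count'])
```

With a real provider, a non-zero `delay` is probably worth setting to avoid hammering the service, and a `None` result is the cue to fall back to the CSV file.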