## Segmenting and Clustering Neighborhoods in Toronto | Part 1+2

__Part 1__ of this notebook will create a pandas dataframe with Toronto's postal codes, boroughs, and neighborhoods.<br>
__Part 2__ of this notebook will get the latitude and the longitude coordinates for each of Toronto's neighborhoods.<br>

### __Part 1:__ Create dataframe of Toronto's neighborhoods

First, I scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
to extract its table of postal codes and load the data into a pandas dataframe.

Note: I could have scraped the table with BeautifulSoup (that code is kept in the HTML comment below), but pandas' read_html function is a much easier solution (see the next cell).

<!-- 
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = requests.get(url).text

soup = BeautifulSoup(html, 'html.parser')
#print(soup.prettify())
table = soup.find(lambda tag: tag.name=='table')
df = pd.read_html(str(table))[0]
df.rename(columns={'Postcode':'PostalCode', 'Neighbourhood':'Neighborhood'}, inplace=True)
df.head()
-->

In [1]:
# Get the table with pandas 'read_html' method
import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df.rename(columns={'Postcode':'PostalCode', 'Neighbourhood':'Neighborhood'}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Next, I clean the dataframe as instructed.

__1)__ Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [2]:
df.Borough.value_counts()  # 77 rows have a borough that is not assigned
print('Not assigned boroughs:',(df.Borough=='Not assigned').sum())

df.drop(df[df.Borough=='Not assigned'].index, inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

Not assigned boroughs: 77


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
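
The drop-by-index step above can also be written as a boolean mask, which avoids mutating the dataframe in place. This is only a minimal sketch: the `sample` dataframe is a hypothetical stand-in built from the first rows of the scraped table, and `assigned` is a name chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table (first five rows shown earlier)
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A', 'M5A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York',
                'North York', 'Downtown Toronto'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods',
                     'Victoria Village', 'Harbourfront'],
})

# Keep only rows whose borough is assigned, then renumber the index from 0
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```

Because the filter returns a new dataframe, the original `sample` stays available for checks such as counting how many rows were removed.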


__2)__ If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [3]:
i_na = df[df.Neighborhood=='Not assigned'].index  # only row 6 has an unassigned Neighborhood

df.loc[i_na,'Neighborhood'] = df.loc[i_na,'Borough']
df.loc[i_na,:]

Unnamed: 0,PostalCode,Borough,Neighborhood
6,M7A,Queen's Park,Queen's Park
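
The same replacement can be expressed without collecting the index first, using `Series.mask`, which swaps in another value wherever a condition is true. This is a sketch on a hypothetical two-row `sample` dataframe, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical two-row example: one assigned and one unassigned neighborhood
sample = pd.DataFrame({
    'PostalCode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Lawrence Heights', 'Not assigned'],
})

# Where the neighborhood is 'Not assigned', take the borough name instead
unassigned = sample['Neighborhood'] == 'Not assigned'
sample['Neighborhood'] = sample['Neighborhood'].mask(unassigned, sample['Borough'])
print(sample)
```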


__3)__ More than one neighborhood can exist in one postal code area. Rows that share a postal code will be combined into one row, with the neighborhoods separated by commas.

In [4]:
# loop over unique postal codes and join all boroughs and neighborhoods for each postal code in a new dataframe
pcodes = df.PostalCode.unique()
df_Tor = pd.DataFrame(columns=df.columns)
 
for i,p in enumerate(pcodes):
    df_Tor.loc[i,'PostalCode']=p
    df_Tor.loc[i,'Borough']= ', '.join(df[df.PostalCode==p].Borough.unique())
    df_Tor.loc[i,'Neighborhood']= ', '.join(df[df.PostalCode==p].Neighborhood.unique())

df_Tor.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
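
The explicit loop works, but the same result can probably be reached more compactly with `groupby`. The sketch below assumes a small hypothetical `sample` dataframe and a helper named `join_unique`, both introduced here for illustration only. Passing `sort=False` keeps the postal codes in their first-seen order, as the loop does.

```python
import pandas as pd

# Hypothetical rows with repeated postal codes
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M5A', 'M6A', 'M6A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto',
                'North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Harbourfront', 'Regent Park',
                     'Lawrence Heights', 'Lawrence Manor'],
})

def join_unique(values):
    """Join the distinct values of one group, keeping first-seen order."""
    return ', '.join(values.unique())

# One row per postal code; boroughs and neighborhoods are comma-joined
grouped = (
    sample.groupby('PostalCode', sort=False)
          .agg({'Borough': join_unique, 'Neighborhood': join_unique})
          .reset_index()
)
print(grouped)
```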


__4)__ Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.<br> _Done_

__5)__ In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [5]:
df_Tor.shape

(103, 3)


### __Part 2:__ Get coordinates of Toronto's neighborhoods

First, I tried getting the coordinates with geocoder, but it did not work because Google denied the request (my attempt is kept in the HTML comment below). Hence, I used the CSV file downloaded in the next cell to get the coordinates.

<!-- 
#!conda install -c conda-forge geocoder --yes  # uncomment this line if you haven't installed geocoder yet
import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None
postal_code = 'M5G'
i = 0  # to make sure the test loop ends at some point in case no result can be obtained

# loop until you get the coordinates
while(lat_lng_coords is None) and (i<=20):
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng
    print(i,':',g)
    i=i+1

if lat_lng_coords is not None:
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]

    print(latitude)
    print(longitude)
-->

In [6]:
!wget -q -O 'Toronto_Lat_Lng.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


Now load the data into a pandas dataframe.

In [7]:
df_LL = pd.read_csv('Toronto_Lat_Lng.csv')
df_LL.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df_LL.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [8]:
# Before merging both data frames by PostalCode, check if they contain the same PostalCodes 
list(df_LL.PostalCode.sort_values()) == list(df_Tor.PostalCode.sort_values())

True
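
The list comparison above only answers yes or no. If it ever returned False, a set difference would show which postal codes are responsible. This is a sketch with hypothetical, deliberately mismatched codes; `tor_codes` and `ll_codes` are stand-ins for the two PostalCode columns.

```python
import pandas as pd

# Hypothetical postal code columns that disagree on one entry each
tor_codes = pd.Series(['M3A', 'M4A', 'M5A'], name='PostalCode')
ll_codes = pd.Series(['M1B', 'M3A', 'M4A'], name='PostalCode')

# Codes that have neighborhoods but no coordinates, and the reverse
missing_coords = sorted(set(tor_codes) - set(ll_codes))
missing_neighborhoods = sorted(set(ll_codes) - set(tor_codes))

print('No coordinates for:', missing_coords)
print('No neighborhood for:', missing_neighborhoods)
```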

In [9]:
# Merge both data frames by PostalCode
df_TorLL = pd.merge(df_Tor, df_LL, on='PostalCode')
df_TorLL.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
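
As an extra safeguard, `pd.merge` accepts a `validate` argument that raises an error if the key is not unique on both sides. The sketch below uses three rows taken from the outputs above. The frame names `boroughs` and `coords` are hypothetical, and the left join is my choice for illustration rather than what the notebook does.

```python
import pandas as pd

# Three rows copied from the outputs above
boroughs = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
coords = pd.DataFrame({
    'PostalCode': ['M5A', 'M3A', 'M4A'],
    'Latitude': [43.65426, 43.753259, 43.725882],
    'Longitude': [-79.360636, -79.329656, -79.315572],
})

# validate='one_to_one' raises MergeError if a postal code repeats on either side;
# how='left' keeps every borough row, in the left frame's order
merged = pd.merge(boroughs, coords, on='PostalCode',
                  how='left', validate='one_to_one')
print(merged)
```

With a left join, any postal code that lacks coordinates would show up as NaN in Latitude and Longitude instead of silently dropping out of the result.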
