## Notebook for the week 3 project of the Capstone Project

### First step: create a dataframe using the table found at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M for Toronto postal codes

### Use pandas to read all tables from the webpage (i.e. the easy way to do this!)

In [5]:
import pandas as pd
wiki_page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
tables = pd.read_html(wiki_page)
df = tables[0]
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
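
Indexing `tables[0]` assumes the postal code table is the first one on the page, which may stop being true if the article layout changes. Below is a minimal sketch of a more defensive selection, assuming the column names shown above. `pick_table` is a hypothetical helper, and the small frames stand in for the list that `pd.read_html` returns.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame that contains every required column."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError('No table with columns: {}'.format(required_columns))

# Stand-ins for the list returned by pd.read_html(wiki_page)
tables = [
    pd.DataFrame({'Note': ['navigation box']}),
    pd.DataFrame({'Postal Code': ['M1A', 'M3A'],
                  'Borough': ['Not assigned', 'North York'],
                  'Neighborhood': ['Not assigned', 'Parkwoods']}),
]

picked = pick_table(tables, ['Postal Code', 'Borough', 'Neighborhood'])
print(picked)
```

In the notebook this would be called as `pick_table(pd.read_html(wiki_page), [...])` in place of `tables[0]`.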


### Remove "Not assigned" Boroughs (as per instructions)

In [6]:
df = df[df.Borough != 'Not assigned']
df

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


### Check to see if any neighborhoods are still "Not assigned" after removing "Not assigned" boroughs

In [7]:
df[df.Neighborhood=='Not assigned']

Unnamed: 0,Postal Code,Borough,Neighborhood


Nope!  Looks like we don't have anything to fix there.
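
Had any such rows turned up, one reasonable fix would be to fall back to the borough name. That is an assumption about the desired behavior, not something this notebook needed. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration; the real table had no such cases
sample = pd.DataFrame({
    'Postal Code': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Regent Park, Harbourfront', 'Not assigned'],
})

# Where the neighborhood is missing, reuse the borough name
missing = sample['Neighborhood'] == 'Not assigned'
sample.loc[missing, 'Neighborhood'] = sample.loc[missing, 'Borough']
print(sample)
```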

### Check to see if there are any duplicate postal codes remaining (multiple rows with same postal code)

In [8]:
print('# of rows is now: ', df.shape[0])
print('# of rows when looking at *unique values* of postal code: ', df['Postal Code'].value_counts().shape[0])
print("Since these match, we have no duplicate Postal Code Rows!")


# of rows is now:  103
# of rows when looking at *unique values* of postal code:  103
Since these match, we have no duplicate Postal Code Rows!
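
If the counts had differed, the duplicates would need to be found and combined. Below is a sketch of one way to do that, assuming duplicated postal codes should collapse into a single row with the neighborhood names joined by commas. The rows are made up for illustration.

```python
import pandas as pd

# Made-up rows: 'M5A' appears twice on purpose
sample = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Lawrence Manor'],
})

# Flag every row whose postal code appears more than once
dupes = sample[sample['Postal Code'].duplicated(keep=False)]

# Collapse to one row per postal code, joining the neighborhood names
combined = (sample
            .groupby(['Postal Code', 'Borough'])['Neighborhood']
            .agg(', '.join)
            .reset_index())
print(combined)
```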


In [9]:
print('Last Step: Shape of df is: ', df.shape)

Last Step: Shape of df is:  (103, 3)


## Step 2: we need to get latitude and longitude values for each postal code

Couldn't get geocoder to work, so I'm using the CSV file available online instead.


In [10]:
df_ll = pd.read_csv('http://cocl.us/Geospatial_data')
df_ll.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


It looks like this is sorted by Postal Code, but let's re-sort it just to be sure.

We'll then also sort our original df by Postal Code too, so they match up.

In [11]:
df_ll = df_ll.sort_values(by='Postal Code')

df = df.sort_values(by='Postal Code')
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


Now add lat. and long. columns to original df.

In [12]:
# Join on 'Postal Code' so the result does not depend on row order or index alignment
df = df.merge(df_ll, on='Postal Code', how='left')
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
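
However the coordinates get attached, it is worth confirming that every postal code actually found a match. Below is a minimal sketch using `merge` with `indicator=True` on made-up frames. The code `'M9Z'` is invented to show what an unmatched row looks like.

```python
import pandas as pd

boroughs = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up code with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Etobicoke'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],          # deliberately in a different order
    'Latitude': [43.784535, 43.806686],
    'Longitude': [-79.160497, -79.194353],
})

# validate raises if a postal code repeats on either side;
# indicator adds a '_merge' column saying where each row came from
merged = boroughs.merge(coords, on='Postal Code', how='left',
                        validate='one_to_one', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every row received a latitude and longitude.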


## Step 3: map out clusters of neighborhoods in Toronto

#### First, let's import all the libraries and stuff we're going to need

In [13]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you need to install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas 1.0+)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you need to install folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


#### Just focus on boroughs that include "Toronto" in the name, and look up the latitude and longitude of Toronto

In [35]:
#Tried geolocator, but results ended up slightly off-center. 
#address = 'Toronto, ONT'
#geolocator = Nominatim(user_agent="ny_explorer")
#location = geolocator.geocode(address)
#latitude = location.latitude
#longitude = location.longitude
#print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

# latitude and longitude of Toronto, from a manual Google search instead
latitude = 43.6532
longitude = -79.3832

# create dataframe with just the Toronto boroughs
toronto_data = df[df['Borough'].str.contains('Toronto')]
toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
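
Instead of a hand-entered coordinate, the map could also be centered on the data itself. Below is a rough sketch that averages the coordinates, using three rows copied from the table above. The mean is only an approximate center and is pulled toward wherever postal codes are densest, so treat it as an alternative to try rather than a better answer.

```python
import pandas as pd

# Three rows copied from the toronto_data output above
sample = pd.DataFrame({
    'Latitude': [43.676357, 43.679557, 43.668999],
    'Longitude': [-79.293031, -79.352188, -79.315572],
})

# Average the coordinates to get an approximate map center
center = [sample['Latitude'].mean(), sample['Longitude'].mean()]
print(center)
```

With the full dataframe, `center` could be passed as the `location` argument of `folium.Map`.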


#### Now make the folium map, labelling all the neighborhoods

In [38]:
# create map of Toronto using latitude and longitude values
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(toronto_map)
    
toronto_map

TADA!  DONE!