# Applied Data Science Capstone

## IBM Data Science 

### By Stephen Ewing

This is my write-up for IBM's Applied Data Science Capstone.

## Get the SOUPs

First I'll use the BeautifulSoup package to scrape the table from this Wikipedia page: <https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M>

In [43]:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url, 'lxml')

## Make the PANDAs

This will snag all the table cells and put them together into a big ol' list. Then it'll get rid of the carriage returns. Finally, it will grab every 3rd item to make each of the 3 columns.

In [44]:
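# Collect the text of every unstyled <td> cell from every table on the page
list_thingy = []
for table in soup.find_all("table"):
    for td in table.find_all("td"):
        # skip cells that carry an inline style attribute
        if not td.attrs.get('style'):
            list_thingy.append(td.text)

# Keep only the text before the first newline in each cell
no_returns = [i.split('\n', 1)[0] for i in list_thingy]

# Every 3rd item, starting at offsets 0, 1 and 2, forms one column
col1 = no_returns[0::3]
del col1[-1]  # col1 picks up one extra trailing item; drop it so the lengths match
col2 = no_returns[1::3]
col3 = no_returns[2::3]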

import pandas as pd
zip_list = pd.DataFrame({'Postcode': col1, 'Borough': col2, 'Neighborhood': col3})
zip_list.head(11)

Unnamed: 0,Borough,Neighborhood,Postcode
0,Not assigned,Not assigned,M1A
1,Not assigned,Not assigned,M2A
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,Harbourfront,M5A
5,Downtown Toronto,Regent Park,M5A
6,North York,Lawrence Heights,M6A
7,North York,Lawrence Manor,M6A
8,Queen's Park,Not assigned,M7A
9,Not assigned,Not assigned,M8A
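
To make the every-3rd-item step concrete, here is a small sketch on a hand-made list. The sample values are illustrative rather than scraped, and it assumes the cells arrive in strict row order with exactly three cells per row. If the page layout changed, stride slicing like this would quietly misalign the columns.

```python
# Flat list of cell texts in row order, as the scraping loop would produce.
# These sample values are made up for illustration.
cells = [
    "M3A\n", "North York\n", "Parkwoods\n",
    "M4A\n", "North York\n", "Victoria Village\n",
    "M5A\n", "Downtown Toronto\n", "Harbourfront\n",
]

# Keep only the text before the first newline.
clean = [c.split("\n", 1)[0] for c in cells]

# Stride slicing: start at offset 0, 1 or 2 and step by 3.
postcodes = clean[0::3]
boroughs = clean[1::3]
neighborhoods = clean[2::3]

# Equivalent row-wise view: chunk the flat list into tuples of 3.
rows = list(zip(*[iter(clean)] * 3))

print(postcodes)
print(rows[0])
```

The row-wise view is handy for a quick sanity check that each postcode sits next to the borough and neighborhood you expect.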


## Cleaning Part 1

This gets rid of the unassigned boroughs.

In [45]:
no_unassined_boroughs = zip_list[zip_list['Borough'] != "Not assigned"]
no_unassined_boroughs.head(11)

Unnamed: 0,Borough,Neighborhood,Postcode
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,Harbourfront,M5A
5,Downtown Toronto,Regent Park,M5A
6,North York,Lawrence Heights,M6A
7,North York,Lawrence Manor,M6A
8,Queen's Park,Not assigned,M7A
10,Etobicoke,Islington Avenue,M9A
11,Scarborough,Rouge,M1B
12,Scarborough,Malvern,M1B
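
As a rough illustration of what the filter is doing, here is a sketch on a tiny stand-in dataframe (the three rows are hand-picked, not the full table). It also shows why the index in the output skips numbers: boolean indexing keeps the original row labels.

```python
import pandas as pd

# Tiny stand-in for zip_list; values are illustrative.
df = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Harbourfront"],
})

# The comparison yields a boolean Series aligned with the rows.
mask = df["Borough"] != "Not assigned"

# Indexing with the mask keeps only the rows where it is True.
# The original index labels survive, so gaps appear.
assigned = df[mask]
print(assigned.index.tolist())

# reset_index(drop=True) renumbers from 0 if the gaps are unwanted.
renumbered = assigned.reset_index(drop=True)
print(renumbered.index.tolist())
```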


## Cleaning Part 2

When the neighborhood doesn't have a name but the borough does, make the neighborhood name the borough name.

In [46]:
import numpy as np
no_unassined_neighborhoods = no_unassined_boroughs.copy()
no_unassined_neighborhoods['Neighborhood'] = np.where(no_unassined_boroughs['Neighborhood'] == 'Not assigned', no_unassined_boroughs.Borough, no_unassined_boroughs.Neighborhood)
no_unassined_neighborhoods.head(11)

Unnamed: 0,Borough,Neighborhood,Postcode
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,Harbourfront,M5A
5,Downtown Toronto,Regent Park,M5A
6,North York,Lawrence Heights,M6A
7,North York,Lawrence Manor,M6A
8,Queen's Park,Queen's Park,M7A
10,Etobicoke,Islington Avenue,M9A
11,Scarborough,Rouge,M1B
12,Scarborough,Malvern,M1B
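
For comparison, here is a minimal sketch of the same fallback on a two-row stand-in dataframe, once with `np.where` and once with the pandas `Series.mask` method. The sample rows are illustrative, and which style reads better is a matter of taste.

```python
import numpy as np
import pandas as pd

# Two illustrative rows: one with a missing neighborhood name, one without.
df = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

missing = df["Neighborhood"] == "Not assigned"

# np.where(condition, value_if_true, value_if_false), applied element-wise.
via_numpy = np.where(missing, df["Borough"], df["Neighborhood"])

# Series.mask replaces values where the condition is True.
via_mask = df["Neighborhood"].mask(missing, df["Borough"])

print(list(via_numpy))
print(via_mask.tolist())
```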


## Cleaning Part 3

This groups the neighborhoods that share a postcode into a single row, with their names separated by commas.

In [47]:
grouped_postcodes = no_unassined_neighborhoods.groupby(['Postcode', 'Borough'], as_index=False)['Neighborhood'].agg(', '.join)
grouped_postcodes.head(20)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [48]:
grouped_postcodes.shape

(103, 3)
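
Here is a small sketch of the grouping step on three hand-picked rows, to show what the aggregation produces. The rows are illustrative, not the full table.

```python
import pandas as pd

# Two rows share postcode M5A; one row stands alone.
df = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group on postcode and borough, then join each group's names with ", ".
grouped = (
    df.groupby(["Postcode", "Borough"], as_index=False)["Neighborhood"]
      .agg(", ".join)
)
print(grouped)
```

The group keys come back sorted, which is why the grouped table starts at M1B rather than M3A.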

## Adding on the Lat Lons

Here I'll read the given csv file into pandas using the provided URL.

In [49]:
import io
url = "https://cocl.us/Geospatial_data"
s = requests.get(url).content
latlon = pd.read_csv(io.StringIO(s.decode('utf-8')))
latlon.columns = ['Postcode', 'Latitude', 'Longitude']
latlon.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
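
The bytes-to-dataframe step can be tried offline with a sketch like the one below. The inline CSV stands in for the downloaded file, and the header name `Postal Code` is an assumption about that file rather than something shown above.

```python
import io
import pandas as pd

# Stand-in for requests.get(url).content; the header row is assumed.
raw = (
    b"Postal Code,Latitude,Longitude\n"
    b"M1B,43.806686,-79.194353\n"
    b"M1C,43.784535,-79.160497\n"
)

# Decode the bytes to text, then wrap the text in a file-like object.
df = pd.read_csv(io.StringIO(raw.decode("utf-8")))

# Renaming by name does not depend on the column order,
# unlike assigning a new list to df.columns.
df = df.rename(columns={"Postal Code": "Postcode"})
print(df.columns.tolist())
```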


Merge the lat lons onto the grouped_postcodes dataframe, matching rows on Postcode.

In [50]:
# Join on the shared Postcode column; passing left_index together with on raises a MergeError in current pandas
p_codes_lat_lon = pd.merge(grouped_postcodes, latlon, on='Postcode')
p_codes_lat_lon.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
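
One way to check that every postcode found a match is sketched below with two small stand-in dataframes. The postcode `M9Z` is made up to force a miss, and the `how`, `validate` and `indicator` arguments are extras that the merge above does not use.

```python
import pandas as pd

# Stand-ins for grouped_postcodes and latlon; M9Z is a made-up postcode.
left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
right = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every postcode from the left table,
# validate checks that keys are unique on both sides,
# indicator adds a _merge column saying where each row came from.
merged = pd.merge(left, right, on="Postcode", how="left",
                  validate="one_to_one", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

With the default inner join, a postcode missing from the coordinates file would simply drop out of the result, so a check like this can be worth a line or two.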


If we weren't given the geospatial data in a csv, we could potentially use the geocoder package as follows. It doesn't work for me, though.

In [51]:
#!conda install -c conda-forge geocoder --yes
#import geocoder # import geocoder

#postal_code = 'M5G'

# initialize your variable to None
#lat_lng_coords = None

# loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
#lat_lng_coords
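
The commented-out loop would spin forever if the service never returned coordinates. A bounded retry is one way around that. The sketch below uses a hypothetical `fake_lookup` function in place of a real geocoding call, so it runs offline. `fetch_with_retries` and its parameters are names I made up for illustration and are not part of the geocoder package.

```python
import time

def fetch_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        # Pause between attempts, but not after the last one.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None

# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
calls = {"count": 0}

def fake_lookup(query):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return [43.65, -79.38]  # made-up coordinates

coords = fetch_with_retries(fake_lookup, "M5G, Toronto, Ontario")
print(coords, calls["count"])
```

If every attempt fails, the function returns `None` and the caller can decide whether to skip that postcode or stop.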

## Explore and cluster the neighborhoods

### First load the required packages

In [52]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

### Get the lat/lon for Toronto

In [53]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="tc_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


### Create a map of Toronto with neighborhoods superimposed on top.

In [54]:
# create map of Toronto using latitude and longitude values
map_toronto_area = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lon, borough, neighborhood in zip(p_codes_lat_lon['Latitude'], p_codes_lat_lon['Longitude'], p_codes_lat_lon['Borough'], p_codes_lat_lon['Neighborhood']):
    label = 'Neighborhood: {}, Borough: {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_area)  
    
map_toronto_area

Next we'll filter down to just the boroughs whose names contain 'Toronto'.

In [56]:
toronto_boroughs = p_codes_lat_lon[p_codes_lat_lon['Borough'].str.contains('Toronto')]
toronto_boroughs.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
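
Here is a quick sketch of how the substring filter behaves, using a hand-made Series. The `regex=False` and `na=False` arguments are optional extras that the filter above does not pass, and the missing value is added only to show what `na=False` does.

```python
import pandas as pd

# Illustrative borough names, including one missing value.
boroughs = pd.Series(["East Toronto", "North York", "Central Toronto", None])

# regex=False treats the pattern as a literal substring.
# na=False makes missing values count as non-matches.
is_toronto = boroughs.str.contains("Toronto", regex=False, na=False)
print(is_toronto.tolist())

# The boolean Series can then be used as a row filter.
print(boroughs[is_toronto].tolist())
```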


In [57]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lon, borough, neighborhood in zip(toronto_boroughs['Latitude'], toronto_boroughs['Longitude'], toronto_boroughs['Borough'], toronto_boroughs['Neighborhood']):
    label = 'Neighborhood: {}, Borough: {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [59]:
#@ hidden

CLIENT_ID = 'UIFVJAC3EXI13L4C4SMAWPSROVO52RHTU5R1VPM4EFVKB4X3' # your Foursquare ID
CLIENT_SECRET = 'BDCXAB1IDWNJAEGUVT2WLYHPIITPODU1EZITVQCMIIMCHMCA' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: UIFVJAC3EXI13L4C4SMAWPSROVO52RHTU5R1VPM4EFVKB4X3
CLIENT_SECRET:BDCXAB1IDWNJAEGUVT2WLYHPIITPODU1EZITVQCMIIMCHMCA
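
A common alternative to typing credentials into a notebook cell is to read them from environment variables and print only a masked version. The sketch below is one way to do that. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `masked` are all assumptions for illustration, and the demo values are fake.

```python
import os

def load_credentials(env=os.environ):
    """Read API credentials from environment variables instead of source code."""
    # The variable names below are assumed, not defined by Foursquare.
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first."
        )
    return client_id, client_secret

def masked(value, visible=4):
    """Show only the first few characters so the output does not leak the value."""
    return value[:visible] + "*" * (len(value) - visible)

# Demo with a fake environment mapping so the sketch runs anywhere.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "abcd1234",
    "FOURSQUARE_CLIENT_SECRET": "wxyz5678",
}
client_id, client_secret = load_credentials(demo_env)
print("CLIENT_ID: " + masked(client_id))
print("CLIENT_SECRET: " + masked(client_secret))
```

In a real run you would call `load_credentials()` with no argument so it reads from the actual environment.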
