# Hi and welcome to my IBM Capstone Week 3 Assignment

I'm going to try to talk my way through my process as I go so that I can capture what I was doing and thinking at the time. These markdown cells describe what is going on in the code cells below them.

---

## Part 1.

Firstly I am going to need to install beautifulsoup4 as I haven't used this yet in my environment. 

In [1]:
#conda install -c conda-forge beautifulsoup4

Now I will need to import the required dependencies for the webscraping element of this assignment.

In [2]:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

print('Libraries imported.')

Libraries imported.


Ok, now I need to get the data from the Wikipedia page on the Toronto Postal codes. 

We have been given the website url so I will add that to a url variable. 
Then I will use requests to get the full HTML text of that URL and BeautifulSoup with the lxml parser to give it to me in a form I can use.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

Having looked at the website through the Chrome inspector I know that the data I need sits within a table with a class of 'wikitable sortable' so I will try and pull out the table. 

I'm going to loop through the rows of the table, which are the 'tr' tags. Within each 'tr' tag are the 'td' tags that contain the information we need. I will find all of those in each row and assign them to a list called cell. The 0th item of cell will be the postcode, the 1st the borough and the 2nd the neighborhood. (The neighborhood cell's text comes with extra whitespace, so I need to strip it.)

I then want to add the results into 3 lists, one for each column. However, I also need to concatenate the neighborhoods when the postal code is a repeat. I do this by comparing against the last postcode added to the list: if the current one matches, I append a comma and the new neighborhood to the last entry in the neighborhood list. This relies on repeated postal codes appearing on consecutive rows.

In [4]:
# Get the table from the webpage
table = soup.find('table', class_='wikitable sortable')
table = table.tbody

# initialise lists
pc_list = []
bor_list = []
hood_list = []

# loop through the rows in the table
for row in table.find_all('tr'):
    # get the row as an indexed list
    cell = row.find_all('td')
    # the first row is headers and doesn't have 'td' so skip this row
    if len(cell) > 0:
        # extract the cell contents to variables
        postcode = cell[0].find(string=True)
        borough = cell[1].find(string=True)
        neighborhood = cell[2].find(string=True).strip()
        
        # check that the list has some values, otherwise there is an index error
        if len(pc_list) < 1:
            pc_list.append(postcode)
            bor_list.append(borough)
            hood_list.append(neighborhood)
        # see if the postcode matches the last entered value in the list
        elif postcode == str(pc_list[-1]):
            # update just the neighborhood list with the added values
            hood_list[-1] = hood_list[-1] + ', ' + neighborhood
        # otherwise add a new row to all the lists
        else:
            pc_list.append(postcode)
            bor_list.append(borough)
            hood_list.append(neighborhood)
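
The loop above only merges a repeated postal code when it immediately follows its previous occurrence. As a rough sketch of an order-independent alternative, the rows could be collected first and the neighborhoods joined with a pandas `groupby`. The sample rows and the names `rows`, `df_rows` and `df_joined` are illustrative assumptions, not part of the scraping code above.

```python
import pandas as pd

# Illustrative rows in the shape the scraping loop sees (one row per neighborhood)
rows = [
    ('M3A', 'North York', 'Parkwoods'),
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M4A', 'North York', 'Victoria Village'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),  # repeat that is NOT adjacent
]
df_rows = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])

# Join the neighborhoods of each postal code; sort=False keeps first-seen order
df_joined = (
    df_rows.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(df_joined)
```

With this approach the repeated M5A rows end up in a single row even though they are not next to each other.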


So we should now have 3 lists of equal length extracted from the table.

In [5]:
print(len(pc_list), len(bor_list), len(hood_list))

180 180 180


I can put these lists into a dataframe and start the process of cleaning and transforming to get the data in the final format that I need. 

Firstly I will change the "Not assigned" to None so that I can find and drop those rows easily later.

In [6]:
df_Toronto = pd.DataFrame()
df_Toronto['PostalCode'] = pc_list
df_Toronto['Borough'] = bor_list
df_Toronto['Neighborhood'] = hood_list
df_Toronto = df_Toronto.replace({'Not assigned': None})
df_Toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Harbourfront, Regent Park"


Now I need to drop the rows where the Borough column is None. I will also need to reset the index of the dataframe.

In [7]:
df_Toronto = df_Toronto[df_Toronto['Borough'].notnull()]
df_Toronto.reset_index(drop=True, inplace=True)
df_Toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,
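
The two cleaning steps so far (turning "Not assigned" into a missing value, then dropping rows with no Borough) can also be written with `dropna`. This is a minimal sketch on a small hand-made frame; `df_demo` and `df_clean` are hypothetical names, and the rows are just a few values copied from the outputs above.

```python
import pandas as pd

# Small illustrative frame with the same columns as df_Toronto
df_demo = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Turn the placeholder text into real missing values
df_demo = df_demo.replace({'Not assigned': None})

# Drop rows with no Borough and renumber the index in one chain
df_clean = df_demo.dropna(subset=['Borough']).reset_index(drop=True)
print(df_clean)
```

The M7A row survives because only its Neighborhood is missing, which matches what the real dataframe shows.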


One final thing to do is to make sure that any null values in the dataframe are fixed. This happens when the Borough is filled in but the Neighborhood is left blank. In that case I need to make the Neighborhood match the Borough.

In [8]:
df_new = df_Toronto[df_Toronto['Neighborhood'].isnull()]
print(df_new)

  PostalCode       Borough Neighborhood
4        M7A  Queen's Park         None


It looks like there is only one place where this happens, so I can correct it in the original dataframe.

In [9]:
df_Toronto.loc[4, 'Neighborhood'] = df_Toronto.loc[4,'Borough']
df_Toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
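
The fix above uses the row index 4 directly, which works here because there is only one affected row. If there were several, a more general option would be `fillna` with the Borough column. A minimal sketch, assuming a frame with the same column names (`df_demo` is a made-up example frame):

```python
import pandas as pd

# Illustrative frame: the second row has a Borough but no Neighborhood
df_demo = pd.DataFrame({
    'PostalCode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Lawrence Heights, Lawrence Manor', None],
})

# Wherever Neighborhood is missing, copy the Borough from the same row
df_demo['Neighborhood'] = df_demo['Neighborhood'].fillna(df_demo['Borough'])
print(df_demo)
```

Rows that already have a Neighborhood are left untouched.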


The last request in this section is to show the number of rows and columns in my completed dataframe.

In [10]:
df_Toronto.shape

(103, 3)

---

## Part 2.

We now need to get the latitude and longitude coordinates for the centre of each postal code. First I will install the geocoder package as I haven't used it here before.

In [11]:
#conda install -c conda-forge geocoder

In [12]:
# import geocoder
# postcode = 'M5G'

# # initialize your variable to None
# lat_lng_coords = None

# # loop until you get the coordinates
# while(lat_lng_coords is None):
#     g = geocoder.google('{}, Toronto, Ontario'.format(postcode))
#     lat_lng_coords = g.latlng

# latitude = lat_lng_coords[0]
# longitude = lat_lng_coords[1]

# print(latitude, longitude)

The above code snippet took **way** too long to resolve and that was just one single postcode, so I downloaded the csv file provided in the project directions. 
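
Part of the problem is that the `while` loop never gives up if the service keeps returning nothing. A rough sketch of a bounded retry is below. The function `geocode_with_retries` and the stand-in `flaky_lookup` are hypothetical helpers written for this example; a real version would pass in a function that calls the geocoding service.

```python
import time

def geocode_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before trying again
    return None  # give up instead of looping forever

# Stand-in for a real geocoding call: fails twice, then returns the M5G
# coordinates listed in the provided csv file
attempts = {'count': 0}

def flaky_lookup(query):
    attempts['count'] += 1
    return None if attempts['count'] < 3 else [43.657952, -79.387383]

coords = geocode_with_retries(flaky_lookup, 'M5G, Toronto, Ontario')
print(coords)

# A lookup that never succeeds returns None after max_attempts tries
never = geocode_with_retries(lambda q: None, 'M5G, Toronto, Ontario', max_attempts=3)
print(never)
```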

I now need to convert the csv into a dataframe and merge it with the original `df_Toronto` dataframe. The name of the first column doesn't quite match the other dataframe so I will rename it.


In [13]:
df_coords = pd.read_csv('Geospatial_Coordinates.csv')

df_coords.rename(columns={
    'Postal Code': 'PostalCode'},
    inplace=True)

df_coords.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [49]:
df_toronto_merged = pd.merge(df_Toronto, df_coords, on='PostalCode')
df_toronto_merged.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
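
The default inner merge silently drops any postal code that is missing from the coordinates file. As a sketch of how that could be checked, `pd.merge` accepts `indicator=True` and `validate=`. The frames `left` and `right` below are small made-up examples, and 'M0X' is an invented code that deliberately has no coordinates.

```python
import pandas as pd

# Illustrative frames: 'M0X' is an invented postal code with no coordinates
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M0X'],
                     'Borough': ['North York', 'North York', 'Example Borough']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# how='left' keeps every postal code; indicator adds a '_merge' column;
# validate raises an error if a postal code appears more than once on either side
checked = pd.merge(left, right, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

# Postal codes that found no match in the coordinates frame
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the inner merge used above loses no rows.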


---

## Part 3.

I have to import the libraries I will need for this section, including json, KMeans from sklearn and folium for the mapping. 

In [103]:
import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

from geopy.geocoders import Nominatim

import numpy as np

Firstly I want to see the spread of the Postal Codes across Toronto, so I use the geolocator to get me the coordinates for Toronto. 

In [21]:
address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="can_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


I am going to drop the following boroughs and all their associated postal codes for ease of generating the Foursquare data. 
* Mississauga
* Etobicoke
* Scarborough
* York
* North York
* East York
* Queen's Park


In [92]:
todrop = ['Mississauga', 'Etobicoke', 'Scarborough', 'York', 'North York', 'East York', 'Queen\'s Park']
df_toronto_small = df_toronto_merged[~df_toronto_merged['Borough'].isin(todrop)]
print('The dataframe has {} boroughs and {} Postal Codes.'.format(
        len(df_toronto_small['Borough'].unique()),
        df_toronto_small.shape[0]
    )
)
df_toronto_small.reset_index(drop=True, inplace=True)
df_toronto_small

The dataframe has 4 boroughs and 38 Postal Codes.


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
1,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
5,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
6,M6G,Downtown Toronto,Christie,43.669542,-79.422564
7,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
8,M6H,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752
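
Every borough left in the rows shown above has "Toronto" in its name, so the same filter could be written the other way round by keeping matches instead of dropping a list. This is only a sketch, and it assumes that rule holds for the whole table, which I have only checked against the rows displayed here. `df_demo` and `df_central` are made-up names.

```python
import pandas as pd

# Illustrative subset of the merged dataframe
df_demo = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M4E', 'M6H', 'M9A'],
    'Borough': ['North York', 'Downtown Toronto', 'East Toronto',
                'West Toronto', 'Etobicoke'],
})

# Keep only boroughs whose name contains 'Toronto'
keep = df_demo['Borough'].str.contains('Toronto', na=False)
df_central = df_demo[keep].reset_index(drop=True)
print(df_central)
```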


Now I use the dataframe with the latitude and longitude and the postal codes and boroughs to build a map of the area with the centres of each postal code marked.  

In [93]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)
# add markers to map
for lat, lng, postalcode, borough in zip(df_toronto_small['Latitude'], 
                                         df_toronto_small['Longitude'], 
                                         df_toronto_small['PostalCode'], 
                                         df_toronto_small['Borough']):
    label = '{}, {}'.format(postalcode, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [None]:
import os

# read the Foursquare credentials from environment variables instead of hard-coding them
# (the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are my own choice)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Credentials loaded.')

I can now get the Foursquare data for each postal code in `df_toronto_small`. I am reusing a function that was created in the New York example.

In [95]:
radius = 500
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Postal Latitude', 
                  'Postal Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    print('Complete!')
    return(nearby_venues)
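
The function above indexes straight into the response, so it raises an error if a postal code comes back with no groups or a venue has no categories. Below is a minimal sketch of a more defensive way to pull out the same fields. `extract_venues` is a hypothetical helper, and the `sample` payload is hand-written to mirror the keys the function above reads; it is not real Foursquare output.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response dict."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []  # nothing came back for this location
    venues = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        categories = venue.get('categories', [])
        # fall back to None when a venue has no category listed
        category = categories[0]['name'] if categories else None
        venues.append((venue['name'],
                       venue['location']['lat'],
                       venue['location']['lng'],
                       category))
    return venues

# Hand-written payload for illustration only
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Example Park',
               'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({'response': {}}))
```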

Now I can call that function with the parameters from the `df_toronto_small` dataframe. 

In [96]:
toronto_venues = getNearbyVenues(names=df_toronto_small['PostalCode'],
                                   latitudes=df_toronto_small['Latitude'],
                                   longitudes=df_toronto_small['Longitude']
                                  )


M5A
M5B
M5C
M4E
M5E
M5G
M6G
M5H
M6H
M5J
M6J
M4K
M5K
M6K
M4L
M5L
M4M
M4N
M5N
M4P
M5P
M6P
M4R
M5R
M6R
M4S
M5S
M6S
M4T
M5T
M4V
M5V
M4W
M5W
M4X
M5X
M4Y
M7Y
Complete!


In [97]:
print(toronto_venues.shape)
toronto_venues.groupby('Postal Code').count()

(1711, 7)


Postal Code,Postal Latitude,Postal Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M4E,4,4,4,4,4,4
M4K,44,44,44,44,44,44
M4L,22,22,22,22,22,22
M4M,37,37,37,37,37,37
M4N,4,4,4,4,4,4
M4P,8,8,8,8,8,8
M4R,22,22,22,22,22,22
M4S,33,33,33,33,33,33
M4T,3,3,3,3,3,3
M4V,15,15,15,15,15,15


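Now we need to build a one-hot dataframe that adds a column for each venue category, with a 1 or 0 in each venue's row for whether it belongs to that category. Grouping by postal code and taking the mean then gives the frequency of each category within each postal code.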

In [98]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add postal code column back to dataframe
toronto_onehot['Postal Code'] = toronto_venues['Postal Code']

# move postal code column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_grouped = toronto_onehot.groupby('Postal Code').mean().reset_index()
toronto_grouped.shape

(38, 237)
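
To make the one-hot and mean steps concrete, here is a tiny sketch with invented venues for two postal codes. The categories and counts are made up for illustration; only the column names match the real dataframe.

```python
import pandas as pd

# Invented venues: 2 for M4E and 4 for M5A
venues = pd.DataFrame({
    'Postal Code': ['M4E', 'M4E', 'M5A', 'M5A', 'M5A', 'M5A'],
    'Venue Category': ['Pub', 'Trail', 'Coffee Shop', 'Coffee Shop', 'Bakery', 'Park'],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Postal Code', venues['Postal Code'])

# The mean of the 0/1 values is the share of venues in each category
freq = onehot.groupby('Postal Code').mean()
print(freq)
```

Two of the four M5A venues are coffee shops, so that cell comes out as 0.5, and each row sums to 1.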

I'm going to cluster the neighborhoods to find out which ones are similar. I'm going to estimate that 6 clusters will suffice. 

In [127]:
# set number of clusters
kclusters = 6

toronto_grouped_clustering = toronto_grouped.drop(columns='Postal Code')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 2, 2, 2, 2, 2, 2, 5, 2], dtype=int32)
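
The choice of 6 clusters is an estimate. One common way to sanity-check a value of k is to compare silhouette scores across a few candidates. The sketch below uses synthetic points, not the Toronto data, so it only shows the mechanics; `X`, `scores` and `best_k` are names invented for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fit k-means for several values of k and record the silhouette score of each
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score separates the points most cleanly
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On the real data the same loop would run on `toronto_grouped_clustering`, though with only 38 rows the scores may not point to a clear winner.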

In [128]:
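# toronto_grouped is sorted by postal code, so match the labels on postal code rather than by row position
cluster_labels = pd.Series(kmeans.labels_, index=toronto_grouped['Postal Code'])
df_toronto_small.insert(0, 'Cluster Labels', df_toronto_small['PostalCode'].map(cluster_labels))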

In [129]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_toronto_small['Latitude'], 
                                  df_toronto_small['Longitude'], 
                                  df_toronto_small['PostalCode'], 
                                  df_toronto_small['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

What I am observing is that a few clusters are so different from the rest that they end up with only 1 or 2 places in them. However, there are another 2 clusters with quite a few locations in them.
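
One way to check this observation is to count how many postal codes land in each cluster. A minimal sketch, using only the first ten labels printed earlier by `kmeans.labels_[0:10]`; in the notebook the full `kmeans.labels_` array would be used instead.

```python
import pandas as pd

# First ten labels from the k-means output shown earlier
labels = [2, 2, 2, 2, 2, 2, 2, 2, 5, 2]

# Count the members of each cluster, ordered by cluster number
cluster_sizes = pd.Series(labels).value_counts().sort_index()
print(cluster_sizes)
```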