# IBM Data Science
## Coursera Capstone notebook Week 3
This notebook analyses location data for Toronto as part of the capstone project of the IBM Data Science course.

**All three parts are in this notebook. Please scroll to the relevant part.**

### Part 1: Setting up the notebook
In this section, we install and import the necessary packages, scrape the data from <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">Wikipedia</a> using Beautiful Soup, and then explore and clean the data.

In [1]:
"""Install (if needed) and import the necessary packages"""
# Install commands are commented out because the packages are already installed
# !pip install beautifulsoup4
# !pip install lxml
# !conda install -c conda-forge geopy --yes
# !pip install requests

import json
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

# Show every row and column, and display floats to 3 decimal places
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option("display.precision", 3)

In [2]:
"""use beautiful soup to import the data"""
source_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text #obtains source code as text
soup = bs(source_html, 'lxml') #uses beautiful soup to parse the source code
# print(soup.prettify()) #prints the html with appropriate indents - this was used to identify which arguments to use to find the table etc.
wikitable = soup.tbody #accesses the first <tbody> on the page, which holds the postcode table

In [3]:
def tableDataText(table):
    """Parse an HTML table into a list of rows, each a list of cell strings.

    Expects a <table> (or <tbody>) element containing <tr> rows with <td> cells.
    If the first row contains <th> cells, it is returned first as the header row;
    <th> cells in later rows are ignored.
    """
    def rowgetDataText(tr, coltag='td'):  # coltag: 'td' (data) or 'th' (header)
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]

    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    if headerow:  # if there is a header row, include it first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs:  # every remaining table row
        rows.append(rowgetDataText(tr, 'td'))
    return rows
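
To make the return shape of `tableDataText` concrete, here is a dependency-free sketch of the same row-and-cell extraction using Python's built-in `html.parser` instead of Beautiful Soup. It is not the notebook's method: the class name `TableRows` and the sample HTML are made up for illustration, and the only aim is to show the list-of-rows structure that the later cells expect.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> (illustrative)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell strings per <tr>
        self._cell = None  # text fragments while inside a cell, otherwise None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th') and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # Join the fragments and trim whitespace, like get_text(strip=True)
            self.rows[-1].append(''.join(self._cell).strip())
            self._cell = None


# Made-up sample in the same layout as the Wikipedia table
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td><a href="#">Parkwoods</a> </td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
rows = parser.rows
print(rows)
```

The first inner list is the header row and the rest are data rows, which is why the next cell builds the dataframe from `wikiclean[1:]` with `wikiclean[0]` as the column names.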

In [4]:
wikiclean = tableDataText(wikitable)  # apply the function above to the Wikipedia table

nbhd = pd.DataFrame(wikiclean[1:], columns=wikiclean[0])  # first row holds the column names

nbhd = nbhd[nbhd.Borough != 'Not assigned'].copy()  # drop rows with no borough assigned
# where the neighbourhood is 'Not assigned', use the borough name from the same row
nbhd['Neighbourhood'] = nbhd['Neighbourhood'].mask(nbhd['Neighbourhood'] == 'Not assigned', nbhd['Borough'])

# combine neighbourhoods that share a postcode into one comma-separated string
nbhd2 = nbhd.groupby(['Postcode'])['Neighbourhood'].apply(", ".join)

nbhd2 = nbhd2.rename('Neighbourhoods')  # name the Series so it joins as a new column
nbhd = nbhd.join(nbhd2, on='Postcode', how='inner')  # nbhd2 is indexed by postcode
nbhd = nbhd.drop(['Neighbourhood'], axis=1)  # remove the original Neighbourhood column
nbhd = nbhd.drop_duplicates()  # leave one row per postcode

nbhd = nbhd.sort_values(by=['Postcode'])  # sort by postcode
nbhd = nbhd.reset_index(drop=True)  # reset the index
nbhd.head(10)

| | Postcode | Borough | Neighbourhoods |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [5]:
nbhd.shape

(103, 3)
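
The cleaning steps in Part 1 (drop unassigned boroughs, fall back to the borough name, merge neighbourhoods per postcode) can also be written as one grouped aggregation. The following is a minimal sketch on made-up rows, not the scraped table; the names `raw` and `clean_postcodes` are hypothetical, and it assumes each postcode belongs to exactly one borough.

```python
import pandas as pd

# Toy rows in the same three-column layout as the scraped table (made up)
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Harbourfront',
                      'Regent Park', 'Not assigned'],
})


def clean_postcodes(df):
    """Return one row per postcode with its neighbourhoods comma-joined."""
    # Drop rows whose borough is unknown
    df = df[df['Borough'] != 'Not assigned'].copy()
    # Where the neighbourhood is unknown, fall back to the borough name
    unassigned = df['Neighbourhood'] == 'Not assigned'
    df.loc[unassigned, 'Neighbourhood'] = df.loc[unassigned, 'Borough']
    # Group by postcode and borough, joining the neighbourhood names;
    # groupby sorts the keys, so the result is ordered by postcode
    return (df.groupby(['Postcode', 'Borough'], as_index=False)['Neighbourhood']
              .agg(', '.join)
              .rename(columns={'Neighbourhood': 'Neighbourhoods'}))


cleaned = clean_postcodes(raw)
print(cleaned)
```

Grouping on both `Postcode` and `Borough` keeps the borough column in the result, so the separate join, drop, and de-duplication steps are not needed in this variant.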

### Part 2: obtaining latitude and longitude
In this part, we obtain the latitude and longitude for each postcode and add them to the dataframe.

In [6]:
# !pip install geocoder # install the necessary package
import geocoder # import geocoder

In [7]:
latlong = pd.DataFrame(columns = ['Lat','Long']) #Create DF for latlong data
latlong

| | Lat | Long |
|---|---|---|


In [8]:
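postal_code = nbhd.Postcode  # postcodes, in the same order as the rows of nbhd
coords = []  # one {'Lat', 'Long'} record per postcode
for code in postal_code:
    # Using ArcGIS because Google rejected the requests.
    # Please note some lat/longs might differ from those in the rubric.
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
    lat, long = g.latlng
    coords.append({'Lat': lat, 'Long': long})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
latlong = pd.DataFrame(coords, columns=['Lat', 'Long'])
nbhd = pd.concat([nbhd, latlong], axis=1)  # combine column-wise; both use a 0..n-1 index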

In [9]:
nbhd.head()

| | Postcode | Borough | Neighbourhoods | Lat | Long |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.812 | -79.196 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.786 | -79.159 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.766 | -79.175 |
| 3 | M1G | Scarborough | Woburn | 43.768 | -79.218 |
| 4 | M1H | Scarborough | Cedarbrae | 43.77 | -79.239 |
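
The geocoding loop in Part 2 stops with an error if a single lookup comes back empty. Below is a minimal sketch of a retry wrapper, assuming a failed lookup yields `None` or an empty list. The function `geocode_with_retry` and its `lookup` parameter are hypothetical: `lookup` stands in for a call such as `lambda q: geocoder.arcgis(q).latlng`, and a fake lookup is used here so the example runs without network access.

```python
import time


def geocode_with_retry(query, lookup, attempts=3, pause=0.0):
    """Call lookup(query) until it returns a (lat, lng) pair or attempts run out.

    `lookup` is a hypothetical stand-in for a real geocoder call; it is
    assumed to return None or an empty list when the lookup fails.
    """
    for _ in range(attempts):
        result = lookup(query)
        if result:  # a non-empty [lat, lng] pair
            lat, lng = result
            return lat, lng
        time.sleep(pause)  # wait before the next attempt
    return None, None  # give the caller missing values instead of raising


# Fake lookup that fails once and then succeeds, to exercise the retry path
calls = {'n': 0}


def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 2 else [43.812, -79.196]


coords = geocode_with_retry('M1B, Toronto, Ontario', flaky_lookup)
print(coords, 'after', calls['n'], 'calls')
```

Returning `(None, None)` keeps the row count aligned with the postcode list, so rows with missing coordinates can be inspected or dropped afterwards.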


### Part 3: clustering and analysis
_In this part we explore the data and cluster neighbourhoods_

In [13]:
"""Import the necessary libraries for analysis"""
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# !conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library
# !conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim ## convert an address into latitude and longitude values
print('Installed!')

Installed!


_Let's first look at a map of Toronto with each postcode marked and labelled with its neighbourhoods and borough._

In [26]:
# Identify Toronto's overall coordinates
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))
# create map of Toronto 
map_1 = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(nbhd['Lat'], nbhd['Long'], nbhd['Borough'], nbhd['Neighbourhoods']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_1)  
    
map_1

The geographical coordinates of Toronto are 43.653963, -79.387207.


_We can see that the postcodes are spread fairly evenly across Toronto, with a denser cluster in the centre of town._
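
One way to put a number on "centre of town" is the great-circle distance from each postcode to the city coordinates printed above. The sketch below uses the standard haversine formula with a mean Earth radius of 6371 km; the function name `haversine_km` is made up here, and the two sample points are coordinates taken from the tables in this notebook.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_km * asin(sqrt(a))


toronto = (43.653963, -79.387207)  # city coordinates from the geocoder output
m1b = (43.812, -79.196)            # Rouge, Malvern (Scarborough)
m5b = (43.657, -79.378)            # Ryerson, Garden District (Downtown Toronto)

dist_m1b = haversine_km(*toronto, *m1b)
dist_m5b = haversine_km(*toronto, *m5b)
print('M1B: {:.1f} km, M5B: {:.1f} km'.format(dist_m1b, dist_m5b))
```

Applying the same function to every row would give a distance column that could be used to check how many postcodes fall within a chosen radius of the centre.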


_Let's look closer at the centre of town:_

In [29]:
centre_districts = ['Downtown Toronto','East Toronto','West Toronto','Central Toronto','York','East York'] #define the centre districts
# Select the rows for each centre district, keeping the district order above
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
centre = pd.concat([nbhd[nbhd['Borough'] == district] for district in centre_districts], sort=False)

centre = centre.reset_index(drop=True)
centre.head()

| | Postcode | Borough | Neighbourhoods | Lat | Long |
|---|---|---|---|---|---|
| 0 | M4W | Downtown Toronto | Rosedale | 43.682 | -79.378 |
| 1 | M4X | Downtown Toronto | Cabbagetown, St. James Town | 43.668 | -79.367 |
| 2 | M4Y | Downtown Toronto | Church and Wellesley | 43.667 | -79.381 |
| 3 | M5A | Downtown Toronto | Harbourfront | 43.65 | -79.359 |
| 4 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657 | -79.378 |
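
Selecting the centre boroughs can also be done with a single boolean mask instead of one filter per district. This is a minimal sketch on a made-up four-row frame (`toy` is hypothetical); it uses `Series.isin` for the filter and an ordered `pd.Categorical` so the rows still come out in the order of the district list.

```python
import pandas as pd

# Made-up rows; only the columns needed for the illustration
toy = pd.DataFrame({
    'Postcode': ['M4B', 'M4W', 'M1B', 'M6C'],
    'Borough': ['East York', 'Downtown Toronto', 'Scarborough', 'York'],
})
centre_districts = ['Downtown Toronto', 'York', 'East York']

# Keep only rows whose borough is in the list
subset = toy[toy['Borough'].isin(centre_districts)].copy()

# An ordered categorical makes sort_values follow the list order, not the alphabet
subset['Borough'] = pd.Categorical(subset['Borough'],
                                   categories=centre_districts, ordered=True)
subset = subset.sort_values(['Borough', 'Postcode']).reset_index(drop=True)
print(subset)
```

If the row order does not matter, the `isin` filter alone is enough and the categorical step can be skipped.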


In [33]:
# We're using Queen's Park as the centre of the map
address = "Queen's Park, Toronto, Ontario"
geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_2 = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(centre['Lat'], centre['Long'],centre['Borough'],centre['Neighbourhoods']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='pink',
        fill_opacity=0.7,
        parse_html=False).add_to(map_2)  
    
map_2

_Now let's start using Foursquare to examine the neighbourhoods._

In [34]:
CLIENT_ID = 'your-foursquare-client-id' # your Foursquare ID
CLIENT_SECRET = 'your-foursquare-client-secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
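
Credentials written directly into a notebook are easy to leak when the notebook is shared. A minimal sketch of one alternative is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions made for this example, not names required by Foursquare.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables.

    The variable names are illustrative assumptions, not a Foursquare requirement.
    """
    names = ('FOURSQUARE_CLIENT_ID', 'FOURSQUARE_CLIENT_SECRET')
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return env[names[0]], env[names[1]]


# Demonstration with a stand-in mapping rather than the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id',
            'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials(demo_env)
print('Loaded credentials for client', CLIENT_ID)
```

In a real session the function would be called with no argument so that it reads `os.environ`, and the variables would be set outside the notebook.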