# Toronto Neighbourhoods Project

# Question One

### First, install the bs4 package for Beautiful Soup

In [1]:
!conda install -c conda-forge bs4 --yes 

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - bs4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    beautifulsoup4-4.9.0       |   py36h9f0ad1d_0         160 KB  conda-forge
    bs4-4.9.0                  |                0           4 KB  conda-forge
    soupsieve-1.9.4            |   py36h9f0ad1d_1          58 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         222 KB

The following NEW packages will be INSTALLED:

  beautifulsoup4     conda-forge/linux-64::beautifulsoup4-4.9.0-py36h9f0ad1d_0
  bs4                conda-forge/noarch::bs4-4.9.0-0
  soupsieve          conda-forge/linux-64::soupsieve-1.9.4-py36h9f0ad1d_1



Downloading and Extracting Packag

### Import Libraries

In [2]:
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
import folium
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import requests # library to handle requests
from pandas.io.json import json_normalize # transform a JSON file into a pandas dataframe


### Get the wikipedia page with a url request.

In [3]:
#url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050'
page = urllib.request.urlopen(url)

#### Get the soup!

In [4]:
soup = BeautifulSoup(page,  "html.parser")
#soup = BeautifulSoup(page, "lxml")
#soup = BeautifulSoup(page, "html5lib")

In [5]:
#print(soup.prettify())

### Here we extract the table from the Wikipedia page; in the next two blocks
### we process the table column by column

In [6]:
right_table=soup.find('table', class_='wikitable sortable')
#right_table

In [7]:
A=[]
B=[]
C=[]

for row in right_table.findAll('tr'):
    cells=row.findAll('td')
    if len(cells)==3:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))

In [8]:
df_raw = pd.DataFrame(A,columns=['PostalCode'])
df_raw['Borough']=B
df_raw['Neighbourhood']=C
cols = df_raw.columns

### In the dataframe above, all the entries have '\n' at the end!
### It's annoying, so we remove them in the next block and check the results

In [9]:
df_proc = pd.DataFrame([[ent.split('\n')[0] for ent in df_raw[col]] for col in cols]).T
df_proc.columns = cols
df_proc.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
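
The newline clean-up can also be written with pandas string methods. The following is a minimal sketch on a small hypothetical frame (the rows are made up to mimic the scraped cells), not the code used in this notebook:

```python
import pandas as pd

# Hypothetical rows mimicking the scraped cells, each ending in a newline.
raw = pd.DataFrame({
    'PostalCode': ['M1A\n', 'M3A\n'],
    'Borough': ['Not assigned\n', 'North York\n'],
    'Neighbourhood': ['Not assigned\n', 'Parkwoods\n'],
})

# Strip leading and trailing whitespace (including '\n') from every column.
clean = raw.apply(lambda col: col.str.strip())
print(clean)
```

Unlike `split('\n')[0]`, `str.strip()` also removes stray spaces around the text.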


### We select the indices of all the entries whose Borough is not 'Not assigned'

In [10]:
in_sel = [i for i, bor in enumerate(df_proc['Borough']) if bor != 'Not assigned']

### We select the rows using the indices we found above and reset the index.

In [11]:
df_proc = df_proc.iloc[in_sel,:]
df_proc = df_proc.reset_index(drop = True)  # reset_index returns a new frame, so assign it back
df_proc.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
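
The index-list selection can be expressed more directly with a boolean mask. A small sketch on hypothetical rows (the frame `df` is made up for illustration):

```python
import pandas as pd

# Hypothetical sample in the same shape as the scraped table.
df = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
assigned = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```

Note that `reset_index` returns a new frame, so the result has to be assigned to a name.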


### Select the unique postcodes.

In [12]:
post_codes = df_proc['PostalCode'].unique()

#### In the next block we loop through each postcode...

### NB: We assume that each postcode has only one borough 

#### It's fine if one borough has several postcodes.

#### We collect the borough from the postcode, and all of the neighbourhoods as well
#### then stitch together the neighbourhoods as directed.

### Finally we reform the data frame.

In [13]:
# we check and find that none of the entries have missing, na, or 'Not assigned' neighbourhoods

not_ass = sum(df_proc['Neighbourhood'] == 'Not assigned')
na = sum(df_proc['Neighbourhood'].isna())
missing = sum(df_proc['Neighbourhood'] == '')
print('problems:', not_ass, na, missing)

problems: 0 0 0


In [14]:
new_frame = []
for post_code in post_codes:
    sub_frame = df_proc[df_proc['PostalCode'] == post_code]
    borough   = sub_frame['Borough'].values[0]
    hoods     = sub_frame['Neighbourhood'].values
    # Filter out any 'Not assigned' neighbourhoods, so a postal code with such
    # an entry simply has it left out of its neighbourhoods list
    hoods     = [hood for hood in hoods if hood != 'Not assigned']
    # If hoods is empty after filtering, the postal code only had 'Not assigned'
    # entries, so we put in the borough as requested in the spec.
    # The borough is wrapped in a list so that join does not split it into characters.
    if len(hoods) == 0:
        hoods = [borough]
    hoods     = ', '.join(hoods)
    new_frame.append([post_code, borough, hoods])

df_proc = pd.DataFrame(new_frame)
df_proc.columns = cols
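
The same stitching can be sketched with `groupby` and `agg` instead of an explicit loop. This is an alternative formulation on hypothetical rows (the last postal code and its borough are made up), under the same assumption that each postcode has one borough:

```python
import pandas as pd

# Hypothetical sample; 'M9Z' / 'Example Borough' are invented for illustration.
df = pd.DataFrame({
    'PostalCode': ['M5A', 'M6A', 'M6A', 'M9Z'],
    'Borough': ['Downtown Toronto', 'North York', 'North York', 'Example Borough'],
    'Neighbourhood': ['Harbourfront', 'Lawrence Heights', 'Lawrence Manor', 'Not assigned'],
})

def combine(hoods):
    """Join the assigned neighbourhood names of one postal code."""
    named = [h for h in hoods if h != 'Not assigned']
    return ', '.join(named)

grouped = (df.groupby('PostalCode', sort=False)
             .agg({'Borough': 'first', 'Neighbourhood': combine})
             .reset_index())

# Fall back to the borough where no neighbourhood name survived the filter.
empty = grouped['Neighbourhood'] == ''
grouped.loc[empty, 'Neighbourhood'] = grouped.loc[empty, 'Borough']
print(grouped)
```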

#### Check to see if any entries have neighbourhood not assigned

In [15]:
sum(df_proc['Neighbourhood'] == 'Not assigned')


0

#### Check to see if any entries have borough == Neighbourhood

In [16]:
print('The number of postcodes without even one neighbourhood assigned was:',
     sum(df_proc['Neighbourhood']==df_proc['Borough']))

The number of postcodes without even one neighbourhood assigned was: 0


### The number of entries was 103

In [17]:
df_proc.shape[0]

103

# Question Two

Here we will install the package geocoder and import it

In [18]:
!conda install -c conda-forge geocoder --yes 

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geocoder


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    brotlipy-0.7.0             |py36h8c4c3a4_1000         346 KB  conda-forge
    chardet-3.0.4              |py36h9f0ad1d_1006         188 KB  conda-forge
    click-7.1.2                |     pyh9f0ad1d_0          64 KB  conda-forge
    cryptography-2.9.2         |   py36h45558ae_0         613 KB  conda-forge
    future-0.18.2              |   py36h9f0ad1d_1         714 KB  conda-forge
    geocoder-1.38.1            |             py_1          53 KB  conda-forge
    pysocks-1.7.1              |   py36h9f0ad1d_1          27 KB  conda-forge
    ratelim-0.1.6              |             py_2           6 KB  conda-forge
    ---------------------

In [19]:
import geocoder

In [20]:
#import geocoder # import geocoder
#postal_code = post_codes[0]
# initialize your variable to None
#lat_lng_coords = None

# loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]

The geocoder didn't work reliably, so we reverted to the csv file and added in
the latitude and longitude for each postcode.
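
A `while` loop that waits for coordinates never ends if the service keeps returning nothing. A minimal sketch of a bounded retry follows; `lookup_with_retries` and `flaky_lookup` are hypothetical names, and the stand-in lookup replaces any real geocoding call:

```python
import time

def lookup_with_retries(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay)  # pause before trying again
    return None  # give up instead of looping forever

# Stand-in for a geocoding call: returns None twice, then succeeds.
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else (43.753259, -79.329656)

coords = lookup_with_retries(flaky_lookup, 'M3A, Toronto, Ontario')
print(coords)
```

When the function returns `None`, the caller can fall back to another source such as the csv file.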

In [21]:
co_ords = pd.read_csv('Geospatial_Coordinates.csv', index_col = 0)
co_ords = co_ords.loc[df_proc['PostalCode'].values, :]
co_ords.index = df_proc.index

In [22]:
co_ord_cols = ['Latitude', 'Longitude']
df_proc = df_proc.assign(Latitude = co_ords['Latitude'].values)
df_proc = df_proc.assign(Longitude = co_ords['Longitude'].values)
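
Attaching the coordinates can also be sketched as a key-based merge, which does not depend on row order. The frames below are hypothetical stand-ins, and the column names of the coordinate frame are assumptions rather than the actual csv header:

```python
import pandas as pd

codes = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
})

# Hypothetical stand-in for the csv contents, deliberately in a different order.
coords = pd.DataFrame({
    'PostalCode': ['M4A', 'M3A', 'M5A'],
    'Latitude': [43.725882, 43.753259, 43.65426],
    'Longitude': [-79.315572, -79.329656, -79.360636],
})

# A left merge keeps every postal code; validate raises if a key is duplicated.
merged = codes.merge(coords, on='PostalCode', how='left', validate='one_to_one')
print(merged)
```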

In [23]:
df_proc.head()

|   | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park | 43.662301 | -79.389494 |


##### We set up our credentials for the Foursquare queries

In [24]:
CLIENT_ID = '0ZWEKPOSZMJ42IP2FK5P4YF2C1ZW51YYLJ2V4RENT3OMP4PX' # your Foursquare ID
CLIENT_SECRET = 'UF1DT3CFB4ZAEQZ1G2ZJXHATPN2HHUTMPJCCETHJLWCXO1DC' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 0ZWEKPOSZMJ42IP2FK5P4YF2C1ZW51YYLJ2V4RENT3OMP4PX
CLIENT_SECRET:UF1DT3CFB4ZAEQZ1G2ZJXHATPN2HHUTMPJCCETHJLWCXO1DC
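
Credentials written into a notebook are easy to leak when the notebook is shared. One common alternative, sketched here under the assumption that the keys are stored in environment variables, is to read them at run time. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical:

```python
import os

def load_credentials(env=os.environ):
    """Read the two keys from a mapping (the process environment by default)."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret

def mask(value, visible=4):
    """Show only the first few characters when printing a key."""
    return value[:visible] + '*' * (len(value) - visible)

# Demo mapping with placeholder values instead of real keys.
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id-123',
            'FOURSQUARE_CLIENT_SECRET': 'demo-secret-456'}
client_id, client_secret = load_credentials(demo_env)
print('CLIENT_ID:', mask(client_id))
```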


#### Function to get and process the nearby venues into a dataframe

In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)  # LIMIT is a global that must be set before this function is called
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
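
The request URL can also be assembled from a parameter dictionary, which keeps the query string readable and handles escaping. This is a sketch only: `build_explore_url` is a hypothetical helper, the credentials are placeholders, and no request is sent:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the venues/explore URL with the same parameters as the notebook."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

url = build_explore_url('demo-id', 'demo-secret', '20180605', 43.725882, -79.315572)

# Parse the query string back to check what would be sent.
query = parse_qs(urlsplit(url).query)
print(query['ll'], query['radius'], query['limit'])
```

Passing `limit` as an argument also removes the dependency on a global variable.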

### Running the function

In [26]:
radius = 500
LIMIT = 100
toronto_venues = getNearbyVenues(names=df_proc['Neighbourhood'],
                                 latitudes=df_proc['Latitude'],
                                 longitudes=df_proc['Longitude']
                                )

#sub = df_proc[df_proc['Neighbourhood'] == 'Upper Rouge']
#print(sub)
#toronto_venues1 = getNearbyVenues(names=sub['Neighbourhood'],
#                                   latitudes=sub['Latitude'],
#                                   longitudes=sub['Longitude']
#                                  )

Parkwoods
Victoria Village
Harbourfront
Lawrence Heights, Lawrence Manor
Queen's Park
Islington Avenue
Rouge, Malvern
Don Mills North
Woodbine Gardens, Parkview Hill
Ryerson, Garden District
Glencairn
Cloverdale, Islington, Martin Grove, Princess Gardens, West Deane Park
Highland Creek, Rouge Hill, Port Union
Flemingdon Park, Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Bloordale Gardens, Eringate, Markland Wood, Old Burnhamthorpe
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Downsview North, Wilson Heights
Thorncliffe Park
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
East Birchmount Park, Ionview, Kennedy Park
Bayview Village
CFB Toronto, Downsview East
The Danforth West,

In [27]:
toronto_venues.groupby('Neighborhood').count().shape

(98, 6)

In [28]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 270 unique categories.


### We encode the categories using the one hot encoding.

In [29]:
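# one hot encoding

toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
print(toronto_onehot.shape)
# 'Neighborhood' is itself one of the venue categories; drop that column so it
# does not clash with the neighbourhood-name column inserted below
toronto_onehot = toronto_onehot.drop(['Neighborhood'], axis = 1)
# add the neighbourhood-name column as the first column of the dataframe
toronto_onehot.insert(0, 'Neighborhood', toronto_venues['Neighborhood'].values)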


(2116, 270)


In [30]:
### We group the venues per neighbourhood and work out the relative frequencies.

In [31]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

|   | Neighborhood | Accessories Store | Afghan Restaurant | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | ... | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelaide, King, Richmond | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.021505 | ... | 0.010753 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.010753 | 0.0 |
| 1 | Agincourt | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Agincourt North, L'Amoreaux East, Milliken, St... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Albion Gardens, Beaumond Heights, Humbergate, ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Alderwood, Long Branch | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
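
To make the relative frequencies concrete, here is a small sketch on hypothetical venues (neighbourhoods 'A' and 'B' and their venues are made up). Averaging the one-hot columns per neighbourhood gives the share of each category:

```python
import pandas as pd

# Hypothetical venues: three in neighbourhood A, one in neighbourhood B.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Park'],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'].values)

# The mean of a 0/1 column is the fraction of venues in that category.
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

In this sample, two of the three venues in 'A' are coffee shops, so that entry is 2/3, and each row sums to 1.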


### We perform the clustering and label the points.
Note that we have lost 5 rows: when we downloaded the data from Foursquare,
those had no data for nearby venues. Before the clustering we normalize the frequency rows.

In [32]:

df_proc_cluster = normalize(toronto_grouped.drop('Neighborhood', axis=1))
# set number of clusters
kclusters = 3

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_proc_cluster)

# attach each cluster label to its neighbourhood by name: toronto_grouped is sorted
# alphabetically, so its row order does not match the row order of df_proc
labels = pd.DataFrame({'Neighbourhood': toronto_grouped['Neighborhood'].values,
                       'Label': kmeans.labels_})
df_proc_out = df_proc.merge(labels, on='Neighbourhood', how='inner')
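
With its default settings, `normalize` scales each row to unit Euclidean length. A minimal NumPy sketch of that row-wise scaling on a made-up array (this mirrors the idea and is not the scikit-learn implementation):

```python
import numpy as np

# Made-up rows; the last one is all zeros.
X = np.array([[3.0, 4.0],
              [0.0, 2.0],
              [0.0, 0.0]])

# Euclidean length of each row, kept as a column for broadcasting.
norms = np.linalg.norm(X, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # leave all-zero rows unchanged instead of dividing by 0

X_unit = X / norms
print(X_unit)
```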

Function for the most common venues

In [33]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
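
A sketch of how this function might be applied to a grouped frequency table. The frame `grouped` and its values are hypothetical, and the output column names are chosen for illustration:

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the neighbourhood name in position 0
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical frequency table with the neighbourhood name as the first column.
grouped = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Coffee Shop': [0.6, 0.1],
    'Park': [0.3, 0.7],
    'Bakery': [0.1, 0.2],
})

# One row of top categories per neighbourhood.
top = pd.DataFrame(
    [return_most_common_venues(row, 2) for _, row in grouped.iterrows()],
    columns=['1st Most Common Venue', '2nd Most Common Venue'],
)
top.insert(0, 'Neighborhood', grouped['Neighborhood'])
print(top)
```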

### We create the map in folium

In [34]:
# create map

latitude  = 43.6532
longitude = -79.3832


map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_proc_out['Latitude'], df_proc_out['Longitude'], 
                                  df_proc_out['Neighbourhood'], df_proc_out['Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters

### Suppose we explore the areas a bit - we choose the second area on the list and drill down into what is there.

In [35]:
neighborhood_latitude = df_proc.loc[1, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_proc.loc[1, 'Longitude'] # neighborhood longitude value

neighborhood_name = df_proc.loc[1, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Victoria Village are 43.725882299999995, -79.31557159999998.


In [36]:
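# build the request URL
# NOTE: latitude and longitude are the map-centre values set in the folium cell,
# so this query returns venues around central Toronto; use neighborhood_latitude
# and neighborhood_longitude to query the neighbourhood selected above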
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    latitude, 
    longitude, 
    VERSION, 
    radius, 
    LIMIT)
# get the json file.
results = requests.get(url).json()

We need a helper function to extract the category of each venue.

In [37]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
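
A short sketch of how the helper behaves on different inputs. The rows below are hypothetical dictionaries standing in for dataframe rows, and the function is repeated (catching `KeyError` explicitly) so the example runs on its own:

```python
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened JSON uses the dotted column name instead
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hypothetical rows: flattened key, plain key, and a venue without categories.
flat_row = {'venue.name': 'Indigo', 'venue.categories': [{'name': 'Bookstore'}]}
plain_row = {'name': 'LUSH', 'categories': [{'name': 'Cosmetics Shop'}]}
empty_row = {'venue.name': 'Unknown', 'venue.categories': []}

for row in (flat_row, plain_row, empty_row):
    print(get_category_type(row))
```

Only the first category of each venue is used; a venue with an empty category list yields `None`.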

Normalise and filter the data - we arrive at a dataframe of the nearby venues in the area.

In [38]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues = nearby_venues[nearby_venues['categories'] != 'Neighborhood']

nearby_venues.head()


|   | name | categories | lat | lng |
|---|---|---|---|---|
| 1 | Nathan Phillips Square | Plaza | 43.65227 | -79.383516 |
| 2 | Indigo | Bookstore | 43.653515 | -79.380696 |
| 3 | LUSH | Cosmetics Shop | 43.653557 | -79.3804 |
| 4 | CF Toronto Eaton Centre | Shopping Mall | 43.65454 | -79.380677 |
| 5 | M Square Coffee Co | Coffee Shop | 43.651218 | -79.383555 |