### Data Scraping from Wikipedia

In the first section of the lab, data is scraped from Wikipedia and converted into a pandas DataFrame that the rest of the lab works with.

In [1]:
!conda install -c conda-forge beautifulsoup4 --yes 

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    beautifulsoup4-4.8.2       |           py36_0         157 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following packages will be UPDATED:

    beautifulsoup4:  4.7.1-py36_1      --> 4.8.2-py36_0          conda-forge
    ca-certificates: 2019.11.27-0      --> 2019.11.28-hecc5488_0 conda-forge
    certifi:         2019.11.28-py36_0 --> 2019.11.28-py36_0     conda-f

In [2]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import requests

In [3]:
results = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(results.content, 'html.parser')  # name the parser explicitly so the result does not depend on which parsers are installed

In [4]:
table = soup.find_all('table')[0]

In [5]:
toronto_df_list = pd.read_html(str(table))
print(toronto_df_list)

[    Postcode           Borough  \
0        M1A      Not assigned   
1        M2A      Not assigned   
2        M3A        North York   
3        M4A        North York   
4        M5A  Downtown Toronto   
5        M6A        North York   
6        M6A        North York   
7        M7A  Downtown Toronto   
8        M8A      Not assigned   
9        M9A         Etobicoke   
10       M1B       Scarborough   
11       M1B       Scarborough   
12       M2B      Not assigned   
13       M3B        North York   
14       M4B         East York   
15       M4B         East York   
16       M5B  Downtown Toronto   
17       M5B  Downtown Toronto   
18       M6B        North York   
19       M7B      Not assigned   
20       M8B      Not assigned   
21       M9B         Etobicoke   
22       M9B         Etobicoke   
23       M9B         Etobicoke   
24       M9B         Etobicoke   
25       M9B         Etobicoke   
26       M1C       Scarborough   
27       M1C       Scarborough   
28       M1C 

In [6]:
toronto_df = toronto_df_list[0]
toronto_df.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
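
To make the table-extraction step less opaque, here is a minimal, dependency-free sketch of what pulling rows out of the first `<table>` on a page involves. It is not how `pd.read_html` is implemented; the class name `TableRows` and the inline HTML snippet are assumptions made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in the first <table> encountered."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._cell = None       # text fragments of the cell currently being read
        self._tables_seen = 0
        self._in_first_table = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
            self._in_first_table = self._tables_seen == 1
        elif not self._in_first_table:
            return
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_first_table = False
        elif tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical snippet shaped like the Wikipedia table used in this lab.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table><tr><td>ignored</td></tr></table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *body = parser.rows
```

The `header` and `body` lists could then be handed to `pd.DataFrame(body, columns=header)` to get a frame similar to `toronto_df`.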


Filtering out all rows whose Borough is "Not assigned".

Where a Neighbourhood is "Not assigned", its value is replaced with the name of its Borough.

In [7]:
# .copy() gives an independent DataFrame, which avoids SettingWithCopyWarning on the assignment below
toronto_df_filtered = toronto_df[toronto_df['Borough'] != "Not assigned"].copy()
unassigned = toronto_df_filtered["Neighbourhood"] == "Not assigned"
toronto_df_filtered.loc[unassigned, "Neighbourhood"] = toronto_df_filtered.loc[unassigned, "Borough"]
toronto_df_filtered.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
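
The filter-then-fill step can be tried in isolation on a toy frame. This is a small sketch under stated assumptions: the three rows and the names `raw`, `assigned` and `missing` are illustrative and are not taken from the scraped data.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Drop rows without a borough; .copy() makes the result an independent frame.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Where the neighbourhood is missing, fall back to the borough name.
missing = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]

print(assigned)
```

Selecting the same boolean mask on both sides of the assignment keeps the intent explicit: only the masked rows are read and only the masked rows are written.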


Grouping by Postcode and Borough.  Aggregating Neighbourhoods into lists

In [8]:
toronto_df_postcode_group = toronto_df_filtered.groupby(["Postcode", "Borough"], as_index=False).agg(lambda x: x.tolist())
toronto_df_postcode_group.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"[Rouge, Malvern]"
1,M1C,Scarborough,"[Highland Creek, Rouge Hill, Port Union]"
2,M1E,Scarborough,"[Guildwood, Morningside, West Hill]"
3,M1G,Scarborough,[Woburn]
4,M1H,Scarborough,[Cedarbrae]


Converting each Python list into a single string, with the elements of the list separated by commas.

In [9]:
toronto_df_neighbourhood = toronto_df_postcode_group.copy()  # copy so the grouped frame keeps its lists
toronto_df_neighbourhood["Neighbourhood"] = [','.join(map(str, l)) for l in toronto_df_postcode_group['Neighbourhood']]
toronto_df_neighbourhood.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
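
The list aggregation and the comma join shown above can also be done in one pass. The following is a sketch on toy rows, not a replacement for the cells above; the frame names `rows` and `combined` are assumptions for illustration.

```python
import pandas as pd

# Toy rows: two neighbourhoods share postcode M1B, one stands alone.
rows = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Group on postcode and borough, then join each group's neighbourhood
# names directly into one comma-separated string.
combined = (
    rows.groupby(["Postcode", "Borough"], as_index=False)["Neighbourhood"]
        .agg(",".join)
)

print(combined)
```

Selecting the `Neighbourhood` column before aggregating keeps the string join from being applied to any other column.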


In [10]:
toronto_df_neighbourhood.shape

(103, 3)

The cell that loads the geospatial latitude/longitude data originally contained access credentials. They have been removed for safety.

In [11]:

import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
# (The credentialed client code that defined `body` has been removed from this cell.)

# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_1 = pd.read_csv(body)
df_data_1.head()


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [90]:
toronto_geospace_df = toronto_df_neighbourhood.join(df_data_1, how="inner").drop("Postal Code", axis=1)
toronto_geospace_df.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
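
The `join` above pairs rows by their positional index, which works here because both frames happen to list the same postcodes in the same order. A key-based merge does not rely on that. The sketch below uses two toy frames (deliberately in different orders) and assumes the column names `Postcode` and `Postal Code` from this lab; the variable names are illustrative.

```python
import pandas as pd

# Toy frames, intentionally sorted differently from each other.
neighbourhoods = pd.DataFrame({
    "Postcode": ["M1C", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Match rows on the postcode value itself rather than on row position.
# validate="one_to_one" raises if either side has duplicate keys.
merged = neighbourhoods.merge(
    coords,
    left_on="Postcode",
    right_on="Postal Code",
    how="inner",
    validate="one_to_one",
).drop(columns="Postal Code")

print(merged)
```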


Keeping only the Boroughs whose name contains "Toronto", to narrow the search area.

In [91]:
toronto_geospace_filtered = toronto_geospace_df[toronto_geospace_df["Borough"].str.contains("Toronto")]
toronto_geospace_filtered.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [14]:
!conda install -c conda-forge folium=0.5.0 geopy --yes 
from geopy.geocoders import Nominatim 
import folium 

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.0.1               |             py_0         575 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    branca-0.4.0               |             py_0          26 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         766 KB

The following NEW packages will be INSTALLED:

    altair:        4.0.1-py_0  conda-forge
    branca:

In [15]:
toronto_global_positioner = Nominatim(user_agent="TO_searcher")
toronto_location = "Toronto, On"
to_latlong = toronto_global_positioner.geocode(toronto_location)
latitude = to_latlong.latitude
longitude = to_latlong.longitude

toronto_map = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lng, borough, neighborhood in zip(toronto_geospace_filtered["Latitude"], toronto_geospace_filtered["Longitude"], toronto_geospace_filtered["Borough"], toronto_geospace_filtered["Neighbourhood"]):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
toronto_map   

In [16]:
CLIENT_ID = ''  # your Foursquare ID
CLIENT_SECRET = ''  # your Foursquare Secret
VERSION = '20180605' # Foursquare API version


In [17]:
def getNearbyVenues(client_id, client_secret, version, names, latitudes, longitudes, limit, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            client_id, 
            client_secret, 
            version, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        toronto_venue_results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in toronto_venue_results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
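
The row-building step inside `getNearbyVenues` can be separated from the network call, which makes it possible to try out without API credentials. This is a sketch only: `flatten_venues` is a hypothetical helper, and the sample `items` list is hand-written to mirror the keys the function above indexes, not an actual API response. It also tolerates a venue with an empty category list, which the list comprehension above would not.

```python
def flatten_venues(name, lat, lng, items):
    """Turn a list of venue items into flat tuples for one neighbourhood."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None instead of failing when a venue has no category.
        category = categories[0]["name"] if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written sample in the assumed shape (second entry is made up).
items = [
    {"venue": {"name": "Glen Manor Ravine",
               "location": {"lat": 43.676821, "lng": -79.293942},
               "categories": [{"name": "Trail"}]}},
    {"venue": {"name": "Uncategorised Spot",
               "location": {"lat": 43.676, "lng": -79.293},
               "categories": []}},
]

rows = flatten_venues("The Beaches", 43.676357, -79.293031, items)
```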

In [18]:
to_neighbourhood_venues_df = getNearbyVenues(CLIENT_ID,CLIENT_SECRET,VERSION,toronto_geospace_filtered["Neighbourhood"], toronto_geospace_filtered["Latitude"], toronto_geospace_filtered["Longitude"], 100,500)
to_neighbourhood_venues_df.head(5)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Glen Stewart Park,43.675278,-79.294647,Park
4,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood


In [19]:
to_neighbourhood_venues_df.shape

(1729, 7)

In [20]:
to_neigh_one_hot_df = pd.get_dummies(to_neighbourhood_venues_df[['Venue Category']], prefix="", prefix_sep="")
to_neigh_one_hot_df['Neighbourhood'] = to_neighbourhood_venues_df['Neighbourhood']
fixed_columns = [to_neigh_one_hot_df.columns[-1]] + list(to_neigh_one_hot_df.columns[:-1])
to_neigh_one_hot_df = to_neigh_one_hot_df[fixed_columns]
to_neigh_one_hot_df.head(5)

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
to_nv_grouped_df = to_neigh_one_hot_df.groupby("Neighbourhood").mean().reset_index()
to_nv_grouped_df.head(5)

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,...,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton,Exhibition Place,Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.0625,0.0625,0.125,0.1875,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
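
The one-hot encoding followed by a grouped mean yields, for each neighbourhood, the share of its venues that fall into each category. A toy sketch makes the arithmetic visible; the neighbourhood names "A" and "B" and the six venue rows are made up for illustration.

```python
import pandas as pd

# Six made-up venues across two made-up neighbourhoods.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Pub", "Pub", "Park", "Trail", "Park", "Park"],
})

# One indicator column per category, one row per venue.
one_hot = pd.get_dummies(venues["Venue Category"], dtype=int)
one_hot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# The mean of a 0/1 column is the fraction of rows equal to 1,
# so each cell becomes that category's share within the neighbourhood.
freq = one_hot.groupby("Neighbourhood").mean().reset_index()

print(freq)
```

Here neighbourhood "A" has two pubs out of four venues, so its `Pub` value is 0.5, and every row sums to 1 across the category columns.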


In [25]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    #print(row_categories_sorted.index.values[0:num_top_venues])
    return row_categories_sorted.index.values[0:num_top_venues]

Labelling each neighbourhood's most frequent venue categories as its 1st, 2nd, 3rd, 4th, ... most common, up to the number of top venues chosen for display.

In [106]:
num_top_venues = 7

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']

for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # past 'rd', every remaining position takes the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
to_nv_sorted = pd.DataFrame(columns=columns)
to_nv_sorted['Neighbourhood'] = to_nv_grouped_df['Neighbourhood']

for ind in np.arange(to_nv_grouped_df.shape[0]):
    to_nv_sorted.iloc[ind, 1:] = return_most_common_venues(to_nv_grouped_df.iloc[ind, :], num_top_venues)

to_nv_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Restaurant,Café,Thai Restaurant,Bar,Sushi Restaurant,Hotel
1,Berczy Park,Coffee Shop,Seafood Restaurant,Café,Bakery,Restaurant,Cocktail Bar,Cheese Shop
2,"Brockton,Exhibition Place,Parkdale Village",Café,Breakfast Spot,Coffee Shop,Grocery Store,Bar,Burrito Place,Restaurant
3,Business Reply Mail Processing Centre 969 Eastern,Yoga Studio,Auto Workshop,Skate Park,Brewery,Smoke Shop,Spa,Burrito Place
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Airport Service,Airport Lounge,Airport Terminal,Boutique,Airport,Airport Food Court,Bar
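
The ranking and the ordinal column names can each be written as a small helper. This is a sketch under assumptions: `ordinal` and `top_categories` are hypothetical names, the frequency values in `row` are invented, and `top_categories` expects a Series holding only the numeric category columns (no `Neighbourhood` entry).

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


def top_categories(freq_row, n):
    """Names of the n categories with the highest frequency, highest first."""
    return freq_row.astype(float).nlargest(n).index.tolist()


# Invented frequencies for a single neighbourhood.
row = pd.Series({"Coffee Shop": 0.10, "Park": 0.02, "Bar": 0.05, "Hotel": 0.0})

columns = ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 4)]
top3 = top_categories(row, 3)
```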


In [33]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Standardizing the frequency data so that every venue category contributes on a comparable scale when KMeans is fit.

In [108]:
scaler = StandardScaler()
to_nv_scaled_df = scaler.fit_transform(to_nv_grouped_df.drop("Neighbourhood", axis = 1))
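
For intuition, standardization subtracts each column's mean and divides by its standard deviation. The NumPy sketch below reproduces that arithmetic on a made-up matrix. It assumes the scaler's default settings (population standard deviation, constant columns left at zero rather than divided by zero); the function name `standardize` is hypothetical.

```python
import numpy as np


def standardize(X):
    """Column-wise (x - mean) / std, leaving constant columns at zero."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)        # population standard deviation (ddof=0)
    std[std == 0] = 1.0        # avoid dividing by zero for constant columns
    return (X - mean) / std


# Made-up frequencies: three neighbourhoods, three venue categories.
X = np.array([
    [0.0, 0.5, 0.0],
    [0.2, 0.5, 0.0],
    [0.4, 0.5, 1.0],
])
Z = standardize(X)
```

After the transform, each non-constant column has mean 0 and standard deviation 1, so a rare category and a common one carry similar weight in the distance computations KMeans relies on.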

In [117]:
kmeans_model = KMeans(n_clusters=5, random_state=0)
kmeans_model.fit(to_nv_scaled_df)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)

In [118]:
to_nv_sorted["Labels"] = kmeans_model.labels_

In [132]:
print(kmeans_model.labels_)

[3 3 1 1 1 1 3 2 1 4 3 0 1 1 3 1 3 1 1 1 3 1 1 3 1 1 1 1 1 1 0 3 3 3 3 1 1
 1 1]


In [119]:
to_nv_sorted.head(5)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,Labels
0,"Adelaide,King,Richmond",Coffee Shop,Restaurant,Café,Thai Restaurant,Bar,Sushi Restaurant,Hotel,3
1,Berczy Park,Coffee Shop,Seafood Restaurant,Café,Bakery,Restaurant,Cocktail Bar,Cheese Shop,3
2,"Brockton,Exhibition Place,Parkdale Village",Café,Breakfast Spot,Coffee Shop,Grocery Store,Bar,Burrito Place,Restaurant,1
3,Business Reply Mail Processing Centre 969 Eastern,Yoga Studio,Auto Workshop,Skate Park,Brewery,Smoke Shop,Spa,Burrito Place,1
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Airport Service,Airport Lounge,Airport Terminal,Boutique,Airport,Airport Food Court,Bar,1


In [120]:
from matplotlib.pyplot import cm
from matplotlib import colors

In [135]:
# one distinct hex colour per cluster (5 clusters)
color_list = cm.rainbow(np.linspace(0, 1, 5))
rainbow = [colors.rgb2hex(i) for i in color_list]

In [136]:
# .copy() avoids SettingWithCopyWarning when the new column is added
toronto_geospace_labeled = toronto_geospace_filtered.copy()
# NOTE: labels_ follows the row order of to_nv_grouped_df (alphabetical by Neighbourhood),
# so this positional assignment matches labels to rows only if both frames share that order
toronto_geospace_labeled["Label"] = kmeans_model.labels_
toronto_geospace_labeled.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Label
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,3
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,3
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,1
43,M4M,East Toronto,Studio District,43.659526,-79.340923,1
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,1
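
Because the cluster labels come out in the row order of the grouped frequency table, attaching them to the geospatial frame by neighbourhood name is safer than attaching them by position. The following is a sketch on toy data: the label values are invented, and `geo`, `labels` and `labelled` are illustrative names.

```python
import pandas as pd

# Geospatial rows in postcode order (coordinates as shown in this lab).
geo = pd.DataFrame({
    "Neighbourhood": ["The Beaches", "Studio District", "Lawrence Park"],
    "Latitude": [43.676357, 43.659526, 43.72802],
    "Longitude": [-79.293031, -79.340923, -79.38879],
})

# Labels in alphabetical neighbourhood order, as a grouped table would give.
# The label values themselves are made up.
labels = pd.DataFrame({
    "Neighbourhood": ["Lawrence Park", "Studio District", "The Beaches"],
    "Label": [2, 1, 0],
})

# A left merge keeps geo's row order and looks each label up by name.
labelled = geo.merge(labels, on="Neighbourhood", how="left")

print(labelled)
```

A neighbourhood for which no venues were returned would be absent from the label table and would show up here with a missing `Label`, which makes such gaps visible instead of silently shifting every later label.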


In [138]:
for cluster in range(0,5): 
    colour_group = folium.FeatureGroup(name='<span style="color: {0};">{1}</span>'.format(rainbow[cluster-1],cluster))
    for lat, lng, neighbourhood, borough, cluster_label in zip(toronto_geospace_labeled['Latitude'], toronto_geospace_labeled['Longitude'], toronto_geospace_labeled['Neighbourhood'], toronto_geospace_labeled["Borough"], toronto_geospace_labeled['Label']):
        label = '{}, {}'.format(neighbourhood, borough)
        label = folium.Popup(label, parse_html=True)
        if int(cluster_label) == cluster: 
            folium.CircleMarker(
                (lat, lng),
                radius=5,
                popup=label,
                color=rainbow[cluster-1],
                fill=True,
                fill_color=rainbow[cluster-1],
                fill_opacity=0.7).add_to(colour_group)
    colour_group.add_to(toronto_map)
toronto_map    

In [139]:
to_nv_sorted2 = to_nv_sorted[to_nv_sorted["Labels"]==3]
to_nv_sorted2.head(15)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,Labels
0,"Adelaide,King,Richmond",Coffee Shop,Restaurant,Café,Thai Restaurant,Bar,Sushi Restaurant,Hotel,3
1,Berczy Park,Coffee Shop,Seafood Restaurant,Café,Bakery,Restaurant,Cocktail Bar,Cheese Shop,3
6,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Burger Joint,Juice Bar,Ice Cream Shop,Japanese Restaurant,3
10,"Commerce Court,Victoria Hotel",Coffee Shop,Restaurant,Café,Hotel,American Restaurant,Gym,Deli / Bodega,3
14,"Design Exchange,Toronto Dominion Centre",Coffee Shop,Café,Restaurant,Hotel,Bakery,Gastropub,Seafood Restaurant,3
16,"First Canadian Place,Underground city",Coffee Shop,Café,Restaurant,Gastropub,Seafood Restaurant,Steakhouse,Gym,3
20,"Harbourfront East,Toronto Islands,Union Station",Coffee Shop,Aquarium,Café,Hotel,Restaurant,Scenic Lookout,Italian Restaurant,3
23,"Little Portugal,Trinity",Bar,Coffee Shop,Asian Restaurant,Restaurant,Wine Bar,Vietnamese Restaurant,Café,3
31,"Ryerson,Garden District",Coffee Shop,Clothing Store,Café,Japanese Restaurant,Bubble Tea Shop,Middle Eastern Restaurant,Cosmetics Shop,3
32,St. James Town,Coffee Shop,Café,Restaurant,Hotel,Diner,Beer Bar,Cosmetics Shop,3
