# IBM Applied Data Science Capstone Projcet

### Week 3 Part 1 : Web scraping for Toronto neighborhood and build a clean dataframe
* Build a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name in Toronto.
* Scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.

### 1. Import libraries and all other dependencies

In [4]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

!conda install -c conda-forge bs4 --yes
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print("Libraries imported.")

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - bs4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    beautifulsoup4-4.9.0       |   py36h9f0ad1d_0         160 KB  conda-forge
    bs4-4.9.0                  |                0           4 KB  conda-forge
    soupsieve-1.9.4            |   py36h9f0ad1d_1          58 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         222 KB

The following NEW packages will be INSTALLED:

  beautifulsoup4     conda-forge/linux-64::beautifulsoup4-4.9.0-py36h9f0ad1d_0
  bs4                conda-forge/

### 2. Scrap data from Wikipedia page into a DataFrame

In [5]:
# send the GET request
data = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [6]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [7]:
# create three lists to store table data
postalCodeList = []
boroughList = []
neighborhoodList = []

In [8]:
# append the data into the respective lists
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    if(len(cells) > 0):
        postalCodeList.append(cells[0].text.rstrip('\n'))
        boroughList.append(cells[1].text.rstrip('\n'))
        neighborhoodList.append(cells[2].text.rstrip('\n')) # avoid new lines in neighborhood cell

### 3. Create a panda dataframe

In [9]:
# create a new DataFrame from the three lists
toronto_df = pd.DataFrame({"PostalCode": postalCodeList,
                           "Borough": boroughList,
                           "Neighborhood": neighborhoodList})

toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


### 4. Drop cells with a borough that is "Not assigned"

In [10]:
# drop cells with a borough that is Not assigned
toronto_df_dropna = toronto_df[toronto_df.Borough != "Not assigned"].reset_index(drop=True)
toronto_df_dropna.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


### 5. Group neighborhoods in the same borough separated with a comma

In [11]:
# group neighborhoods in the same borough
toronto_df_grouped = toronto_df_dropna.groupby(["PostalCode", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
toronto_df_grouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


### 6. For  Neighborhood = Not assigned, make the neighborhood will be the same as the borough.

In [12]:
# for Neighborhood="Not assigned", make the value the same as Borough
for index, row in toronto_df_grouped.iterrows():
    if row["Neighborhood"] == "Not assigned":
        row["Neighborhood"] = row["Borough"]
        
toronto_df_grouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


### 7. Compare the dataframe with the expected datafarme

In [13]:
# create a new test dataframe
column_names = ["PostalCode", "Borough", "Neighborhood"]
test_df = pd.DataFrame(columns=column_names)

test_list = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

for postcode in test_list:
    test_df = test_df.append(toronto_df_grouped[toronto_df_grouped["PostalCode"]==postcode], ignore_index=True)
    
test_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,Parkview Hill / Woodbine Gardens
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,Wexford / Maryvale
7,M9V,Etobicoke,South Steeles / Silverstone / Humbergate / Jam...
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,CN Tower / King and Spadina / Railway Lands / ...


### 8. Print the number of rows in cleaned dataframe

In [14]:
# print the number of rows of the cleaned dataframe
toronto_df_grouped.shape

(103, 3)

### Week 3 Part 2: Getting coordinates and add to the Toronto DataFrame
* Get the geographical coordinates of the neighborhoods in Toronto.
* After building a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

### 9.Load the coordinates from the csv file

In [15]:
!wget -q -O "toronto_coordinates.csv" http://cocl.us/Geospatial_data
print('Coordinates downloaded!')
coordinates = pd.read_csv('toronto_coordinates.csv')

Coordinates downloaded!


In [16]:
print(coordinates.shape)
coordinates.head()

(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [17]:
# rename the column "PostalCode"
coordinates.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
coordinates.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### 10. Merge two dataframes to get the coordinates

In [18]:
# merge two table on the column "PostalCode"
toronto_df_new = toronto_df_grouped.merge(coordinates, on="PostalCode", how="left")
toronto_df_new.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


### 11. Finally, check to make sure the coordinates are added as required by the question

In [19]:
# create a new test dataframe
column_names = ["PostalCode", "Borough", "Neighborhood", "Latitude", "Longitude"]
test_df = pd.DataFrame(columns=column_names)

test_list = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

for postcode in test_list:
    test_df = test_df.append(toronto_df_new[toronto_df_new["PostalCode"]==postcode], ignore_index=True)
    
test_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,Parkview Hill / Woodbine Gardens,43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,Wexford / Maryvale,43.750072,-79.295849
7,M9V,Etobicoke,South Steeles / Silverstone / Humbergate / Jam...,43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,CN Tower / King and Spadina / Railway Lands / ...,43.628947,-79.39442


### Week 3 Part 3: Explore and cluster the neighborhoods in Toronto

### 12. Use geopy library to get the latitude and longitude values of Toronto

In [20]:
address = 'Toronto'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.6534817, -79.3839347.


### 13. Create a map of Toronto with neighborhoods superimposed on top

In [21]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df_new['Latitude'], toronto_df_new['Longitude'], toronto_df_new['Borough'], toronto_df_new['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

### 14.Filter only boroughs that contain the word Toronto

In [22]:
# filter borough names that contain the word Toronto
borough_names = list(toronto_df_new.Borough.unique())

borough_with_toronto = []

for x in borough_names:
    if "toronto" in x.lower():
        borough_with_toronto.append(x)
        
borough_with_toronto

['East Toronto', 'Central Toronto', 'Downtown Toronto', 'West Toronto']

In [23]:
# create a new DataFrame with only boroughs that contain the word Toronto
toronto_df_new = toronto_df_new[toronto_df_new['Borough'].isin(borough_with_toronto)].reset_index(drop=True)
print(toronto_df_new.shape)
toronto_df_new.head(10)

(39, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,The Danforth West / Riverdale,43.679557,-79.352188
2,M4L,East Toronto,India Bazaar / The Beaches West,43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,Moore Park / Summerhill East,43.689574,-79.38316
9,M4V,Central Toronto,Summerhill West / Rathnelly / South Hill / For...,43.686412,-79.400049


In [24]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df_new['Latitude'], toronto_df_new['Longitude'], toronto_df_new['Borough'], toronto_df_new['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

### 15. Use the Foursquare API to explore the neighborhoods

In [25]:
# define Foursquare Credentials and Version
CLIENT_ID = 'MSJPRISQSWZYGK3GCMT53RYSCNKZB3IEPBMAC1NBAXJSX4XL' # your Foursquare ID
CLIENT_SECRET = '5UQTFM4LIXSDXGKBQJXQOMEZZRKESNI4QQFMNBOCAOVLI0AD' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: MSJPRISQSWZYGK3GCMT53RYSCNKZB3IEPBMAC1NBAXJSX4XL
CLIENT_SECRET:5UQTFM4LIXSDXGKBQJXQOMEZZRKESNI4QQFMNBOCAOVLI0AD


#### Now, let's get the top 100 venues that are within a radius of 500 meters.

In [47]:
radius= 500
LIMIT = 100

venues = []

for lat, long, post, borough, neighborhood in zip(toronto_df_new['Latitude'], toronto_df_new['Longitude'], toronto_df_new['PostalCode'], toronto_df_new['Borough'], toronto_df_new['Neighborhood']):
    url = "https://api.foursquare.com/v2/venues/explore?client_id=MSJPRISQSWZYGK3GCMT53RYSCNKZB3IEPBMAC1NBAXJSX4XL&client_secret=5UQTFM4LIXSDXGKBQJXQOMEZZRKESNI4QQFMNBOCAOVLI0AD&v=20180605&ll=43.6534817,-79.3839347&radius=500&limit=100".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in results:
        venues.append((
            post, 
            borough,
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

In [48]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['PostalCode', 'Borough', 'Neighborhood', 'BoroughLatitude', 'BoroughLongitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(2769, 9)


Unnamed: 0,PostalCode,Borough,Neighborhood,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,Downtown Toronto,43.653232,-79.385296,Neighborhood
1,M4E,East Toronto,The Beaches,43.676357,-79.293031,Nathan Phillips Square,43.65227,-79.383516,Plaza
2,M4E,East Toronto,The Beaches,43.676357,-79.293031,Eggspectation Bell Trinity Square,43.653144,-79.38198,Breakfast Spot
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,Japango,43.655268,-79.385165,Sushi Restaurant
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,Indigo,43.653515,-79.380696,Bookstore


#### Let's check how many venues were returned for each PostalCode

In [49]:
venues_df.groupby(['PostalCode', 'Borough','Neighborhood']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
PostalCode,Borough,Neighborhood,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
M4E,East Toronto,The Beaches,71,71,71,71,71,71
M4K,East Toronto,The Danforth West / Riverdale,71,71,71,71,71,71
M4L,East Toronto,India Bazaar / The Beaches West,71,71,71,71,71,71
M4M,East Toronto,Studio District,71,71,71,71,71,71
M4N,Central Toronto,Lawrence Park,71,71,71,71,71,71
M4P,Central Toronto,Davisville North,71,71,71,71,71,71
M4R,Central Toronto,North Toronto West,71,71,71,71,71,71
M4S,Central Toronto,Davisville,71,71,71,71,71,71
M4T,Central Toronto,Moore Park / Summerhill East,71,71,71,71,71,71
M4V,Central Toronto,Summerhill West / Rathnelly / South Hill / Forest Hill SE / Deer Park,71,71,71,71,71,71


#### Let's find out how many unique categories can be curated from all the returned venues

In [50]:
print('There are {} uniques categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 50 uniques categories.


In [51]:
venues_df['VenueCategory'].unique()[:50]

array(['Neighborhood', 'Plaza', 'Breakfast Spot', 'Sushi Restaurant',
       'Bookstore', 'Cosmetics Shop', 'Bubble Tea Shop', 'Shopping Mall',
       'Restaurant', 'Coffee Shop', 'Concert Hall',
       'Fast Food Restaurant', 'Clothing Store', 'Ramen Restaurant',
       'Vegetarian / Vegan Restaurant', 'Theater', 'Hotel',
       'Japanese Restaurant', 'American Restaurant',
       'Furniture / Home Store', 'Seafood Restaurant', 'Opera House',
       'Comic Shop', 'Tanning Salon', 'Bar', 'Gastropub', 'Tea Room',
       'Modern European Restaurant', "Women's Store",
       'New American Restaurant', 'Steakhouse', 'Bank', 'Music Venue',
       'Latin American Restaurant', 'General Travel',
       'Gym / Fitness Center', 'Ice Cream Shop', 'Diner', 'Café',
       'Middle Eastern Restaurant', 'Juice Bar', 'Mexican Restaurant',
       'Shoe Store', 'Colombian Restaurant', 'Video Game Store',
       'Thai Restaurant', 'Movie Theater', 'Vietnamese Restaurant',
       'Salad Place', 'Cocktail B

### 16. Analyze Each Area

In [52]:
# one hot encoding
toronto_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add postal, borough and neighborhood column back to dataframe
toronto_onehot['PostalCode'] = venues_df['PostalCode'] 
toronto_onehot['Borough'] = venues_df['Borough'] 
toronto_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move postal, borough and neighborhood column to the first column
fixed_columns = list(toronto_onehot.columns[-3:]) + list(toronto_onehot.columns[:-3])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

(2769, 53)


Unnamed: 0,PostalCode,Borough,Neighborhoods,American Restaurant,Bank,Bar,Bookstore,Breakfast Spot,Bubble Tea Shop,Café,Clothing Store,Cocktail Bar,Coffee Shop,Colombian Restaurant,Comic Shop,Concert Hall,Cosmetics Shop,Diner,Fast Food Restaurant,Furniture / Home Store,Gastropub,General Travel,Gym / Fitness Center,Hotel,Ice Cream Shop,Japanese Restaurant,Juice Bar,Latin American Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Movie Theater,Music Venue,Neighborhood,New American Restaurant,Opera House,Plaza,Ramen Restaurant,Restaurant,Salad Place,Seafood Restaurant,Shoe Store,Shopping Mall,Steakhouse,Sushi Restaurant,Tanning Salon,Tea Room,Thai Restaurant,Theater,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Women's Store
0,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,M4E,East Toronto,The Beaches,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
4,M4E,East Toronto,The Beaches,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [53]:
toronto_grouped = toronto_onehot.groupby(["PostalCode", "Borough", "Neighborhoods"]).mean().reset_index()

print(toronto_grouped.shape)
toronto_grouped

(39, 53)


Unnamed: 0,PostalCode,Borough,Neighborhoods,American Restaurant,Bank,Bar,Bookstore,Breakfast Spot,Bubble Tea Shop,Café,Clothing Store,Cocktail Bar,Coffee Shop,Colombian Restaurant,Comic Shop,Concert Hall,Cosmetics Shop,Diner,Fast Food Restaurant,Furniture / Home Store,Gastropub,General Travel,Gym / Fitness Center,Hotel,Ice Cream Shop,Japanese Restaurant,Juice Bar,Latin American Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Movie Theater,Music Venue,Neighborhood,New American Restaurant,Opera House,Plaza,Ramen Restaurant,Restaurant,Salad Place,Seafood Restaurant,Shoe Store,Shopping Mall,Steakhouse,Sushi Restaurant,Tanning Salon,Tea Room,Thai Restaurant,Theater,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Women's Store
0,M4E,East Toronto,The Beaches,0.042254,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.084507,0.014085,0.070423,0.014085,0.014085,0.014085,0.028169,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.042254,0.014085,0.028169,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085
1,M4K,East Toronto,The Danforth West / Riverdale,0.042254,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.084507,0.014085,0.070423,0.014085,0.014085,0.014085,0.028169,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.042254,0.014085,0.028169,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085
2,M4L,East Toronto,India Bazaar / The Beaches West,0.042254,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.084507,0.014085,0.070423,0.014085,0.014085,0.014085,0.028169,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.042254,0.014085,0.028169,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085
3,M4M,East Toronto,Studio District,0.042254,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.084507,0.014085,0.070423,0.014085,0.014085,0.014085,0.028169,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.042254,0.014085,0.028169,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085
4,M4N,Central Toronto,Lawrence Park,0.042254,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.084507,0.014085,0.070423,0.014085,0.014085,0.014085,0.028169,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.042254,0.014085,0.028169,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085
5,M4P,Central Toronto,Davisville North,0.042254,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.084507,0.014085,0.070423,0.014085,0.014085,0.014085,0.028169,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.042254,0.014085,0.028169,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085
6,M4R,Central Toronto,North Toronto West,0.042254,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.084507,0.014085,0.070423,0.014085,0.014085,0.014085,0.028169,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.042254,0.014085,0.028169,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085
7,M4S,Central Toronto,Davisville,0.042254,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.084507,0.014085,0.070423,0.014085,0.014085,0.014085,0.028169,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.042254,0.014085,0.028169,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085
8,M4T,Central Toronto,Moore Park / Summerhill East,0.042254,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.084507,0.014085,0.070423,0.014085,0.014085,0.014085,0.028169,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.042254,0.014085,0.028169,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085
9,M4V,Central Toronto,Summerhill West / Rathnelly / South Hill / For...,0.042254,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.084507,0.014085,0.070423,0.014085,0.014085,0.014085,0.028169,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.042254,0.014085,0.028169,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085,0.028169,0.014085,0.014085,0.014085,0.014085


#### Now let's create the new dataframe and display the top 10 venues for each PostalCode.

In [54]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
areaColumns = ['PostalCode', 'Borough', 'Neighborhoods']
freqColumns = []
for ind in np.arange(num_top_venues):
    try:
        freqColumns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        freqColumns.append('{}th Most Common Venue'.format(ind+1))
columns = areaColumns+freqColumns

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']
neighborhoods_venues_sorted['Borough'] = toronto_grouped['Borough']
neighborhoods_venues_sorted['Neighborhoods'] = toronto_grouped['Neighborhoods']

for ind in np.arange(toronto_grouped.shape[0]):
    row_categories = toronto_grouped.iloc[ind, :].iloc[3:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    neighborhoods_venues_sorted.iloc[ind, 3:] = row_categories_sorted.index.values[0:num_top_venues]

# neighborhoods_venues_sorted.sort_values(freqColumns, inplace=True)
print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted

(39, 13)


Unnamed: 0,PostalCode,Borough,Neighborhoods,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
1,M4K,East Toronto,The Danforth West / Riverdale,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
2,M4L,East Toronto,India Bazaar / The Beaches West,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
3,M4M,East Toronto,Studio District,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
4,M4N,Central Toronto,Lawrence Park,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
5,M4P,Central Toronto,Davisville North,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
6,M4R,Central Toronto,North Toronto West,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
7,M4S,Central Toronto,Davisville,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
8,M4T,Central Toronto,Moore Park / Summerhill East,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
9,M4V,Central Toronto,Summerhill West / Rathnelly / South Hill / For...,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel


### 17. Cluster Areas : Run k-means to cluster the Toronto areas into 5 clusters.

In [55]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(["PostalCode", "Borough", "Neighborhoods"], 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

  return_n_iter=True)


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [56]:
# create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
toronto_merged = toronto_df_new.copy()

# add clustering labels
toronto_merged["Cluster Labels"] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.drop(["Borough", "Neighborhoods"], 1).set_index("PostalCode"), on="PostalCode")

print(toronto_merged.shape)
toronto_merged.head() # check the last columns!

(39, 16)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
1,M4K,East Toronto,The Danforth West / Riverdale,43.679557,-79.352188,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
2,M4L,East Toronto,India Bazaar / The Beaches West,43.668999,-79.315572,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel


In [57]:
# sort the results by Cluster Labels
print(toronto_merged.shape)
toronto_merged.sort_values(["Cluster Labels"], inplace=True)
toronto_merged

(39, 16)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
21,M5L,Downtown Toronto,Commerce Court / Victoria Hotel,43.648198,-79.379817,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
22,M5N,Central Toronto,Roselawn,43.711695,-79.416936,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
23,M5P,Central Toronto,Forest Hill North & West,43.696948,-79.411307,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
24,M5R,Central Toronto,The Annex / North Midtown / Yorkville,43.67271,-79.405678,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
25,M5S,Downtown Toronto,University of Toronto / Harbord,43.662696,-79.400049,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
26,M5T,Downtown Toronto,Kensington Market / Chinatown / Grange Park,43.653206,-79.400049,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
27,M5V,Downtown Toronto,CN Tower / King and Spadina / Railway Lands / ...,43.628947,-79.39442,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
20,M5K,Downtown Toronto,Toronto Dominion Centre / Design Exchange,43.647177,-79.381576,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
28,M5W,Downtown Toronto,Stn A PO Boxes,43.646435,-79.374846,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel


#### Finally, let's visualize the resulting clusters

In [58]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, post, bor, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Borough'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    #label = folium.Popup('{} ({}): {} - Cluster {}'.format(bor, post, poi, cluster), parse_html=True)
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 17. Analyse the Clusters

#### Cluster 1

In [59]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
21,Downtown Toronto,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
22,Central Toronto,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
23,Central Toronto,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
24,Central Toronto,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
25,Downtown Toronto,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
26,Downtown Toronto,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
27,Downtown Toronto,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
20,Downtown Toronto,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel
28,Downtown Toronto,0,Clothing Store,Coffee Shop,Restaurant,American Restaurant,Theater,Seafood Restaurant,Diner,Cosmetics Shop,Plaza,Hotel


#### Cluster 2 

In [60]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


#### Cluster 3

In [64]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


#### Cluster 4

In [65]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


#### Cluster 5

In [66]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
